---
title: "Variable Clusters"
author: "srivera"
date: "11/6/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(scales)
library(factoextra)
library(NbClust)
source('~/Projects/dataninja/src/main/R/lib/ggplot_number_format.R')
spotx_colors <- c("#8ec641", "#1b9dd0", "#6b6c6f", "black", "#74a363", "#1f4389", "darkgrey")
```
# <<Variable>> Clusters
It is useful to cluster high-cardinality variables for training models. Here I cluster on average bid value, average number of bids per auction, and coverage rate; these metrics have been calculated using 1% of a day's data. The first step is to determine the best number of clusters to use. There are a number of methods for doing so, the three most popular being the elbow method, the silhouette method, and the gap statistic. Given how subjectively the results can be interpreted, I have used all three methods and gone with the majority vote when the methods disagree.
## Choosing the number of clusters
The following three figures show that <<two out of three>> methods indicate that the optimal number of clusters is <<2>>.
```{r, echo=FALSE}
df <- read.csv("~/Projects/Hyperbolic/dynamic_price_floors/channel_segmentation/cluster_data/publisher_cluster_data.csv", header = FALSE)
names(df) <- c("publisher_id","avg_bid_value","avg_num_bids","coverage_rate")
df$publisher_id <- as.factor(df$publisher_id)
# Standardize the data
sdf <- scale(df[,c("avg_bid_value","avg_num_bids","coverage_rate")])
# Elbow method
fviz_nbclust(sdf, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow method")
```
```{r,echo=FALSE}
# Silhouette method
fviz_nbclust(sdf, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
```
```{r, echo=FALSE}
# Gap statistic
# nboot = 500 as recommended for the final analysis; drop to ~50 for a quicker run.
# verbose = FALSE hides the computing progression.
set.seed(123)
fviz_nbclust(sdf, kmeans, nstart = 25, method = "gap_stat", nboot = 500,verbose = FALSE) +
labs(subtitle = "Gap statistic method")
```
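As a cross-check on the three figures above, the `NbClust` package (loaded in the setup chunk) can compute a large set of cluster-validity indices in one call and report the number of clusters preferred by the majority of them. A minimal sketch, not evaluated here because it can be slow; it assumes the standardized matrix `sdf` built in the first chunk:
```{r, eval=FALSE}
# Majority vote across ~30 validity indices for k-means partitions of 2-10 clusters.
nb <- NbClust(data = sdf, distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "kmeans", index = "all")
# Best.nc holds the number of clusters proposed by each index;
# the printed summary includes the majority-vote recommendation.
nb$Best.nc
```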
```{r pubs, echo=FALSE}
set.seed(123)
# Fit k-means on the standardized metrics and attach the labels to the raw data
cl <- kmeans(sdf, centers = 3, nstart = 25)
df$cluster <- as.factor(cl$cluster)
ggplot(df, aes(avg_bid_value, avg_num_bids, color = cluster)) +
  geom_point()
```
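To overlay the fitted centers on these scatter plots, note that `cl$centers` is expressed in standardized units, so it has to be mapped back to the original scale first. A minimal sketch using the centering and scaling attributes stored by `scale()`; not evaluated here:
```{r, eval=FALSE}
# Back-transform the standardized centers to the original units using the
# attributes that scale() stored on sdf.
centers <- as.data.frame(
  sweep(sweep(cl$centers, 2, attr(sdf, "scaled:scale"), `*`),
        2, attr(sdf, "scaled:center"), `+`)
)
ggplot(df, aes(avg_bid_value, avg_num_bids, color = cluster)) +
  geom_point() +
  geom_point(data = centers, aes(avg_bid_value, avg_num_bids),
             color = "black", shape = 4, size = 4, inherit.aes = FALSE)
```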
```{r pubs2, echo=FALSE}
ggplot(df, aes(avg_bid_value, coverage_rate, color = cluster)) +
  geom_point()
```
```{r pubs3, echo=FALSE}
ggplot(df, aes(avg_num_bids, coverage_rate, color = cluster)) +
  geom_point()
```
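Since the point of the exercise is to use these clusters as a lower-cardinality feature when training models, the resulting mapping can be written out for the training pipeline. A minimal sketch; the output path is a placeholder, not part of the original analysis:
```{r, eval=FALSE}
# Persist the id -> cluster mapping for downstream feature engineering.
write.csv(df[, c("publisher_id", "cluster")],
          "publisher_cluster_assignments.csv", row.names = FALSE)
```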