---
title: "Variable Clusters"
author: "srivera"
date: "11/6/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(scales)
library(factoextra)
library(NbClust)
source('~/Projects/dataninja/src/main/R/lib/ggplot_number_format.R')
spotx_colors <- c("#8ec641", "#1b9dd0", "#6b6c6f", "black", "#74a363", "#1f4389", "darkgrey")
```
# <<Variable>> Clusters
It is useful to cluster high-cardinality variables for training models. Here I cluster on average bid value, average number of bids per auction, and coverage rate; these metrics have been calculated using 1% of a day's data. The first step is to determine the best number of clusters to use. There are a number of methods for doing so, the three most popular being the elbow method, the silhouette method, and the gap statistic. Given how subjectively the results can be interpreted, I have used all three methods and gone with the majority vote when the methods disagree.
## Choosing the number of clusters
The following three figures show that <<two out of three>> methods indicate that the optimal number of clusters is <<2>>.
```{r, echo=FALSE}
df <- read.csv("~/Projects/Hyperbolic/dynamic_price_floors/channel_segmentation/cluster_data/publisher_cluster_data.csv", header = FALSE)
names(df) <- c("publisher_id","avg_bid_value","avg_num_bids","coverage_rate")
df$publisher_id <- as.factor(df$publisher_id)
# Standardize the data
sdf <- scale(df[,c("avg_bid_value","avg_num_bids","coverage_rate")])
# Elbow method
fviz_nbclust(sdf, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow method")
```
```{r,echo=FALSE}
# Silhouette method
fviz_nbclust(sdf, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
```
```{r, echo=FALSE}
# Gap statistic
# nboot = 500 as recommended for the final analysis; drop to ~50 for a quicker run.
# verbose = FALSE hides the computing progression.
set.seed(123)
fviz_nbclust(sdf, kmeans, nstart = 25, method = "gap_stat", nboot = 500,verbose = FALSE) +
labs(subtitle = "Gap statistic method")
```
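As a cross-check on the three figures above, the `NbClust` package (loaded in the setup chunk) can compute a large set of cluster-validity indices in one call and report the number of clusters preferred by the majority of them. A minimal sketch, not evaluated here because it can be slow; it assumes the standardized matrix `sdf` built in the first chunk:
```{r, eval=FALSE}
# Majority vote across ~30 validity indices for k-means partitions of 2-10 clusters.
nb <- NbClust(data = sdf, distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "kmeans", index = "all")
# Best.nc holds the number of clusters proposed by each index;
# the printed summary includes the majority-vote recommendation.
nb$Best.nc
```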
```{r pubs, echo=FALSE}
set.seed(123)
# Fit k-means on the standardized metrics and attach the labels to the raw data
cl <- kmeans(sdf, centers = 3, nstart = 25)
df$cluster <- as.factor(cl$cluster)
ggplot(df, aes(avg_bid_value, avg_num_bids, color = cluster)) +
  geom_point()
```
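To overlay the fitted centers on these scatter plots, note that `cl$centers` is expressed in standardized units, so it has to be mapped back to the original scale first. A minimal sketch using the centering and scaling attributes stored by `scale()`; not evaluated here:
```{r, eval=FALSE}
# Back-transform the standardized centers to the original units using the
# attributes that scale() stored on sdf.
centers <- as.data.frame(
  sweep(sweep(cl$centers, 2, attr(sdf, "scaled:scale"), `*`),
        2, attr(sdf, "scaled:center"), `+`)
)
ggplot(df, aes(avg_bid_value, avg_num_bids, color = cluster)) +
  geom_point() +
  geom_point(data = centers, aes(avg_bid_value, avg_num_bids),
             color = "black", shape = 4, size = 4, inherit.aes = FALSE)
```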
```{r pubs2, echo=FALSE}
ggplot(df, aes(avg_bid_value, coverage_rate, color = cluster)) +
  geom_point()
```
```{r pubs3, echo=FALSE}
ggplot(df, aes(avg_num_bids, coverage_rate, color = cluster)) +
  geom_point()
```
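Since the point of the exercise is to use these clusters as a lower-cardinality feature when training models, the resulting mapping can be written out for the training pipeline. A minimal sketch; the output path is a placeholder, not part of the original analysis:
```{r, eval=FALSE}
# Persist the id -> cluster mapping for downstream feature engineering.
write.csv(df[, c("publisher_id", "cluster")],
          "publisher_cluster_assignments.csv", row.names = FALSE)
```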