---
title: "Pokemons Classification based on Characteristics"
author: "Imen Bouzidi"
date: "11/3/2019"
output:
  html_document:
    theme: united
    highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data Set:
The Pokemon dataset contains information on a total of 801 Pokemon.
It includes:
* The English name of the Pokemon
* percentage_male: The percentage of the species that are male. Blank if the Pokemon is genderless.
* height_m: Height of the Pokemon in metres
* weight_kg: The Weight of the Pokemon in kilograms
* hp: The Base HP of the Pokemon
* attack: The Base Attack of the Pokemon
* defense: The Base Defense of the Pokemon
* speed: The Base Speed of the Pokemon
* base_egg_steps: The number of steps required to hatch an egg of the Pokemon
* base_happiness: Base Happiness of the Pokemon
* capture_rate: Capture Rate of the Pokemon
* experience_growth: The Experience Growth of the Pokemon
* sp_attack: The Base Special Attack of the Pokemon
* sp_defense: The Base Special Defense of the Pokemon
* generation: The numbered generation in which the Pokemon was first introduced
* is_legendary: Denotes whether the Pokemon is legendary (two major classes: legendary Pokemons and non-legendary Pokemons)
The data were downloaded from: https://www.kaggle.com/rounakbanik/pokemon/data
## Packages needed:
```{r loadlib, echo=T, results='hide', message=F, warning=F}
library('DT')
library('missMDA')
library('NbClust')
library('FactoMineR')
library('factoextra')
library('fossil')
library('corrplot')
library('plotly')
library('kohonen')
library('mclust')
```
```{r data}
D=read.csv('Pokemon.csv',header = T,row.names = 1) # Read the Pokemon dataset; the first column is used as row names
datatable(D, rownames = 1, filter="top", options = list(pageLength = 5, scrollX=T) )
summary(D)
```
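Some characteristics (for instance height_m, weight_kg and percentage_male) contain missing values, which is why the PCA step below starts with an imputation. A quick check, assuming blanks were read in as NA:

```{r nacheck}
# Number of missing values per column
colSums(is.na(D))
```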
## PCA on Characteristics of Pokemons:
The variables from height_m to sp_defense are the quantitative variables used for the PCA.
The supplementary qualitative variables are 'generation' and 'is_legendary'.
```{r pca }
impData=imputePCA(D[,3:14], ncp = 2, scale = TRUE, method = c("Regularized","EM")) # Impute missing data
ImputData=data.frame(cbind(impData[["completeObs"]],D[,15:16])) # Combine the imputed quantitative variables with the two qualitative variables
res.pca = PCA(ImputData[,1:14], graph = FALSE,quali.sup = c(13,14)) # PCA with columns 13 and 14 (generation, is_legendary) as supplementary qualitative variables
head(res.pca$eig,4)
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
```
In our case, we study the first two dimensions (although keeping three would be the optimal choice, with a cumulative variance of 60.44% and eigenvalues > 1).
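As a quick check of that choice, the dimensions retained by the Kaiser criterion (eigenvalue > 1) and their cumulative variance can be listed directly from res.pca:

```{r eigcheck}
# Dimensions with an eigenvalue above 1 (Kaiser criterion)
res.pca$eig[res.pca$eig[, "eigenvalue"] > 1, ]
```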
### Visualizing the first 2 dimensions:
```{r carte}
corrplot(res.pca$var$cos2, is.corr=FALSE)
fviz_pca_var(res.pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = T
)
```
From the plots above (the corrplot and the variables plot), the first dimension mainly represents the variables height_m, weight_kg, hp, attack, capture_rate (negatively), sp_attack and sp_defense, while the second dimension mainly represents base_happiness and, negatively, experience_growth.
The Pokemons will be projected on the factor map after clustering (see the plot after the k-means conclusion).
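To back up this reading of the axes, the dimdesc() function of FactoMineR lists the variables most significantly correlated with each dimension; a short sketch:

```{r dimdesc}
# Description of the first two dimensions by their correlated variables
desc = dimdesc(res.pca, axes = 1:2)
desc$Dim.1$quanti
desc$Dim.2$quanti
```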
## Optimal number of clusters:
NbClust will be used to determine the optimal number of clusters for k-means and for hierarchical clustering; the code is below:
```{r Validation }
X=scale(ImputData[,1:12]) # Contains the scaled characteristics of each Pokemon
res_nbclust<-NbClust(X,min.nc = 2, max.nc = 20, index="silhouette",method = "kmeans")
res_nbclust$All.index
```
In what follows, 3 is considered the optimal number of clusters, with a silhouette index of 0.2342.
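A simple way to visualise this choice is to plot the silhouette index returned by NbClust against each candidate number of clusters:

```{r silplot}
# Silhouette index for 2 to 20 clusters
plot(2:20, res_nbclust$All.index, type = "b",
     xlab = "Number of clusters", ylab = "Silhouette index")
```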
## Clustering Using kmeans:
```{r kmeans}
km=kmeans(X,3)
fviz_cluster(list(data = X, cluster = km$cluster), geom = "point", stand = FALSE )+
scale_colour_manual(values = c("#ffa64d","#00b33c", "#ff3333"))+
scale_fill_manual(values = c("#ffa64d","#00b33c", "#ff3333"))
```
Group 1 is mainly characterised by high scores in every characteristic of dimension 1 (height, weight, hp, attack, sp_attack and sp_defense). The Pokemons of Group 1 seem to be the most powerful ones, but they do not have high values of base_happiness.
Group 2 is characterised by lower values on dimension 1 but higher values on dimension 2 (higher base_happiness than the other groups, see the radar plot below).
Group 3 has the lowest values on both dimensions but a higher capture_rate than the other Pokemons.
```{r plot}
Centers.km=data.frame(km$centers) #Contains the centroids of each class
datatable(Centers.km, filter="top", options = list(pageLength = 3, scrollX=T) )
plot_ly( type = 'scatterpolar',fill = 'toself',mode='markers') %>%
add_trace(r = as.numeric(Centers.km[1,]),theta = colnames(Centers.km),name = 'Group 1')%>%
add_trace(r = as.numeric(Centers.km[2,]),theta = colnames(Centers.km),name = 'Group 2')%>%
add_trace(r = as.numeric(Centers.km[3,]),theta = colnames(Centers.km),name = 'Group 3')%>%
layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))
```
To conclude, Group 1 is the most powerful group of Pokemons, Group 2 is less powerful than Group 1, and Group 3 is the least powerful group of Pokemons but has higher base_happiness and capture_rate.
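As announced earlier, the Pokemons can now be projected on the factor map of the PCA and coloured by their k-means cluster; a minimal sketch with factoextra:

```{r pcamap}
# Individuals on the first factor map, coloured by k-means cluster
fviz_pca_ind(res.pca, geom = "point",
             habillage = as.factor(km$cluster),
             addEllipses = TRUE, legend.title = "k-means cluster")
```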
## Clustering using Hierarchical clustering:
```{r hier}
d=dist(X,method = 'euclidean')
hc=hclust(d,method = 'ward.D')
classesHC=cutree(hc,k=3) # Classes returned by the hierarchical clustering
# Function to compute the centroid of cluster i
clust.centroid = function(i, dat, classes) {
  ind = (classes == i)
  colMeans(dat[ind, ])
}
Centers.hc=sapply(unique(classesHC), clust.centroid, X, classesHC)
Centers.hc=data.frame(t(Centers.hc))
table(km$cluster,classesHC) # Cross-tabulate the k-means clusters against the hierarchical clusters
```
Group 1 of the k-means clustering is mainly classified in Group 3 of the hierarchical clustering, Group 2 is mainly classified in Group 1 of the hierarchical clustering (419 Pokemons are in Group 1 and only 2 are in Group 3), and Group 3 is split between Group 2 and Group 3.
```{r randind}
rand.index(km$cluster,classesHC)
```
A Rand index of 0.72 indicates a reasonable similarity between the k-means results and the hierarchical clustering results.
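Since the raw Rand index does not correct for chance agreement, the chance-corrected version from the same fossil package gives a stricter comparison:

```{r adjrand}
# Adjusted Rand index between k-means and hierarchical clustering
adj.rand.index(km$cluster, classesHC)
```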
```{r clustval}
plot_ly( type = 'scatterpolar',fill = 'toself',mode='markers') %>%
add_trace(r = as.numeric(Centers.hc[1,]),theta = colnames(Centers.hc),name = 'Group 1')%>%
add_trace(r = as.numeric(Centers.hc[2,]),theta = colnames(Centers.hc),name = 'Group 2')%>%
add_trace(r = as.numeric(Centers.hc[3,]),theta = colnames(Centers.hc),name = 'Group 3')%>%
layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))
```
Even though some Pokemons switch from one cluster to another under hierarchical clustering, the characteristics of the groups remain mostly the same; the exception is Group 2, which has the lowest values for every score other than base_happiness and capture_rate.
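The hierarchical structure itself can be visualised by cutting the Ward dendrogram into the same 3 groups (a sketch with factoextra):

```{r dendro}
# Dendrogram of the Ward clustering, cut into 3 groups
fviz_dend(hc, k = 3, show_labels = FALSE, rect = TRUE)
```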
## Clustering Using SOM (Self-Organizing Map):
```{r som}
set.seed(7)
# Train a SOM on a 3 x 1 hexagonal grid (3 units, i.e. 3 groups)
sommap <- som(X, grid = somgrid(3, 1, "hexagonal"))
plot(sommap)
```
The first group is characterised by a high capture_rate and base_happiness, while the second group of Pokemons is characterised by higher scores in every other variable.
The third group is characterised by low values in every variable other than base_happiness.
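To see how the Pokemons are distributed over the three SOM units, the counts plot from the kohonen package (and a simple table of the unit assignments) can be used:

```{r somcounts}
# Number of Pokemons mapped to each SOM unit
plot(sommap, type = "counts")
table(sommap$unit.classif)
```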
```{r rand}
somc=sommap$unit.classif
table(km$cluster,somc)
rand.index(km$cluster,somc) # Compare the k-means and SOM clusterings
```
The groups of Pokemons obtained with the SOM are nearly the same as the groups obtained with k-means.
The Rand index in this case is 0.99, which indicates a very strong similarity between the SOM clustering results and the k-means results.
## Clustering Using EM (Expectation Maximization):
```{r EM}
EMB <- Mclust(X) # Gaussian mixture fitted by EM; the number of components is selected by BIC
summary(EMB)
summary(EMB$BIC)
```
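Mclust selects its own number of components via the BIC, so it will not necessarily retain 3 groups; still, the EM classification can be compared with the k-means clusters in the same way as the other methods, here with the adjusted Rand index provided by mclust:

```{r emcompare}
# Cross-tabulate the EM classification against the k-means clusters
table(km$cluster, EMB$classification)
adjustedRandIndex(km$cluster, EMB$classification)
```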