Modelling part 3 -
========================================================
autosize: true
# Machine Learning
## R-Ladies Freiburg
Wednesday 4th December
Elisa Schneider
Random Forests
========================================================
Tree Based Methods
========================================================
<small> We use the **Hitters** data set to predict a baseball player's **Salary** from
**Years**, the number of years he has played in the major leagues, and
**Hits**, the number of hits he made in the previous year. </small>
***
```{r, echo=FALSE}
library(ISLR)
library(ggplot2)
# Drop players with a missing salary before plotting
Hitters <- Hitters[!is.na(Hitters$Salary), ]
ggplot(Hitters, aes(x = Years, y = Hits, color = Salary)) +
  geom_point(size = 3) +
  theme(text = element_text(size = 28))
```
Tree Based Methods
========================================================
```{r, echo=FALSE}
library(ISLR)
library(ggplot2)
Hitters <- Hitters[!is.na(Hitters$Salary), ]
# Overlay the first two splits of a regression tree:
# Years < 4.5, then Hits < 117.5 within the Years >= 4.5 region
ggplot(Hitters, aes(x = Years, y = Hits, color = Salary)) +
  geom_point(size = 3) +
  geom_vline(xintercept = 4.5, linetype = "dashed",
             color = "orange", size = 2) +
  geom_segment(aes(x = 4.5, y = 117.5, xend = 25, yend = 117.5),
               linetype = "dashed", color = "orange", size = 2) +
  theme(text = element_text(size = 28))
```
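The tree behind those two splits can be fit directly; a minimal sketch using the `tree` package (not evaluated here), following the ISLR example that models log(Salary) on Years and Hits:
```{r, eval=FALSE}
library(tree)
# Regression tree for log-salary on the two predictors shown above;
# its top splits should correspond to the dashed lines (Years < 4.5, Hits < 117.5)
hit.tree <- tree(log(Salary) ~ Years + Hits, data = Hitters)
plot(hit.tree)
text(hit.tree, pretty = 0)
```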
***

Tree Based Method Vs. Linear Model
========================================================
<style>
/* heading for slides with two hashes ## */
.reveal .slides section .slideContent h2 {
font-size: 40px;
font-weight: bold;
color: violet;
}
/* ordered and unordered list styles */
.reveal ul,
.reveal ol {
font-size: 25px;
}
</style>
Which model is better?
It depends on the problem at hand.
- If the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well and will outperform a method such as a regression tree.
- If instead there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches.
***

High Variance of Trees
========================================================
- Decision trees suffer from high variance: if we split the training data into two parts at random and fit a decision tree to each half, the results we get could be quite different.
- A procedure with low variance will yield similar results across different training sets; linear regression tends to have low variance.
- A natural way to reduce the variance, and hence increase the prediction accuracy, is to take many training sets from the population, build a separate prediction model on each training set, and average the resulting predictions (see the sketch below).
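A minimal sketch of that averaging idea (bagging), using bootstrap samples of the Hitters data and `rpart` trees; `B` is an illustrative number of trees, not a recommendation:
```{r, eval=FALSE}
library(rpart)
set.seed(1)
B <- 100   # number of bootstrapped trees (illustrative choice)
# Fit one regression tree per bootstrap sample, then average the B predictions
preds <- sapply(1:B, function(b) {
  boot <- Hitters[sample(nrow(Hitters), replace = TRUE), ]
  predict(rpart(log(Salary) ~ Years + Hits, data = boot), newdata = Hitters)
})
bagged_pred <- rowMeans(preds)
```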
***

<style>
.small-code pre code {
font-size: 1em;
}
</style>
Correlation of Trees
========================================================
- If we fit six decision trees to sub-samples of the Boston housing data, the tops of the trees all have a very similar structure: although there are 13 predictor variables, all six trees have the lstat and rm variables driving the first few splits.
- This tree correlation limits how much averaging can reduce the variance. Random forests address it by randomly choosing a subset of the predictors as split candidates at each split.
***

Example
========================================================
class: small-code
```{r}
library(randomForest)
library(MASS)   # for the Boston housing data
housing <- Boston
# 500 trees; mtry = 6 predictors tried as split candidates at each split
RFmodel <- randomForest(medv ~ ., ntree = 500, mtry = 6, data = housing)
RFmodel
```
Example
========================================================
class: small-code
```{r}
# Out-of-bag error as a function of the number of trees
plot(RFmodel)
```
***
```{r}
# Variable importance: which predictors reduce node impurity the most
varImpPlot(RFmodel)
```
Artificial Neural Networks
========================================================
autosize: true
- An ANN is an information-processing model inspired by the biological neural system.
- It is composed of a large number of interconnected processing elements: the neurons.
- ANNs were designed to solve problems that are easy for humans but difficult for machines, such as recognizing patterns: distinguishing pictures of cats from pictures of dogs, or recognizing digits in images.
***

ANN structure
========================================================

Example
========================================================
class: small-code
```{r}
# Random 75/25 train-test split
index <- sample(1:nrow(housing), round(0.75 * nrow(housing)))
train <- housing[index, ]
test  <- housing[-index, ]
```
```{r}
# Min-max scale every column to [0, 1]; neural networks train better on scaled inputs
maxs <- apply(housing, 2, max)
mins <- apply(housing, 2, min)
scaled <- as.data.frame(scale(housing, center = mins, scale = maxs - mins))
train_ <- scaled[index, ]
test_  <- scaled[-index, ]
```
Example
========================================================
class: small-code
```{r}
library(neuralnet)
# Build the formula medv ~ crim + zn + ... programmatically from the column names
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
# Two hidden layers with 5 and 3 neurons; linear output because this is regression
nn <- neuralnet(f, data = train_, hidden = c(5, 3), linear.output = TRUE)
plot(nn)
```
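To judge the network on the held-out data, a minimal sketch (assuming the `nn`, `test_`, `maxs` and `mins` objects created above): predictions come out on the [0, 1] scale and have to be rescaled before comparing them with the observed `medv`.
```{r, eval=FALSE}
# Predict on the scaled test set (drop the response column)
pr.nn <- compute(nn, test_[, names(test_) != "medv"])
# Undo the min-max scaling so predictions are in the original medv units
pr.nn_ <- pr.nn$net.result * (maxs["medv"] - mins["medv"]) + mins["medv"]
test.r <- test_$medv * (maxs["medv"] - mins["medv"]) + mins["medv"]
# Test-set mean squared error of the network
MSE.nn <- mean((test.r - pr.nn_)^2)
MSE.nn
```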

Measuring performance
========================================================
class: small-code
To test a model we cannot use the same data that was used to fit it. There are different strategies for testing when we only have one data set (this could be a whole MeetUp on its own). One option is what we did before: split the data set in two, and then calculate different measures of performance on the held-out part:
- Regression
$$ RMSE = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^2}{n}} $$
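A minimal R sketch of this calculation, assuming `obs` holds the observed test-set values and `pred` the corresponding predictions:
```{r, eval=FALSE}
# Root mean squared error, in the same units as the response
rmse <- sqrt(mean((pred - obs)^2))
rmse
```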
Measuring performance
========================================================
class: small-code
Classification (two classes, coded 0/1)
```{r, eval=FALSE}
library(verification)
# Area under the ROC curve; obs is a 0/1 vector, pred the predicted probabilities
verification::roc.area(obs, pred)
```
Classification (more than two categories)
```{r, eval=FALSE}
# Confusion matrix as a plain table ...
cm <- table(obs, pred)
# ... or as a heatmap of the counts
longData <- as.data.frame(cm)   # columns: obs, pred, Freq
ggplot(longData, aes(x = pred, y = obs)) +
  geom_raster(aes(fill = Freq))
# caret reports the matrix plus accuracy, kappa and per-class statistics
library(caret)
caret::confusionMatrix(factor(pred), factor(obs))
```
[More Info Here](https://www.datascienceblog.net/post/machine-learning/performance-measures-multi-class-problems/)