# Basketball shots
```{julia chap_8_libraries, cache = TRUE, results = FALSE, echo = FALSE}
cd("./08_basketball_shots/")
import Pkg
Pkg.activate(".")
using Markdown
using InteractiveUtils
using Plots
using Random
using CSV
using DataFrames
using StatsPlots
using Images
using StatsBase
using Turing
using StatsFuns
```
```{julia, results = FALSE, echo=FALSE}
Random.seed!(0)
```
When playing basketball we can ask ourselves: how likely is it to score from a given position on the court? To answer this question we are going to use
data from NBA games of the 2006-2007 season. Let's load the data into our Julia session and inspect the column names:
```{julia}
shots = CSV.read("./08_basketball_shots/data/seasons/shots_season_2006_2007.csv", DataFrame);
names(shots)
```
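Before going further, it can help to preview the first few rows and some basic summaries. This quick check is not part of the original analysis, but it only uses standard DataFrames functions:

```julia
first(shots, 5)    # peek at the first rows
describe(shots)    # per-column summary statistics
```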
The columns of our dataset are the following:
- result: Stores a 1 if the shot was scored and a 0 otherwise.
- x: The x-component of the player's position when the shot was taken.
- y: The y-component of the player's position when the shot was taken.
- period: Indicates in which of the four periods of the game the shot was taken.
- time: The exact time the shot was taken.
Below we show a sketch of a basketball court, its dimensions and how to interpret the data in the table. It will be useful to change
the shot coordinates from Cartesian ($x$ and $y$) to polar coordinates, so we can think about the problem in terms of the distance
and the angle from the hoop.
```{julia chap_8_plot_1, echo=FALSE}
plot(load("./08_basketball_shots/images/basket_court.png"), axis=nothing, border=:none)
```
```{julia, results = FALSE}
# distance from the hoop and angle with respect to the x axis (θ in the sketch)
shots[!, :distance] = sqrt.(shots.x .^ 2 + shots.y .^ 2)
shots[!, :angle] = atan.(shots.y ./ shots.x)
# discard shots taken within one unit of the hoop
filter!(x -> x.distance > 1, shots)
```
So the $x$ and $y$ axes have their origin at the hoop, and we compute the distance from this point to where the shot was made.
We also compute the angle with respect to the $x$ axis, shown as θ in the sketch.
Let's now plot where the shots were made with a two-dimensional histogram. The more shots made in a region of the court, the brighter
that region is shown.
```{julia chap_8_plot_2}
histogram2d(shots.y[1:10000], shots.x[1:10000], bins=(50,30));
xlabel!("x axis");
ylabel!("y axis")
```
We see that the shots are fairly uniformly distributed around the hoop, except for distances very near to it. To see this better, we plot
the histogram along each axis, $x$ and $y$.
Since we are interested in the shots that scored, we filter those and plot the histogram of each axis.
```{julia , results = FALSE}
shots_made = filter(x -> x.result == 1, shots)
```
```{julia chap_8_plot_3}
histogram(shots_made.y[1:10000], legend=false, nbins=40);
xlabel!("x axis");
ylabel!("Counts")
```
```{julia chap_8_plot_4}
histogram(shots_made.x[1:10000], legend=false, nbins=45);
xlabel!("y axis");
ylabel!("Counts")
```
We can also summarize all this information with a wireframe plot, as shown below.
```{julia chap_8_plot_5}
h = fit(Histogram, (shots_made.y, shots_made.x), nbins=40);
wireframe(midpoints(h.edges[2]), midpoints(h.edges[1]), h.weights, zlabel="counts", xlabel="y", ylabel="x", camera=(40,40));
xlabel!("x");
ylabel!("y");
title!("Histogram of shots scored")
```
More shots are made as we get near the hoop, as expected.
It is worth noting that this is not the probability of scoring: we are only showing the distribution of shots scored, not how likely
it is to score from each position.
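Before modelling anything, a rough empirical reference can be obtained by binning the shots by distance and computing the fraction that scored in each bin. This is only a sketch: the bin edges are arbitrary and the distances are in the dataset's own units.

```julia
# Empirical scoring frequency per 2-unit distance bin (NaN where a bin is empty)
freqs = map(0:2:38) do lo
    idx = (shots.distance .>= lo) .& (shots.distance .< lo + 2)
    any(idx) ? mean(shots.result[idx]) : NaN
end
```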
## Modelling the scoring probability
The first model we are going to propose is a Bernoulli model.
A Bernoulli distribution results from an experiment with two possible outcomes, usually called a success and a failure. In our case, a
success is scoring the shot and a failure is missing it.
The only parameter needed in a Bernoulli distribution is the probability $p$ of success. We are going to model this parameter with a
logistic function:
```{julia chap_8_plot_6}
plot(logistic, legend=false);
xlabel!("x");
ylabel!("Probability");
title!("Logistic function (x)")
```
The reason for choosing a logistic function is that we are going to model the probability of scoring as a function of some variables, for
example the distance to the hoop, and we want the scoring probability to increase as we get closer to it. Also, our probability needs to be
between 0 and 1, so the logistic function is a convenient way to map our values into that range.
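As a quick check (not in the original text), `logistic` from StatsFuns is the standard sigmoid $1/(1+e^{-x})$, and it maps any real number strictly into the interval (0, 1):

```julia
# logistic(x) = 1 / (1 + exp(-x)) maps the whole real line into (0, 1)
extrema(logistic.(-10:0.5:10))   # ≈ (4.5e-5, 0.99995)
```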
The probabilistic model we are going to propose is
$p \sim logistic(a + b \cdot distance[i] + c \cdot angle[i])$
$outcome[i] \sim Bernoulli(p)$
As you can see, this is a very general model, since we haven't yet specified anything about $a$, $b$ and $c$. Our approach will be to propose
prior distributions for each one of them and check if our proposals make sense.
## Prior Predictive Checks: Part I
Let's start by proposing Normal prior distributions for $a$, $b$ and $c$, i.e., Gaussian distributions with mean 0 and variance 1. Let's sample
from them and see what the possible predictions for $p$ look like:
$a\sim N(0,1)$
$b\sim N(0,1)$
$c\sim N(0,1)$
```{julia, results = FALSE}
possible_distances = 0:0.01:1
possible_angles = 0:0.01:π/2
n_samples = 100
# we sample possible values for a and b
a_prior_sampling = rand(Normal(0,1), n_samples)
b_prior_sampling = rand(Normal(0,1), n_samples)
# with the sampled values of a and b, we predict how p would look
predicted_p = []
for i in 1:n_samples
    push!(predicted_p, logistic.(a_prior_sampling[i] .+ b_prior_sampling[i] .* possible_distances))
end
```
```{julia chap_8_plot_7}
plot(possible_distances, predicted_p[1], legend = false, color=:blue);
for i in 2:n_samples
    plot!(possible_distances, predicted_p[i], color=:blue);
end
xlabel!("Normalized distance");
ylabel!("Predicted probability");
title!("Prior predictive values for p")
```
Each one of these curves is the result of the logistic function with one sampled combination of $a$ and $b$ as parameters.
We see that some of the predicted values of $p$ don't make sense. For example, if $b$ takes positive values, we are saying that the
probability of scoring increases as we move away from the hoop. Since we want the coefficient multiplying the distance to be negative, we should
propose a distribution we can sample from whose values always have the same sign. One example is the LogNormal distribution,
which always gives positive numbers; multiplying these values by $-1$ then guarantees a negative coefficient.
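As a quick sanity check (not in the original text), every LogNormal draw is positive, so its negation is always negative:

```julia
# every LogNormal draw is strictly positive, so -draw is strictly negative
samples = rand(LogNormal(1, 0.25), 10_000)
all(samples .> 0), extrema(-samples)
```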
We update our model as follows:
$a\sim Normal(0,1)$
$b\sim LogNormal(1,0.25)$
$c\sim Normal(0,1)$
Repeating the sampling process with the updated model, we get the following predictions for $p$:
```{julia chap_8_plot_8, results = FALSE}
b_prior_sampling_negative = rand(LogNormal(1,0.25), n_samples)
predicted_p_improved = []
for i in 1:n_samples
    push!(predicted_p_improved, logistic.(a_prior_sampling[i] .- b_prior_sampling_negative[i] .* possible_distances))
end
```
```{julia chap_8_plot_9}
plot(possible_distances, predicted_p_improved[1], legend = false, color=:blue);
for i in 2:n_samples
    plot!(possible_distances, predicted_p_improved[i], color=:blue);
end
xlabel!("Normalized distance");
ylabel!("Predicted probability");
title!("Prior predictive values with negative LogNormal prior")
```
Now the predicted $p$ curves behave as expected: as the shooting distance increases, the probability of scoring decreases.
We have set some boundaries in the form of a different prior probability distribution. This process is called a prior predictive check.
Essentially, it is an iterative process in which we verify that our initial proposals make sense.
Now that we have the expected behaviour for $p$, we define our model and compute the posterior distributions using our data points.
### Defining our model in Turing and computing posteriors
Now we define our model in the Turing framework in order to sample from it:
```{julia, results = FALSE}
@model logistic_regression(distances, angles, result, n) = begin
    # Set priors.
    a ~ Normal(0,1)
    b ~ LogNormal(1,0.25)
    c ~ Normal(0,1)
    for i in 1:n
        p = logistic(a - b*distances[i] + c*angles[i])
        result[i] ~ Bernoulli(p)
    end
end
```
```{julia, results = FALSE}
n = 1000
```
The output of the sampling also gives us information about the sampled values of our parameters, such as their mean, standard deviation and
other statistics.
```{julia, results = FALSE}
# Sample using NUTS, concatenating 3 chains.
chain = mapreduce(c -> sample(logistic_regression(shots.distance[1:n] ./ maximum(shots.distance[1:n]), shots.angle[1:n], shots.result[1:n], n), NUTS(), 1500), chainscat, 1:3);
```
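Since the chunk above suppresses the sampler's printed output, we can display the summary explicitly; `summarystats` (from MCMCChains, loaded through Turing) is one way to do it, assuming the `chain` object defined above:

```julia
# posterior mean, standard deviation, ESS, etc. for a, b and c
summarystats(chain)
```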
#### Traceplot
In the plot below we show a traceplot of the sampling.
When we run a model and compute the posterior, we obtain values sampled from the posterior distributions, and we can tell our sampler how many
of them we want. A traceplot simply shows them in sequential order. We can also plot the distribution of those values, which is
what is shown next to each traceplot.
```{julia chap_8_plot_10}
plot(chain, dpi=60)
```
```{julia, results = FALSE}
a_mean = mean(chain[:a])
b_mean = mean(chain[:b])
c_mean = mean(chain[:c])
```
Now plotting the scoring probability using the posterior distributions of $a$, $b$ and $c$ for an angle of 45°, we get:
```{julia, results = FALSE}
p_constant_angle = []
for i in 1:length(chain[:a])
    push!(p_constant_angle, logistic.(chain[:a][i] .- chain[:b][i] .* possible_distances .+ chain[:c][i] .* π/4))
end
```
```{julia chap_8_plot_11}
plot(possible_distances, p_constant_angle[1], legend=false, alpha=0.1, color=:blue);
for i in 2:1000
    plot!(possible_distances, p_constant_angle[i], alpha=0.1, color=:blue);
end
xlabel!("Normalized distance");
ylabel!("Probability");
title!("Scoring probability vs Normalized distance (angle=45°)")
```
We clearly see how our initial beliefs were adjusted by the data. Although the initial behavior, scoring probability decreasing with increasing
distance, was already correct, there was a lot of uncertainty about its expected value. This last plot shows how the possible values of `p` narrowed
as we updated our beliefs with data.
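One way to quantify this narrowing (a quick check, not in the original) is to compare the spread of the prior draws of $b$ with the spread of its posterior samples, which should typically be much smaller:

```julia
# spread of b under the prior vs under the posterior
std(rand(LogNormal(1, 0.25), 10_000)), std(chain[:b])
```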
We already plotted how the scoring probability changes with the distance from the hoop, but what about the angle? In principle, we might
think it should not be very relevant, since there is no special advantage in changing the angle from which we shoot, but let's
see what our model says when shown some real data. To keep the distance constant and only account for the change in angle, we
plot for a middle distance, i.e., $0.5$ in normalized distance.
```{julia chap_8_plot_12, results = FALSE}
p_constant_distance = []
for i in 1:length(chain[:a])
    push!(p_constant_distance, logistic.(chain[:a][i] .- chain[:b][i] .* 0.5 .+ chain[:c][i] .* possible_angles))
end
```
```{julia chap_8_plot_13}
plot(rad2deg.(possible_angles), p_constant_distance[1], legend=false, alpha=0.1, color=:blue);
for i in 2:1000
    plot!(rad2deg.(possible_angles), p_constant_distance[i], alpha=0.1, color=:blue);
end
xlabel!("Angle [deg]");
ylabel!("Probability");
title!("Scoring probability vs Angle (mid distance)")
```
We see that the model predicts an almost constant average probability as we vary the angle.
This is consistent with what we already suspected, although there is more uncertainty about the scoring probability as we move from
0° to 90°. In conclusion, the angle doesn't seem very important when shooting; the distance from the hoop is the relevant variable.
## New model and prior predictive checks: Part II
Now we propose another model with the form:
$p \sim logistic(a + b^{distance[i]} + c \cdot angle[i])$
Here we basically changed how the model depends on distance: we now raise the parameter $b$ to the power of the distance from the hoop.
As with the previous model, we should ask ourselves: would every possible value of $b$ make sense? Let's dig into this question.
Below we plot four functions with four different possible values of $b$, keeping in mind that the values of $x$, the normalized distance, go
from 0 to 1. These functions are
```{julia,results = FALSE}
f1(x) = 0.3^x
f2(x) = 1.5^x
f3(x) = -0.3^x
f4(x) = -1.5^x
```
and their plots are
```{julia chap_8_plot_14}
plot(0:0.01:1, f1, label="f1: 0 < b < 1", xlim=(0,1), ylim=(-2,2), lw=3);
plot!(0:0.01:1, f2, label="f2: b > 1", lw=3);
plot!(0:0.01:1, f3, label="f3: -1 < b < 0", lw=3);
plot!(0:0.01:1, f4, label="f4: b < -1", lw=3);
xlabel!("Normalized distance");
title!("Prior Predictive influence of distance")
```
As we can see, the qualitative behavior of these functions is different for each of the chosen values of $b$. The only value that
is consistent with our common sense is the one used for $f_1$, since we want a decreasing function of the distance. In other
words, we want $b$ to be positive and between 0 and 1.
Now that we have limited the range of values that $b$ can take, we should propose a probability distribution that satisfies these constraints.
For this, we propose a Beta distribution with parameters $α=2$ and $β=2$.
We chose this distribution because it is only defined on the interval we are interested in, and we chose the parameter values so that the distribution
is symmetric, i.e., not biased towards the start or the end of the range from 0 to 1.
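A quick numerical check (not part of the original code) that Beta(2, 2) is supported on (0, 1) and centered at 0.5:

```julia
# mean of Beta(2,2) is 1/2 and all draws fall inside (0, 1)
mean(Beta(2, 2)), extrema(rand(Beta(2, 2), 10_000))
```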
```{julia chap_8_plot_15}
plot(Beta(2,2), xlim=(-0.1,1), legend=false);
title!("Prior distribution for b")
```
### Defining the new model and computing posteriors
We then define our model and compute the posterior as before.
```{julia}
@model logistic_regression_exp(distances, angles, result, n) = begin
    # Set priors.
    a ~ Normal(0,1)
    b ~ Beta(2,2)
    c ~ Normal(0,1)
    for i in 1:n
        p = logistic(a + b^distances[i] + c*angles[i])
        result[i] ~ Bernoulli(p)
    end
end
```
```{julia chap_8_sample_1, results = FALSE}
# Sample using HMC.
chain_exp = mapreduce(c -> sample(logistic_regression_exp(shots.distance[1:n] ./ maximum(shots.distance[1:n] ), shots.angle[1:n], shots.result[1:n], n), HMC(0.05, 10), 1500), chainscat, 1:3)
```
Plotting the traceplot, we again see that the angle variable has little importance, since the parameter $c$, which captures the influence
of the angle on the scoring probability, is centered at 0.
```{julia chap_8_plot_16}
plot(chain_exp, dpi=55)
```
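We can verify this numerically from the chain (a quick check, not in the original):

```julia
# posterior mean and standard deviation of c; a mean near 0 relative to its
# spread suggests the angle contributes little to the scoring probability
mean(chain_exp[:c]), std(chain_exp[:c])
```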
```{julia, results = FALSE}
p_exp_constant_angle = []
for i in 1:length(chain_exp[:a])
    push!(p_exp_constant_angle, logistic.(chain_exp[:a][i] .+ chain_exp[:b][i] .^ possible_distances .+ chain_exp[:c][i] .* π/4))
end
```
Using the computed posterior distributions, we plot the probability of scoring as a function of the normalized distance and obtain the plot
shown below.
```{julia chap_8_plot_17}
plot(possible_distances, p_exp_constant_angle[1], legend=false, alpha=0.1, color=:blue);
for i in 2:1000
    plot!(possible_distances, p_exp_constant_angle[i], alpha=0.1, color=:blue);
end
xlabel!("Normalized distance");
ylabel!("Probability");
title!("Scoring probability vs Normalized distance (angle=45°)")
```
Since we have two variables, we can also plot the mean probability of scoring as a function of both and obtain a surface plot:
```{julia,results = FALSE}
# build a 100x100 grid of (angle, distance) pairs and evaluate the posterior mean model on it
angle_ = collect(range(0, stop=π/2, length=100))
dist_ = collect(range(0, stop=1, length=100))
it = Iterators.product(angle_, dist_)
matrix = collect.(it)
values = reshape(matrix, (10000, 1))
angle_grid = getindex.(values, [1])
dist_grid = getindex.(values, [2])
z = logistic.(mean(chain_exp[:a]) .+ mean(chain_exp[:b]) .^ dist_grid .+ mean(chain_exp[:c]) .* angle_grid)
```
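The chapter displays a pre-rendered image of the surface; a minimal sketch of how it could be produced with Plots' `surface` recipe, assuming the grid variables defined above, is:

```julia
# reshape z back onto the (angle, distance) grid and draw the surface
surface(dist_, angle_, reshape(z, (100, 100)),
        xlabel="Normalized distance", ylabel="Angle [rad]",
        zlabel="Scoring probability")
```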
```{julia chap_8_plot_18, echo = FALSE}
plot(load("./08_basketball_shots/images/img1.png"))
```
The plot shows the expected behavior, an increasing probability of scoring as we get near the hoop. We also see that there is almost no variation
of the probability with the angle.
## Does the Period affect the scoring probability?
We have been using the scoring data across all periods, but how does the scoring probability change from one period to another?
We will propose a model and compute the posterior of its parameters with the scoring data of each of the four periods, using the exact
same model for all of them. The angle variable is dropped from the model, as we have already seen that it is of little
importance.
We filter our data by period and proceed to estimate our posterior distributions.
```{julia ,results = FALSE}
shots_period1 = filter(x -> x.period == 1, shots)
```
```{julia, results = FALSE}
@model logistic_regression_period(distances, result, n) = begin
    # Set priors.
    a ~ Normal(0,1)
    b ~ Beta(2,5)
    for i in 1:n
        p = logistic(a + b^distances[i])
        result[i] ~ Bernoulli(p)
    end
end
```
```{julia , results = FALSE}
n_ = 500
```
```{julia sample_period_1, results = FALSE}
# Sample using HMC.
chain_period1 = mapreduce(c -> sample(logistic_regression_period(shots_period1.distance[1:n_] ./ maximum(shots_period1.distance[1:n_] ), shots_period1.result[1:n_], n_), HMC(0.05, 10), 1500), chainscat, 1:3)
```
```{julia, results = FALSE}
shots_period2 = filter(x -> x.period == 2, shots)
```
```{julia sample_period_2, results = FALSE}
# Sample using HMC.
chain_period2 = mapreduce(c -> sample(logistic_regression_period(shots_period2.distance[1:n_] ./ maximum(shots_period2.distance[1:n_] ), shots_period2.result[1:n_], n_), HMC(0.05, 10), 1500),
chainscat,
1:3
);
```
```{julia , results = FALSE}
shots_period3= filter(x->x.period==3, shots);
```
```{julia sample_period_3, results = FALSE}
# Sample using HMC.
chain_period3 = mapreduce(c -> sample(logistic_regression_period(shots_period3.distance[1:n_] ./ maximum(shots_period3.distance[1:n_] ), shots_period3.result[1:n_], n_), HMC(0.05, 10), 1500),
chainscat,
1:3
);
```
```{julia, results = FALSE}
shots_period4 = filter(x -> x.period == 4, shots);
```
```{julia sample_period_4, results = FALSE}
# Sample using HMC.
chain_period4 = mapreduce(c -> sample(logistic_regression_period(shots_period4.distance[1:n_] ./ maximum(shots_period4.distance[1:n_]), shots_period4.result[1:n_], n_), HMC(0.05, 10), 1500),
chainscat,
1:3
);
```
```{julia,results = FALSE}
# mean curve for each period, plus a curve with each parameter shifted by one
# posterior standard deviation (used as a rough ribbon below)
p_period1 = logistic.(mean(chain_period1[:a]) .+ mean(chain_period1[:b]) .^ possible_distances)
p_period1_std = logistic.((mean(chain_period1[:a]) .+ std(chain_period1[:a])) .+ (mean(chain_period1[:b]) .+ std(chain_period1[:b])) .^ possible_distances)
p_period2 = logistic.(mean(chain_period2[:a]) .+ mean(chain_period2[:b]) .^ possible_distances)
p_period2_std = logistic.((mean(chain_period2[:a]) .+ std(chain_period2[:a])) .+ (mean(chain_period2[:b]) .+ std(chain_period2[:b])) .^ possible_distances)
p_period3 = logistic.(mean(chain_period3[:a]) .+ mean(chain_period3[:b]) .^ possible_distances)
p_period3_std = logistic.((mean(chain_period3[:a]) .+ std(chain_period3[:a])) .+ (mean(chain_period3[:b]) .+ std(chain_period3[:b])) .^ possible_distances)
p_period4 = logistic.(mean(chain_period4[:a]) .+ mean(chain_period4[:b]) .^ possible_distances)
p_period4_std = logistic.((mean(chain_period4[:a]) .+ std(chain_period4[:a])) .+ (mean(chain_period4[:b]) .+ std(chain_period4[:b])) .^ possible_distances)
```
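The ribbons below are built by plugging the parameters' means shifted by one standard deviation into the logistic. An alternative (sketched here under the assumption that the chains above are available, and not used in the plot) is to compute pointwise quantile bands directly from the posterior samples:

```julia
# pointwise 16%-84% band for period 1, computed from the posterior samples
curves_p1 = map(1:length(chain_period1[:a])) do i
    logistic.(chain_period1[:a][i] .+ chain_period1[:b][i] .^ possible_distances)
end
curve_mat = reduce(hcat, curves_p1)          # one column per posterior sample
p1_lower = [quantile(r, 0.16) for r in eachrow(curve_mat)]
p1_upper = [quantile(r, 0.84) for r in eachrow(curve_mat)]
```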
We now plot, for each period, the mean scoring probability together with a band of one standard deviation around it.
```{julia chap_8_plot_19}
plot(possible_distances, p_period4,ribbon=p_period4_std.-p_period4, color=:magenta, label="period4", fillalpha=.3, ylim=(0,0.6));
plot!(possible_distances, p_period2, color=:green, ribbon=p_period2_std.-p_period2, label="period2", fillalpha=.3);
plot!(possible_distances, p_period3, color=:orange, ribbon=p_period3_std.-p_period3, label="period3",fillalpha=.3);
plot!(possible_distances, p_period1,ribbon=p_period1_std.-p_period1, color=:blue, label="period1", fillalpha=.3);
xlabel!("Normalized distance");
ylabel!("Scoring probability")
```
Comparing the means of each period, we see that for periods 1 and 4, the first and the last, the scoring probability is slightly
higher than in the other two. There are several possible explanations for this. In the first period, players may be more
effective because they have the most energy. In period 4, fresh players have substituted tired ones and also have plenty of energy, while in
periods 2 and 3 there tend to be more tired players who have not yet been substituted. These hypotheses should be tested to know if they are
correct, but they are a starting point for further data analysis.
## Summary
In this chapter, we used NBA shooting data from the 2006-2007 season to analyze how the scoring probability is affected by variables
such as the distance from the hoop and the shooting angle.
First, we inspected the data by plotting a two-dimensional histogram of all the shots taken and histograms of the ones that scored.
Since our goal was to study the scoring probability, which is a Bernoulli trial situation, we decided to use a Bernoulli model.
As the only parameter needed in a Bernoulli distribution is the probability $p$ of success, we modeled $p$ with a logistic function:
$p \sim logistic(a + b \cdot distance[i] + c \cdot angle[i])$
We set the priors of the parameters $a$ and $c$ to Normal distributions and the prior of $b$ to a LogNormal one.
Thus, we constructed our logistic regression model and sampled from it using a Markov chain Monte Carlo (MCMC) algorithm.
To gain a better understanding of the sampling process, we made a traceplot that shows the sampled values in sequential order.
Later, we tried a more complex logistic regression model, similar to the first one but with a modified distance dependency:
$p \sim logistic(a + b^{distance[i]} + c \cdot angle[i])$
We set the prior distribution of $b$ to a Beta distribution, constructed this second logistic regression model, sampled from it and plotted the
results.
Finally, we analyzed whether the period of the game affects the scoring probability.