Discrepancy in MVPA between different input methods #1178

Closed
pinweichen opened this issue Jul 31, 2024 · 16 comments

Comments

@pinweichen

Hi there,
I was testing GGIR's read.myacc.csv functionality and noticed a discrepancy in the results for the same file depending on whether I input it as gt3x or as a customized csv. I am building a custom pipeline that applies similar preprocessing steps (e.g., resampling, imputation, or calibration) to data from different actigraphy devices.

Here are my two comparisons:
(1) I input gt3x files directly into GGIR and get one result.
(2) I converted a gt3x file into a CSV using read.gt3x::read.gt3x. I then added a one-line header containing the information needed by the "rmc." arguments. GGIR ran through and produced a result.

However, I noticed a discrepancy in MVPA values between the input methods. The intensity is larger when I specify rmc.doresample = T and rmc.check4timegaps = T. If I instead use an approx function to impute all time gaps myself and turn off rmc.doresample and rmc.check4timegaps, the intensity becomes so small that there is barely any MVPA in part 2.
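
For reference, custom gap imputation with approx could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code used for the files in this issue: the helper name resample_linear, the 10 Hz sampling frequency, and the synthetic data are all made up for the example.

```r
# Hypothetical helper (not part of GGIR): resample irregular samples onto a
# regular time grid using linear interpolation via base R approx().
resample_linear <- function(time, acc, sf) {
  # time: numeric UNIX seconds; acc: numeric matrix with one column per axis
  grid <- seq(from = time[1], to = time[length(time)], by = 1 / sf)
  out <- sapply(seq_len(ncol(acc)), function(j) {
    approx(x = time, y = acc[, j], xout = grid, method = "linear")$y
  })
  colnames(out) <- colnames(acc)
  data.frame(time = grid, out)
}

# Synthetic 10 Hz recording with a 2-second gap between t = 1 and t = 3
sf <- 10
time <- c(seq(0, 1, by = 1 / sf), seq(3, 4, by = 1 / sf))
acc <- cbind(x = sin(time), y = cos(time), z = rep(1, length(time)))
resampled <- resample_linear(time, acc, sf)
nrow(resampled)
```

Linearly interpolated values across a long gap form a straight line, so the filled segment carries no movement signal of its own.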

My question is if there are any calibration or normalization steps that I'm missing that exist for gt3x but not for the customized CSV files. I know there is a CSV input for ActiGraph data. However, I would like to standardize some preprocessing for all data from different actigraphy brands. I want to use the custom CSV input for the GGIR and produce similar results as if I input gt3x directly.

I borrowed the debug issue format here.

To Reproduce
version of GGIR (3.0-0). We started the project when this version was available.

  1. Sensor brand: ActiGraph

  2. Data format: customized CSV with imputation steps using approx function

  3. Approximate recording duration 7 days

  4. Are you using a sleep diary to guide sleep detection: NO

  5. Copy of R command used:

  6. I customized some parts of the rmc. functions to fit the header name reading of my customized csv.
    rmc.firstrow.header = 1,
    rmc.header.length = 1,
    rmc.firstrow.acc = 2, # first row is header
    rmc.col.time = 1,
    rmc.col.acc = 2:4,
    rmc.unit.time = "UNIXsec",
    rmc.headername.sf = "Sampling_frequency",
    rmc.headername.sn = "sensor_type",
    rmc.headername.recordingid = "filename",
    rmc.header.structure = "std",
    rmc.doresample = T,
    rmc.check4timegaps = T

  7. Have you tried processing your data based on GGIR's default argument values? Does the issue you report still appear?
    Yes, I have. The file can run. The results are different.

Expected behavior
I'm hoping to have the same GGIR results between the gt3x direct input and the customized csv of the same file.

I provide config files if that helps.
The original gt3x input
config_direct_input.csv

The customized csv with all time gaps imputed
config_custom_impute.csv

The customized csv without customized impute but turned on rmc.doresample and rmc.check4timegaps.
config_custom_csv_resample_on_timegap_on.csv

I can also provide example data and output folders if you need them.

Desktop (please complete the following information):

  • OS: macOS 13.6.7
  • GGIR Version 3.0-0
  • Chips: Apple M2

Thank you very much.

@vincentvanhees
Member

I assumed that GGIR uses the same code for both data formats.

  1. I am only wondering now whether this recent commit is causing trouble by attempting to look for and impute gaps twice when rmc.imputegaps = TRUE.

  2. Could it be that the difference is in how you created the csv file? In GGIR the gt3x file is read with default read.gt3x arguments https://github.com/wadpac/GGIR/blob/master/R/g.readaccfile.R#L275-L276 and without using the imputezeros or clean options that read.gt3x offers.

  3. The MVPA extraction happens much further down the line and uses the same code for all data formats. So, I think that differences must be explained by the raw data itself or by how it is read. It may be good to compare the output of read.myacc.csv() to read.gt3x() output:

  • Do you see any obvious synchronisation problems when you plot the signals on top of each other, e.g. time zone difference?
  • Are time series the same length?
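
A minimal sketch of such a check on the raw data, assuming both readers' output has already been converted to data frames with columns time (UNIX seconds), x, y and z. The helper compare_raw and the synthetic data are hypothetical; in practice the two inputs would come from read.gt3x::read.gt3x() and GGIR::read.myacc.csv(), whose column names may differ.

```r
# Hypothetical helper (not part of GGIR): summarise differences between two
# raw data frames that each have columns time (UNIX seconds), x, y, z.
compare_raw <- function(a, b, tol = 1e-3) {
  n <- min(nrow(a), nrow(b))
  axes <- c("x", "y", "z")
  d <- abs(as.matrix(a[seq_len(n), axes]) - as.matrix(b[seq_len(n), axes]))
  list(
    same_length = nrow(a) == nrow(b),
    # a constant offset of whole hours hints at a time zone mismatch
    start_offset_sec = a$time[1] - b$time[1],
    max_abs_diff = max(d),
    within_tol = all(d <= tol)
  )
}

# Synthetic stand-ins for the output of the two readers
set.seed(1)
t0 <- 1694001600  # 2023-09-06 12:00:00 UTC
gt3x_like <- data.frame(time = t0 + (0:299) / 30,
                        x = rnorm(300, 0, 0.05),
                        y = rnorm(300, 0, 0.05),
                        z = rnorm(300, 1, 0.05))
# Same signal rounded to 3 decimals, as a csv export might do
csv_like <- transform(gt3x_like,
                      x = round(x, 3), y = round(y, 3), z = round(z, 3))
res <- compare_raw(gt3x_like, csv_like)
str(res)
```

Plotting one axis from each data frame against time on the same figure is a quick visual complement to these numbers.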

Note that saving numeric data to csv can in itself introduce some tiny rounding errors.
If that does not lead to a clear explanation, then we may need to compare the epoch level metrics produced by GGIR part 1. To get these, load the .RData file from the meta/basic subfolder of the output folder and inspect object M$metashort. These are the metric values.
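
The epoch level comparison could be sketched as follows. The helper load_metashort, the file paths and the fake milestone files are hypothetical and only there so the example runs without real data; the column names timestamp and ENMO are assumed to match what part 1 stores in M$metashort for your configuration.

```r
# Hypothetical helper (not part of GGIR): load object M from a part 1
# milestone file and return its epoch level metrics.
load_metashort <- function(rdata_path) {
  env <- new.env()
  load(rdata_path, envir = env)  # the file is expected to contain object M
  env$M$metashort
}

# Fake milestone files so the sketch runs without real data
make_fake <- function(path, enmo) {
  M <- list(metashort = data.frame(
    timestamp = format(as.POSIXct("2023-09-06 08:00:00", tz = "UTC") + 5 * (0:5),
                       "%Y-%m-%dT%H:%M:%S%z"),
    ENMO = enmo))
  save(M, file = path)
}
f_gt3x <- tempfile(fileext = ".RData")
f_csv <- tempfile(fileext = ".RData")
make_fake(f_gt3x, c(0.01, 0.02, 0.15, 0.20, 0.03, 0.01))
make_fake(f_csv, c(0.01, 0.02, 0.09, 0.12, 0.03, 0.01))

ms_gt3x <- load_metashort(f_gt3x)
ms_csv <- load_metashort(f_csv)

# Align on timestamp, then compare the metric epoch by epoch
both <- merge(ms_gt3x, ms_csv, by = "timestamp", suffixes = c(".gt3x", ".csv"))
both$diff <- both$ENMO.gt3x - both$ENMO.csv
summary(both$diff)
```

With real output, the two paths would point at the files in the meta/basic folders of the gt3x run and the csv run.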

@l-k-
Collaborator

l-k- commented Aug 7, 2024

@pinweichen have you tried running one of the data files where you see this issue through the latest version of GGIR?

I know you said you need GGIR 3.0-0 for your project, but there have been a few changes to raw data handling since last October when that version came out.

@pinweichen
Author

Thank you Lena and Vincent for responding.
To Vincent's questions:

  1. The duplicated time gap imputation steps do not affect the result. The discrepancy still exists when both rmc.imputegaps = FALSE and rmc.doresample = FALSE. I was also testing on GGIR 3.0-0.

2 & 3. I loaded the raw data in three ways and compared the results row by row: (a) the gt3x file with the default gt3x load method, (b) my customized csv with fread, and (c) the csv with the read.myacc.csv function from GGIR. All three are identical when loaded, and the time series have the same length.

I also compared the part 1 meta results by loading the meta/basic files. I noticed that in the C object the spheredata differ, as do the scale and offset. (Original_C = gt3x result. std_C = custom csv result.)

Screenshot 2024-08-08 at 2 56 22 PM
Screenshot 2024-08-08 at 2 56 37 PM

I've also tested the version when I turned on rmc.imputegaps = TRUE and rmc.doresample = TRUE.
Screenshot 2024-08-08 at 3 07 03 PM

It seems that these errors are introduced somewhere in the calibration steps. Are the calibration steps for custom csv handled differently from those for gt3x? Where can I dig further into the potential cause? Some header information is missing because the file was not read directly from gt3x. Which header information could be important for the calibration steps?
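
To illustrate why differing scale and offset values matter downstream, here is a rough sketch of how calibration coefficients feed into ENMO and a 100 mg threshold. It assumes coefficients are applied per axis as (raw + offset) * scale, which should be verified against the GGIR source; the helper names and both coefficient sets are invented for the example and are not the values from the screenshots.

```r
# Assumption: calibration coefficients are applied per axis as
# (raw + offset) * scale; check the GGIR source for the exact form.
apply_calibration <- function(acc, scale, offset) {
  sweep(sweep(acc, 2, offset, "+"), 2, scale, "*")
}

# ENMO: Euclidean norm minus one g, with negative values set to zero
enmo <- function(acc) {
  pmax(sqrt(rowSums(acc^2)) - 1, 0)
}

# Synthetic raw signal in g: gravity on z plus some movement
set.seed(42)
acc <- cbind(x = rnorm(3000, 0, 0.08),
             y = rnorm(3000, 0, 0.08),
             z = rnorm(3000, 1, 0.08))

# Two hypothetical coefficient sets, standing in for Original_C and std_C
cal_a <- apply_calibration(acc, scale = c(1, 1, 1), offset = c(0, 0, 0))
cal_b <- apply_calibration(acc, scale = c(0.98, 0.98, 0.98),
                           offset = c(0, 0, -0.01))

# Fraction of samples above a 100 mg threshold under each coefficient set
frac_a <- mean(enmo(cal_a) > 0.1)
frac_b <- mean(enmo(cal_b) > 0.1)
c(frac_a = frac_a, frac_b = frac_b)
```

The sketch thresholds single samples for simplicity, whereas MVPA in GGIR is derived from epoch level values, so it only indicates the direction of the effect.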

To Lena,
I haven't moved to GGIR 3.1-2 yet. I made some changes to the header reading portion of read.myacc.csv in my current pipeline, and I will need to port those changes to 3.1-2 before I test the custom csv. However, I've tested the output from read.myacc.csv: it matches what I get when I read the gt3x.

Thank you for your time and patience.

@vincentvanhees
Member

If you insist on using an older version of GGIR then you also have to live with all the bugs and inconsistencies it had.

If you want the issue to be fixed use the latest GGIR version.

If the issue is still present in the latest GGIR version then please clarify as this is currently unclear to me.

@pinweichen
Author

pinweichen commented Aug 11, 2024

Hi Vincent and Lena,
Thank you for your patience and help.

I adapted my custom csv file to the current version of the read.myacc.csv function in GGIR (v. 3.1-2). With no modification to the package, I ran my csv through GGIR and compared it with the gt3x run. However, part 1 became much slower and the same problem persists: the sphere data calibration used much more data than when reading gt3x, and the MVPA is incorrect compared to the gt3x version.

In addition, the new result contains more data than the original recording. The results show that imputation was done even though rmc.imputegaps = FALSE and rmc.doresample = FALSE. This extra data only appears in GGIR 3.1-2, not in GGIR 3.0-0, when the same data is run.

Here is an example of the data (attached) that I fed into GGIR. The data is usually 7 days or 14 days long. I would like to check with you if these header settings and the format are correct. I'm happy to provide the original data if that helps debug this.
example_data.csv

Here are the parameter settings:
GGIR::GGIR(
verbose = T,
nonwear_approach = "2023",
mode = c(1),
datadir = datadir,
outputdir = outputdir,
do.report = c(2,4,5),
HASIB.algo = algo_name,
Sadeh_axis = Sadeh_axis,

do.cal = T,
do.imp = T,
do.enmo = T,
do.anglez = T,
chunksize = 1,
do.parallel = T,

#------------
# Custom csv settings
#------------

rmc.firstrow.header = 1,
rmc.header.length = 8,
rmc.firstrow.acc = 10,
rmc.col.time = 1,
rmc.col.acc = 2:4,
rmc.unit.time = "UNIXsec",
rmc.headername.sf = "sample_frequency",
rmc.headername.sn = "device_serial_number",
rmc.headername.recordingid = "subjectID",
rmc.desiredtz = "EST5EDT",
rmc.doresample = F,
rmc.check4timegaps = F,

#------------

strategy = 1,
hrs.del.start = 0, hrs.del.end = 0,
maxdur = 0,
includedaycrit = 1,
qwindow = c(0,24),
mvpathreshold = c(100),
bout.metric = 6,
excludefirstlast = FALSE,
includenightcrit = 1,
cosinor = TRUE,

def.noc.sleep = 1,
outliers.only = TRUE,
criterror = 4,
do.visual = T,

threshold.lig = c(30), threshold.mod = c(100), threshold.vig = c(400),
boutcriter = 0.8, boutcriter.in = 0.9, boutcriter.lig = 0.8,
boutcriter.mvpa = 0.8, boutdur.in = c(1,10,30), boutdur.lig = c(1,10, 30),
boutdur.mvpa = c(1, 5, 10),
includedaycrit.part5 = 1/3,
iglevels = c(seq(0,4000,by=25),8000),
qlevels = c(c(1380/1440),c(1410/1440),c(1430/1440)),
#=====================
# Visual report
#=====================
timewindow = c("WW"),
visualreport = T
)

Here is the GGIR 3.1-2 gt3x original report.
Report_ggir312_original.gt3x copy.pdf
Here is the report from the GGIR 3.1-2 run with the csv version of the gt3x file.
Report_ggir312_no_timegap_no_resample_.csv.pdf

Thank you again for your help.

Best Regards,
Benny

@pinweichen
Author

Here are the M variables from meta/basics for original gt3x load-in and the custom csv load-in.
Original Load-in
Screenshot 2024-08-11 at 3 51 14 PM

Custom csv load-in
Screenshot 2024-08-11 at 3 51 04 PM

@vincentvanhees
Member

Hi Benny, I now see this issue is still open. Can you share the actual csv and gt3x file with me?

@pinweichen
Author

pinweichen commented Sep 26, 2024

Yes, I can. Sorry for the delay. I will send it to your dropbox if that works.

@vincentvanhees
Member

Thanks for sharing the file. It seems you are not storing the data in a format that function read.myacc.csv can understand, and the timezone specification is also not correct.

Please always first check that read.myacc.csv on its own generates correct output. In this and in the documentation you will find examples. There is no point in comparing MVPA estimates and all the other aspects of the GGIR output without first reviewing this initial file reading step.

  • Your timestamps are in format 2023-09-06T08:00:00.000-0400 while the documentation does not indicate that this format can be read by read.myacc.csv. I have tried using your parameter set as input to read.myacc.csv and for me the timestamps in the output do not match the timestamps in the data. If you want this timestamp format to be recognisable then I am happy to work on that as a paid consultancy.

  • You are specifying rmc.desiredtz while the function gives a warning that this argument will be deprecated, please use desiredtz. It looks like this needs to be updated in the documentation. Further, you are setting rmc.desiredtz to "EST5EDT" which is not a valid timezone definition, see documentation and examples for valid values. In your case it may need to be "America/New_York" or similar, because that allows GGIR to automatically account for DST.

  • Further, I am still not sure I understand why you want to convert gt3x to csv. You only risk inconsistencies without a clear advantage for research. For deriving calibration coefficients GGIR does not use a specific standardised amount of data across data formats as we define data volumes in different ways across formats (rows, pages, seconds, blocks), which will explain some (minor) differences in derived calibration values and by that in all acceleration values. It could be standardised but I never did as this would be extra work without a clear advantage for research. So, if you want to compare and take out this influence then maybe turn calibration off with do.cal = FALSE.
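
As a rough illustration of the timestamp and time zone points above, the sketch below converts ISO 8601 strings with a numeric UTC offset to UNIX seconds before writing the csv, and shows how a location-based time zone name follows DST. The helper iso8601_to_unix is hypothetical and not part of GGIR; whether read.myacc.csv then reads the resulting column as intended still needs to be checked on its own output.

```r
# Hypothetical helper (not part of GGIR): convert ISO 8601 strings with a
# numeric UTC offset, e.g. "2023-09-06T08:00:00.000-0400", to UNIX seconds.
iso8601_to_unix <- function(x) {
  as.numeric(as.POSIXct(strptime(x, format = "%Y-%m-%dT%H:%M:%OS%z",
                                 tz = "UTC")))
}

unix_sec <- iso8601_to_unix(c("2023-09-06T08:00:00.000-0400",
                              "2023-09-06T08:00:00.500-0400"))

# With a location-based time zone name the UTC offset follows DST.
# 1699164000 is 2023-11-05 06:00:00 UTC, when clocks in New York go back.
dst_switch <- 1699164000
before <- as.POSIXct(dst_switch - 3600, origin = "1970-01-01",
                     tz = "America/New_York")
after <- as.POSIXct(dst_switch + 3600, origin = "1970-01-01",
                    tz = "America/New_York")
c(before = format(before, "%Y-%m-%d %H:%M:%S %z"),
  after = format(after, "%Y-%m-%d %H:%M:%S %z"))
```

Storing UNIX seconds keeps the recorded instants unambiguous, while the time zone name passed to GGIR only affects how those instants are expressed as local clock time.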

If you want me to investigate further, then I am happy to do that as a paid consultancy as this goes beyond simple bug fixing and user support that I try to do for free (it is free software but I need to pay my bills at the end of the month).

@vincentvanhees vincentvanhees mentioned this issue Oct 4, 2024
@pinweichen
Author

pinweichen commented Oct 23, 2024 via email

@vincentvanhees
Member

vincentvanhees commented Oct 24, 2024

In relation to the split recordings, did you consider parameter maxrecordinginterval? This was implemented for exactly the kind of scenario you describe where shorter recordings need to be appended.

Sounds good about the platform. Let me know if you run into any issues. I am generally very happy to accommodate other algorithms and research interests where possible. The difficulty I had with MIMSunit was its speed, which I tried to work on for a short period mHealthGroup/MIMSunit#36 but with extra efforts by the community it may be possible to make it sufficiently fast to be used more widely.
Best, Vincent

@pinweichen
Author

pinweichen commented Nov 7, 2024 via email

@vincentvanhees
Member

> In your MIMS unit adaptation, did you keep the non-wear and sleep algorithm which is based on the ENMO units separate from the MIMS unit calculation?

I only focussed on trying to speed up the MIMS units calculation outside GGIR, which happens based on raw data only and does not involve sleep or nonwear detection. Please note that GGIR does not use ENMO for sleep or nonwear detection.

> I was wondering if your team has made some progress already.

Just to clarify that I work alone as freelancer. GGIR as open-source project benefits from other contributors, but I do not think I can call them my team. I have not looked at MIMSunit since that exchange in 2021, but one group reached out to me recently to ask for the possibility to hire me to pick up this work. I have given them a quote and they will try to get funding for it.

@pinweichen
Author

pinweichen commented Nov 10, 2024 via email

@vincentvanhees
Member

You will not find sleep classification in the ms3.out folder, because the distinction between sleep and daytime rest only happens in part 4. Part 3 output only has the sustained inactivity bouts (rest periods).

Part 5 has the option to export the time series with all information (sleep/nonwear) included, and these time series are consistent with the classifications used in part 4 and 5.

I am overwhelmed at the moment by people reaching out to me to ask for free help and advice with their research. GGIR is open-source and I cannot be taking responsibility for everyone's problems without payment. Happy to help you as part of a paid consultancy. Alternatively, see https://wadpac.github.io/GGIR/ for elaborate documentation on how GGIR works.

@pinweichen
Author

pinweichen commented Nov 12, 2024 via email
