Remove corrupted lines in xvg files #126

Closed
hannahbaumann opened this issue May 3, 2021 · 6 comments · Fixed by #183

@hannahbaumann

Hi,

when I want to analyze free energy differences while simulations are still running, the last line of the xvg files is often corrupted (not fully written yet) and alchemlyb fails to do the analysis. Alchemical analysis has a feature that repairs those files, so I usually run that first and then run alchemlyb on the repaired xvg files. Is it possible to move that feature into alchemlyb?
https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py

@orbeckst
Member

orbeckst commented May 3, 2021

@hannahbaumann how would you want this feature to work, if it were in alchemlyb? Can you outline Python code?

@hannahbaumann
Author

I think for now it would be enough to check whether the last line has the correct length and to remove it if it is too short, similar to the removeCorruptLines function in this script in alchemical analysis: https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py
The function gets the length of the data from the .xvg header and then checks the length of (in this case) all lines, but it could also check just the last line.
Another issue that I've had in the past was that I accidentally restarted a simulation although it was still running and both simulations appended the data to the same .xvg file, resulting in duplicates in the file. In that case it would be helpful if alchemlyb can detect the duplicates and remove one of them. But I haven't written the code for that scenario yet.
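
The line-length check described above could look roughly like the following. This is a minimal sketch and not the code from corruptxvg.py: the function name `drop_short_lines` is hypothetical, and it assumes that every `@ sN legend` header entry describes one data column that follows a leading time column.

```python
def drop_short_lines(lines):
    """Return the lines of an XVG file without malformed data lines."""
    # Assumption: one "@ sN legend" entry per data column, plus one time column.
    n_legends = sum(
        1 for line in lines if line.startswith("@ s") and "legend" in line
    )
    expected = n_legends + 1

    kept = []
    for line in lines:
        is_header = line.startswith(("#", "@"))
        # Keep header lines untouched; keep data lines only if the field
        # count matches what the header announces.
        if is_header or len(line.split()) == expected:
            kept.append(line)
    return kept
```

Checking only the last line would be cheaper, but filtering all lines also catches a truncated line in the middle of the file, for example after a restart.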

@orbeckst
Member

orbeckst commented May 4, 2021

Do you want the alchemlyb XVG parser to just ignore corrupt lines or do you want the function to be "somewhere" in alchemlyb so that you can import it to use it as part of your workflow? I am trying to gauge where this would fit in.

@orbeckst
Member

orbeckst commented Aug 2, 2021

The current philosophy of the library is to read data and make them available as dataframes. A function that writes out the data does not fit particularly well into this scheme, I feel. However, we could consider adding a slower XVG parser as an alternative to the fast pandas.read_csv() based one

    df = pd.read_csv(xvg, sep=r"\s+", header=None, skiprows=header_cnt,
                     na_filter=True, memory_map=True, names=cols, dtype=np.float64,
                     float_precision='high')
(which is fairly well optimized and much faster than the simple Python-based XVG reader that we used previously). We could have an option for the extract_* functions that enables reading of corrupt datafiles. This could then switch to the slow line-by-line parser that could be based on the code https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py, with the difference that it needs to produce a dataframe in the same way as the existing code, except that incomplete lines are omitted.

I'd be happy to review a PR along the lines above.
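
For illustration, a slow line-by-line reader of the kind proposed here might be sketched as follows. The name `read_xvg_lenient` and its signature are hypothetical and not part of alchemlyb; the sketch assumes the column names `cols` have already been derived from the header.

```python
import numpy as np
import pandas as pd


def read_xvg_lenient(lines, cols):
    """Parse XVG data lines one by one, skipping incomplete ones."""
    rows = []
    for line in lines:
        if line.startswith(("#", "@")) or not line.strip():
            continue  # header, legend, or blank line
        fields = line.split()
        if len(fields) != len(cols):
            continue  # wrong number of fields, e.g. a truncated last line
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            continue  # a field was cut off mid-number, e.g. "1.2e-"
    return pd.DataFrame(rows, columns=cols, dtype=np.float64)
```

Unlike the `pd.read_csv()` call, this never raises on a malformed line; the price is a Python-level loop over every line of the file.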

@xiki-tempula
Collaborator

This is my understanding of this issue. There are two questions raised on this issue.

removes the last line if it's too short.

This is quite easy to solve. pd.read_csv will fill the missing fields with NaN when a line is incomplete. My solution for the Gromacs parser is to add

    # Drop the incomplete rows
    df.dropna(inplace=True)

to

.
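
The effect of `dropna` on a truncated last line can be seen in a small self-contained sketch. The data and the column names are made up for illustration, and `memory_map` and `skiprows` are left out because the input here is an in-memory string rather than an XVG file.

```python
import io

import numpy as np
import pandas as pd

# Toy data standing in for the body of an XVG file; the last line is cut off.
text = "0.0 1.0 2.0\n1.0 1.5 2.5\n2.0 1."
cols = ["time", "a", "b"]  # hypothetical column names

df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, names=cols,
                 na_filter=True, dtype=np.float64, float_precision="high")
n_before = len(df)  # the short line is still present, padded with NaN

# Drop the incomplete rows
df = df.dropna()
n_after = len(df)
```

Note that this only catches lines that lost whole fields. A line whose last number was cut off but still parses as a float (such as `1.` above, had it been in the final column) would pass through unnoticed.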

The other problem, duplicates in the file, is solved with alchemlyb.preprocessing.subsampling.statistical_inefficiency(drop_duplicates=True), which will drop the duplicates.
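
As a rough illustration of what dropping duplicates means for data indexed by time, here is a pandas-only sketch. It is not the code path inside `statistical_inefficiency`, and the toy frame with a single `time` index is an assumption made for the example.

```python
import pandas as pd

# Toy frame indexed by time; the frame at t = 1.0 was written twice,
# as could happen when two runs append to the same file.
df = pd.DataFrame(
    {"a": [1.0, 1.5, 1.5, 2.0]},
    index=pd.Index([0.0, 1.0, 1.0, 2.0], name="time"),
)

# Keep the first row for every repeated index value.
deduplicated = df[~df.index.duplicated(keep="first")]
```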

The only question is how we define the boundary between parser and preprocessing. Should the removal of corrupted lines and of duplicates be done in the parser, or should it go into preprocessing, so that the parser retains as much of the original information as possible?

@orbeckst
Member

I would consider it a preprocessing step, like cleaning data.

@orbeckst orbeckst added the GROMACS MD engine label Oct 20, 2021
orbeckst pushed a commit that referenced this issue Apr 13, 2022
* Fix #126 and #171 
* more robust gmx parser: skip NaN and incomplete lines in XVG files with filter=True; performance seems similar
  (see PR #183)
* filter=True is now DEFAULT
* add tests; set older tests to use filter=False for backwards-compatibility
* Update CHANGES