-
Notifications
You must be signed in to change notification settings - Fork 686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TPRParser should not index from 0 #2364
Comments
This should be trivial to fix, but I am worried it will break many user scripts. |
That's an important consideration, but for consistency there should be a meaningful distinction between ids/resids (read from the file, indexed from 1) and indices/resindices (derived, indexed from 0). Warnings? |
As @jbarnoud already said, simply changing the behavior will probably get many users into trouble. Even with a warning it might be more annoying than useful for most users.
That's a valid point. Nevertheless, I can't think of any good reasoning for using 1-based indexing apart from the fact that Gromacs does it. How many other formats do we have that support resids? If there is at least one more that uses zero-based indexing, I strongly suppose we keep the current behavior. From my point of view, I don't want MDA to behave like I would expect from a particular format. I want it to behave the same way no matter what kind of format I feed to it. |
Results vary.
PSF and Desmond have atom ids from 0, but resids from 1. No formats other than TPRParser index resid from 0
The PSFParser reads the file and subtracts 1 for the atom id, rather than just indexing found atoms. (Why?) It also reads the file for resid but does not subtract 1.
The DMSParser queries the Desmond database for atom information. Desmond indexes from 0 (17.1.1). Apparently it indexes resids from 1, I guess. Everything else indexes ids and resids from 1, one way or another crd
pdb
Extended pdb
gro
xyz
txyz
prmtop
pqr
pdbqt
mol2
mmtf
GAMESS
DL Poly
HOOMD XML
Thoughts
IMO (as a user) the behaviour of MDAnalysis right now is to read attributes from file formats when it can, which includes ids/resids. This means that the same system in different formats might have different attributes available, which is currently the case. The 'canonical'(?) attributes of indices/resindices wouldn't change across formats.
You could also warn the other way -- keep current behaviour and warn new users? We've already had an indexing error tank a project in our group before, so I'm very alert for new ones! I can note diversions from expected behaviour in the documentation when I've worked out what expected behaviour is. |
The Following that logic, it is meaningful to index the TPR from 1 since the GRO and ITP files, the TPR is built from, are indexed from 1. Though, there are many scripts out there that will be silently broken, and I am afraid a warning will not be enough to prevent wrong analyses to issue. So I would be in favour of warning in the doc that it is 0-indexed. It is indeed important that diversion from the expected behaviour are minimized or at least documented. |
Thanks @lilyminium for the detailed analysis! So the expected behavior is definitely that
|
@jbarnoud I was meaning to ask about
etc. indices/resindices/segindices are the 0-indexed, continuous counters for atoms/residues/segments. I have an ongoing table of notes at the UserGuide wiki |
I may have mistaken
|
resid vs resnumHistoryAt some point in the dim and distance past the intention had been that resnum is whatever the file gives us (just a tag that we carry around) and resid is MDAnalysis own idea of residue numbering, starting at 0 – this is how VMD handles... IIRC(?). Clearly, that's not what we have. ixSince the topology rewrite #363 we really use the >>> u.atoms.indices is u.atoms.ix
True
>>> u.residues.resindices is u.residues.ix
True
>>> u.segments.segindices is u.segments.ix
True but >>> len(u.atoms.resindices)
47681
>>> len(u.residues.resindices)
11302 (I used import MDAnalysis as mda; from MDAnalysis.tests import datafiles as data
u = mda.Universe(data.TPR, data.XTC) for the example above.) Suggested use (?)Currently
(Note that for traditional selections we always had What's wrong here?There are many issues with attributes and their meaning, many of which @lilyminium collected in a gallery of horrors #2362. Many of these also show up when users have issues with the PDB format and related formats, for instance
I think the main problem is that we do not properly distinguish between data that we associate unchanged with particles (I look at them as "tags") and data that MDAnalysis maintains in a consistent form. The latter is generated by the topology parsers and is initially based on the data in the files. But we should then decouple these two instead of doing messy things such as pretending that segid and chainID are the same thing or changing the resid to a 0-based index. Possible solution: "Tags"/Attributes and MDAnalysis indicesThe "new" topology system makes it super-easy (relatively speaking) to associate arbitrary attributes with atoms. It would be helpful to have a list of common attributes that most people expect and would like to access under the same name, regardless of file format. We would then teach our parsers to populate the "common" attributes with the closest equivalent, possibly keeping the native names in custom attributes as well, and maintain the MDAnalysis view (atomindex, residueindex, segmentindex, mass, charge, ... anything else?). Unfortunately, expectations are likely different in different domains. For biomolecular simulations something like
The "common" attributes would have keywords in the selection language. The custom attributes would just be accessible as attributes for pandas-style boolean selections. |
TPR parserAnd from @lilyminium 's summary above, I also think that the TPR parser should use 1-based resids. |
Do you mean guessed by default, or populate what you know? Because the guesses are really only pertinent for fully atomistic simulations, and can break fairly easily beyond the most common molecules. |
It sounds like there are quite a few issues with the way attributes are handled that could be resolved by making a number of common attributes consistently available. In terms of TPR resids being 1-based, we have a couple of thoughts to add. Having used gro or pdb files for most of our work, when switching to tpr files to have access to partial charges, we did not expect that the resids would change. We also did not think to check the documentation for the tpr parser as we had no reason to believe indices would be different across the two file formats. So, in essence, there is already a silent break. There may well be users out there who have not yet realised that tpr files have 0-based resids. Further, by not changing the resids to be 1-based, we will require every future user of the tpr parser to read the documentation to discover this unexpected behaviour. It therefore seems like a good idea to move to 1-based indices for tpr files. One way to do this, without requiring too many changes to current scripts, would be to require an |
I primarily mean "things that we get from the file but might have different names in different file formats" such as using the PDB ChainID as a value for the MDA segment identifier or making PDB's Guessing is always a problem. The purist's approach is of course to just not guess anything and throw an error; or in our case, do not add certain attributes or methods (e.g., no The convenience of guessing can be high, though, although this needs to be improved, and only be done by explicit user settings, see issue #2348. |
@lilyminium collected common attributes in https://github.com/MDAnalysis/UserGuide/wiki/Topology-attributes |
2.0.0 is going to be a good place to fix this issue, with the ugly user-alerting stuff in a 1.1.0 release. My view after rereading the whole discussion is:
We should revisit the purpose of |
^ In current |
So, scrap the suggestion we could repurpose resnum. Perhaps just ditch it – we can deprecated it in a 1.1. |
I'm modifying the milestone here to 1.0.2 so that we can make a decision on whether we want to go ahead with the deprecation for this now so that we can fix this in 2.0. We need to agree on this soon as we'd like to get 1.0.2 out ASAP. |
I am in favor of
EDIT: Let's wrap up MDA 1.x with this release. If we add kwargs to Universe then according to semver we need to call it 1.1.0, though. |
(The purist in me still does not particularly like that we are changing resid from what's inside the file. However, I feel that resid is so commonly used by users without thinking that it's an internal attribute that we have to make the change. If we want to keep track of internal values then we should come up with a bunch of attributes that make this clear, e.g. |
I think many users don't expect this because a) tpr files are not human readable and b) gro files, also supported by gromacs and human readable, index the exact same system from 1. In favour of supporting indexing from 1, but maybe printing warnings that we'll do so in 2.0 in 1.0.2, and printing warnings in 2.0 that we have now done this. Edit: ... as you already said in #2364 (comment) |
|
Also, re: deprecating |
I like the |
To add to this whole complexity. The residues id in the gromacs gro file could start from a number that is not one. |
- Fixes #2364 - Master version of #3152. - Changes made in this Pull Request: - Add `tpr_resid_from_one` keyword to TPRParser - Set default to `False`, keeping current status quo - Emits a warning for `tpr_resid_from_one=False` indicating that default behaviour will change in 2.0 - add tests - update CHANGELOG
Expected behavior
My expected behaviour if is that if I select a residue in
gmx make_ndx
by residue number 1, and I select a residue withu.select_atoms('resid 1')
, they give me the same residue.Actual behavior
I would get different residues, because GROMACS indexes from 1, but TPRParser indexes from 0. TPR files may index from 0, but they are typically created from human-readable .top or .itp files that are indexed from 1.
Code to reproduce the behavior
Currently version of MDAnalysis
python -c "import MDAnalysis as mda; print(mda.__version__)"
) 0.20.1python -V
)? 3.7.3The text was updated successfully, but these errors were encountered: