add parents column to .trees output #231
Looks like some of the tests that GitHub Actions runs are failing. I'm not going to attempt to diagnose that now, since I'm unfamiliar with them and probably the tests themselves need to be updated... |
I'll have a look! |
Some tests in my test suite are failing; not sure why. The errors are all some variant of this:
So there's some issue with the ordering of the individuals table. So... maybe there's a problem with |
Unrelated to the test fail: it occurs to me that the way things are done in this PR in |
Oh, darn it. In tskit-dev/tskit#1138 we decided to require individuals come after their individual parents. I should have realized at the time, but this might conflict with how we order the individual tables. ReorderIndividualTable puts all the alive individuals first, in the order that they occur in SLiM's internal state, so this is preserved on reading back in. But if an individual's child is Remembered, but no longer alive, it'll then appear later in the table. I should have realized this would be a problem. One solution would be to back out that requirement in tskit. But we've released it, and it seems useful, so we'd rather not do that. Alternatively:
|
OK. Out of curiosity: why is it useful? Makes implementation of some algorithms simpler, or something?
No, I don't think either of these things is true in SLiM. The internal order of individuals is contingent upon past operations, and cannot be reconstructed from other state, nor does it reflect that other state. (For example, when an individual dies it gets removed from the vector of individuals, so to avoid having to shift all other individuals down by one position in the vector, the individual at the end of the vector gets moved back to fill the empty position – thereby breaking any sort of relationship between the ordering and anything else.) |
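For readers unfamiliar with the trick described above, here is a minimal sketch of swap-with-last removal. It is a plain Python illustration, not SLiM's code; the function name `swap_remove` and the list of letters are made up for the example.

```python
def swap_remove(individuals, index):
    """Remove individuals[index] in O(1) by moving the last element into its slot.

    Hypothetical helper for illustration only; SLiM's real container is C++.
    """
    last = individuals.pop()          # take the final element off the end
    if index < len(individuals):      # unless the removed slot *was* the end...
        individuals[index] = last     # ...fill the hole with the former last element
    return individuals


population = ["A", "B", "C", "D", "E"]
swap_remove(population, 1)            # "B" dies; "E" jumps into position 1
```

After the call, "E" sits ahead of "C" and "D", so the resulting order depends on the history of deaths rather than on any property of the individuals themselves.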
Pondering this, I see only two alternatives. Either tskit does not require any specific ordering of the individuals table, allowing us to write out the table in whatever order we wish; or it does require a specific ordering, in which case we have to add a "slim_index" field to the individual metadata, so that we can sort the individuals table to satisfy tskit when we write, and restore the original order when we read the .trees back in. Since there is no relationship between SLiM index and any other property, I see no middle ground here.
Adding such a field to the metadata is not the end of the world, of course; it just makes the metadata that much bigger (it's presently 40 bytes per individual, this would make it 44). It's also kind of weird because that metadata field would only be used/meaningful for individuals that are still alive at the save point; for all other remembered individuals, it would be a pure waste of space. Also, I really don't relish trying to rewrite the individuals table read/write code to work this new way, while also preserving the ability to read in old files that work the old way – ugh. I think if we did shift over to this new scheme we might want to just declare a break with backward compatibility: pre-3.7 .trees files cannot be read with SLiM 3.7.
I think my vote would be for tskit to remove this requirement; it seems like if a given algorithm really needs the table to be ordered in this fashion, it can quite easily sort the table as a first step, no? Seems vastly easier than dealing with the fallout on the SLiM side. |
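As a rough illustration of "sort the table as a first step", the sketch below computes a parents-before-children ordering with a standard topological sort (Kahn's algorithm). It works on plain Python lists rather than a real tskit table, and the function name `parents_first_order` and the convention that -1 means "parent not in the table" are assumptions made for this example.

```python
from collections import deque


def parents_first_order(parents):
    """Return old row indices ordered so every parent precedes its children.

    parents[i] lists the row indices of i's parents; -1 marks an unknown parent.
    Hypothetical helper: a generic topological sort, not a tskit function.
    """
    n = len(parents)
    children = [[] for _ in range(n)]
    pending = [0] * n                      # number of not-yet-emitted parents per row
    for child, row_parents in enumerate(parents):
        for p in row_parents:
            if p != -1:
                children[p].append(child)
                pending[child] += 1

    ready = deque(i for i in range(n) if pending[i] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for c in children[i]:
            pending[c] -= 1
            if pending[c] == 0:            # all of c's parents have been emitted
                ready.append(c)

    if len(order) != n:
        raise ValueError("cycle in parent relationships")
    return order


# Row 0 is a child of rows 2 and 3; row 3 is a child of rows 1 and 2.
order = parents_first_order([[2, 3], [-1, -1], [-1, -1], [1, 2]])
```

An algorithm that needs this ordering could apply `order` to its own copy of the table, leaving the on-disk order untouched.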
Right - although I'm not sure if we have any such algorithms yet (although maybe Jerome's new simulation in pedigrees uses it). |
Here's another way around this: put a vector of SLiM IDs in the top-level metadata that says what order they're in. This would actually take less memory than adding it to metadata (since we wouldn't have the empty slot for non-alive individuals), and would still be backwards-compatible. Recall also that this ordering thing doesn't affect correctness of the simulation, it just gives us the nice property that when we reload a file we get back to exactly the same state as before (in the sense of random-seed-equivalent); if the alive individuals are out of order then it'd be an equivalent state. |
I like that idea, I think. So, inside SLiMSim::WriteTreeSequenceMetadata() I'd add a new key, metadata["SLiM"]["individual_order"], and its value would be a vector of integers? Can you sketch out what that would look like in terms of modifying the schema, etc.? It could be considered optional (SLiM would always write it, but pyslim could leave it out, etc.). Do we have an example somewhere in the code already of a metadata key whose value is a vector of integers, that I could crib from? I'm not really sure how to write it out and read it in. I think I know how to reorder the table on read-in to match the metadata spec, and how to generate the correct ordering info on write. To reorder the individuals table on write in a way that makes tskit happy, is there a standard tskit function to do that?
That is technically true, although I have a whole ton of tests that rely on this fact to work, so it's not really just a frill. I think preserving this reproducibility ought to be regarded as an absolute stake; it is far too useful to give up on. OK, so. If this is the path forward (I'm still unconvinced that simply removing this constraint from tskit is not the best option), then there's no point in you reviewing these diffs right now. If you confirm that you think this is really better than the tskit change, and give me a few pointers on the questions above, I'll go back to the drawing board. :-> |
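To make the "individual_order" idea above concrete, here is a small sketch using plain dictionaries in place of the real top-level metadata. The helper names `record_order` and `restore_order` are hypothetical, and storing pedigree IDs as the vector's contents is an assumption; the thread only says it would be a vector of integers.

```python
def record_order(metadata, alive_ids_in_slim_order):
    """On write: remember the order in which SLiM held the alive individuals."""
    metadata.setdefault("SLiM", {})["individual_order"] = list(alive_ids_in_slim_order)
    return metadata


def restore_order(metadata, table_ids):
    """On read: map the recorded order back to rows of the saved table.

    table_ids holds the (assumed) pedigree ID of each table row, in file order.
    Returns the row indices of the alive individuals, in SLiM's original order.
    """
    row_of = {pid: row for row, pid in enumerate(table_ids)}
    return [row_of[pid] for pid in metadata["SLiM"]["individual_order"]]


# SLiM held the alive individuals as 17, 4, 9; the file stores rows in a
# different order, and row 1 (ID 2) is a remembered individual that is not alive.
metadata = record_order({}, [17, 4, 9])
rows = restore_order(metadata, [4, 2, 9, 17])
```

Only alive individuals appear in the vector, which is why this takes less space than a per-individual metadata field.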
Ah, I now see the thread on Slack, though. So maybe we can just ditch this requirement. |
I have just pushed a commit that optimistically removes the tskit check, since it seems quite unlikely that anybody will object to that change. (Note that GitHub Actions tests will continue to fail if they check the tables in Python, since the tskit being used at that point is not patched.) I'll resume running my test suite on this PR now. @petrelharp, if you want to review the diffs now I think that would be useful, but if you want to wait and see what happens with the tskit issue that's understandable. :-> |
I'll wait a minute to see. =) |
Looks like they will indeed remove the individuals table ordering requirement, so I think this PR is now ready to review! :-> |
@@ -7978,6 +8038,31 @@ void SLiMSim::CrosscheckTreeSeqIntegrity(void)
if (ret != 0) handle_error("CrosscheckTreeSeqIntegrity tsk_table_collection_free()", ret);
free(tables_copy);
}

// check that tabled_individuals_hash_ is the right size and has all the right entries
Huh - the hash isn't being maintained at all times, is it? And if not, I don't think we want to check it here?
IIRC, it is? But it's been a while since I looked at this.
It's not rebuilt after simplification, I think?
But even if it happens to be, I think it's built when we need it, so we shouldn't assume it is?
Hmm, I'll check on this tomorrow, I need to review the diffs.
OK, so. tabled_individuals_hash_ is kept correct at all times, and so it's correct that crosscheck verifies that it matches the individuals table; it should always match, and this crosscheck passes with every model in my test suite. tabled_individuals_hash_ is rebuilt after simplification with the call BuildTabledIndividualsHash(&tables_, &tabled_individuals_hash_); at line 5448. In general, we don't build it when we need it, we keep it up to date; for models that frequently remember individuals, that is much much faster since we need a valid hash table every time we add new individuals to the table. (We used to build a new table whenever we added individuals to the table, on demand, but now we keep it up to date, resulting in more than an order of magnitude speedup for models that do a lot of remembering.)
We do call BuildTabledIndividualsHash() to build a table on demand when we output, because we reorder the individuals table when we output; but that hash table is just made locally and thrown away when we're done saving. That is separate from tabled_individuals_hash_, although BuildTabledIndividualsHash() is used for both purposes just to share code.
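The difference between "keep it up to date" and "build it on demand" can be sketched as follows. This is a toy Python model, not the SLiM implementation: the class `IndividualsTable` and the functions `build_hash` and `crosscheck` are invented names that only loosely mirror tabled_individuals_hash_, BuildTabledIndividualsHash() and the crosscheck discussed above.

```python
class IndividualsTable:
    """Toy table: one pedigree ID per row, plus a lookup kept in sync with it."""

    def __init__(self):
        self.pedigree_ids = []   # row index -> pedigree ID
        self.row_of = {}         # pedigree ID -> row index, updated on every add

    def add(self, pedigree_id):
        """Return the row for pedigree_id, appending a new row only if needed."""
        row = self.row_of.get(pedigree_id)
        if row is None:
            row = len(self.pedigree_ids)
            self.pedigree_ids.append(pedigree_id)
            self.row_of[pedigree_id] = row   # O(1) update instead of a full rebuild
        return row


def build_hash(pedigree_ids):
    """Build a fresh lookup from scratch (the on-demand approach)."""
    return {pid: row for row, pid in enumerate(pedigree_ids)}


def crosscheck(table):
    """The maintained lookup should always equal one rebuilt from the table."""
    return table.row_of == build_hash(table.pedigree_ids)


table = IndividualsTable()
rows = [table.add(10), table.add(11), table.add(10)]   # the repeat reuses row 0
```

With the maintained lookup each `add` costs constant time, whereas rebuilding on every add costs time proportional to the table size, which is consistent with the speedup reported for models that remember many individuals.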
One minor query - otherwise, no issues spotted! |
@petrelharp replied above. If that clarifies things and you're happy, give a thumbs up and I'll merge. However, this PR branch is now several commits behind |
Yes it was a headache. The underlying issue is discussed here: isaacs/github#750. There doesn't appear to be any plan to fix it. I think doing a full rebase eventually worked for me, though I tend to be a bit scared of rebasing so that is why I didn't try it initially, instead preferring to merge upstream changes. But merging leads to the issue mentioned in that link, e.g. diffs don't get updated when the base branch of a PR changes. Some people suggested changing the base branch to a random branch, then changing it back, but this actually didn't work for me either. Would love to hear if you happen to know how to deal with this too @petrelharp. |
I'll give a rebase a try! |
All I did was
... and the result is this branch that seems just fine - i.e., the rebase went cleanly. I don't think I have push access to this PR; so @bhaller I recommend just doing what I did above and let me know if you have issues? If you really want to get my branch do
|
OK, I tried that. Here's the situation before and after the first rebase command:
I don't know what it means by |
None of the changes from the parents_column branch appear to be present in my repo now. No idea what state my repo is in or why this happened. There is no conflict in VERSIONS to resolve, since none of the branch changes are there. Aargh. This is why I dislike git. Every... single... time... that I try to do something more advanced than simply committing a change on the master branch, I get tied up in knots. :-< |
I'm merging in the diffs from this PR by hand into master, will commit soon. Sigh. |
OK, diffs merged into master by hand (e0abc33); closing this PR since it will go unmerged. |
The problem probably was the post-checkout hook mentioned in the error message:
|
Fixes #152. Here's an attempt at adding a parents column to .trees output. It also changes the usage pattern for the pedigree id to tskit id hash table: previously it was created internally, and temporarily, whenever individuals were added; now it is kept permanently, which greatly speeds up models that remember a lot of individuals. The same sort of pedigree id to tskit id hash table is also created to populate the parents column, so the ability to build and use such hash tables has been generalized and is now used in several spots.
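For reviewers who want a picture of what populating the parents column involves, here is a minimal sketch in plain Python. It assumes tskit's ragged-column layout (a flat `parents` array plus a `parents_offset` array with one more entry than there are rows) and uses -1 for a parent that is not in the table; the function name `build_parents_column` and the example IDs are hypothetical, and the actual C++ code in this PR may differ.

```python
def build_parents_column(pedigree_ids, parent_pedigree_ids):
    """Translate parent pedigree IDs into table rows, as a ragged column.

    pedigree_ids[i] is the pedigree ID of row i; parent_pedigree_ids[i] is the
    pair of parent pedigree IDs stored in row i's metadata.
    """
    row_of = {pid: row for row, pid in enumerate(pedigree_ids)}  # pedigree ID -> row
    parents, offsets = [], [0]
    for p1, p2 in parent_pedigree_ids:
        # A parent that was never remembered has no row, so it becomes -1.
        parents.extend(row_of.get(p, -1) for p in (p1, p2))
        offsets.append(len(parents))
    return parents, offsets


# Rows 0 and 1 are founders (no known parents); row 2 is their offspring.
parents, offsets = build_parents_column(
    [5, 6, 7],
    [(-1, -1), (-1, -1), (5, 6)],
)
```

Row i's parents are then `parents[offsets[i]:offsets[i + 1]]`, which is the invariant worth checking in any new tests of the column.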
@petrelharp, I'd appreciate a really careful, thorough review of this PR. I'm munging around with tskit's tables in several spots, and I'm not sure I'm doing it quite right; I'm a bit out of my depth. Please pay very close attention to that code, such as where the parents column and parents offsets get created and put into the individuals table. I've run this so far with a couple of test models; now I'll kick off my full test suite, which takes quite a while to run with tree-seq and crosschecks enabled on all the tests. If you could run your test suite too, that would be great. It would also be good if you added a couple of new tests that check that the parents column actually gets added now, has sensical contents, etc.; I've done some ad hoc tests of it, but my test suite doesn't really test it thoroughly since my test suite doesn't involve Python or pyslim. :->
I imagine some changes will be needed in pyslim as a result of this. The parent pedigree IDs in the metadata should be vended by pyslim, of course. The file version number has been incremented; the old file version doesn't have the parent pedigree IDs in the individuals metadata, the new file version does. The metadata schema for the individuals table changed (please proofread what I did, I just typed the new schema into the header file and it might have typos). Maybe that's it?
Thanks! This should be a nice feature for folks.