Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add parents column to .trees output #231

Closed
wants to merge 2 commits into from
Closed

add parents column to .trees output #231

wants to merge 2 commits into from

Conversation

bhaller
Copy link
Contributor

@bhaller bhaller commented Oct 8, 2021

Fixes #152. Here's an attempt at adding a parents column to .trees output. It also changes the usage pattern for the pedigree id to tskit id hash table from (old) being created internally when individuals are added, temporarily, to (new) being kept permanently, which greatly speeds up models that remember a lot of individuals. The same sort of pedigree id to tskit id hash table is also created to populate the parents column, so the ability to build and use such hash tables has been generalized and is now used in several spots.

@petrelharp, I'd appreciate a really careful, thorough review of this PR. I'm munging around with tskit's tables in several spots, and I'm not sure I'm doing it quite right; I'm a bit out of my depth. Please pay very close attention to that code, such as where the parents column and parents offsets get created and put into the individuals table. I've run this so far with a couple of test models; now I'll kick off my full test suite, which takes quite a while to run with tree-seq and crosschecks enabled on all the tests. If you could run your test suite too, that would be great. It would also be good if you added a couple of new tests that check that the parents column actually gets added now, has sensical contents, etc.; I've done some ad hoc tests of it, but my test suite doesn't really test it thoroughly since my test suite doesn't involve Python or pyslim. :->

I imagine some changes will be needed in pyslim as a result of this. The parent pedigree IDs in the metadata should be vended by pyslim, of course. The file version number has been incremented; the old file version doesn't have the parent pedigree IDs in the individuals metadata, the new file version does. The metadata schema for the individuals table changed (please proofread what I did, I just typed the new schema into the header file and it might have typos). Maybe that's it?

Thanks! This should be a nice feature for folks.

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Looks like some of the tests that GitHub Actions runs are failing. I'm not going to attempt to diagnose that now, since I'm unfamiliar with them and probably the tests themselves need to be updated...

@petrelharp
Copy link
Collaborator

I'll have a look!

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Some tests in my test suite are failing; not sure why. The errors are all some variant of this:

_InstantiateSLiMObjectsFromTables tsk_treeseq_init(): Individuals must be provided in an order where children are after their parent individuals
Error on script line 71, character 6:
      sim.readFromPopulationFile(“/tmp/slim_trees_test.trees”);
          ^^^^^^^^^^^^^^^^^^^^^^

So there's some issue with the ordering of the individuals table. So... maybe there's a problem with ReorderIndividualTable()? I wonder whether the way we sort it now needs to be additionally constrained by the parents column or something? I have frankly lost the thread of how all the individuals table sorting works, so I hope you have insight on this @petrelharp. Since we do diligently reorder the individuals table at both save and load, it does seem like this error clearly indicates that the way in which we are doing that reordering violates current tskit policy.

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Unrelated to the test fail: it occurs to me that the way things are done in this PR in WriteTreeSequence() could probably be rearranged a bit. Right now we build a pedigree-to-tskid hash table once, post-simplify, and use it to power AddCurrentGenerationToIndividuals(); then we do ReorderIndividualTable(), which changes the individuals table ordering and thus invalidates that hash table; and then we build another hash table to power AddParentsColumnForOutput(). A possible redesign would be: move AddParentsColumnForOutput() above ReorderIndividualTable(), allowing it to use the same hash table as AddCurrentGenerationToIndividuals() used, and then, in ReorderIndividualTable(), use its inverse_map to fix the parents column up. This would save the overhead of building the second hash table, but at the cost of having to fix up the parents column instead. I suspect it would be faster, for models that remember a large number of individuals (models that don't remember many individuals will be largely unaffected by all of this anyway), but it's a more roundabout design – deliberately making the parents column wrong and then fixing it, rather than making it right to begin with. So... I'm not sure, so I'm open to opinions. :->

@petrelharp
Copy link
Collaborator

_InstantiateSLiMObjectsFromTables tsk_treeseq_init(): Individuals must be provided in an order where children are after their parent individuals

Oh, darn it. In tskit-dev/tskit#1138 we decided to require individuals come after their individual parents. I should have realized at the time, but this might conflict with how we order the individual tables. ReorderIndividualTable puts all the alive individuals first, in the order that they occur in SLiM's internal state, so this is preserved on reading back in. But if an individual's child is Remembered, but no longer alive, it'll then appear later in the table. I should have realized this would be a problem.

One solution would be to back out that requirement in tskit. But we've released it, and it seems useful, so we'd rather not do that. Alternatively:

  • maybe we could adjust ``ReorderIndividualTableso that the Alive individuals come in the correct order, just not the firstn` rows? We've got the `Alive` flag, so we know which ones are alive. This would be possible as long as it's always the case that the position in SLiM's internal state of a parent is always earlier than that of any of their alive children. Is that true?
  • or - mabye better - does the ordering in SLiM's internal state always match the SLiM ID, which we could pull out of metadata? Then we wouldn't have to mess around with reordering the individual table at all!

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Oh, darn it. In tskit-dev/tskit#1138 we decided to require individuals come after their individual parents..

One solution would be to back out that requirement in tskit. But we've released it, and it seems useful, so we'd rather not do that.

OK. Out of curiosity: why is it useful? Makes implementation of some algorithms simpler, or something?

Alternatively:

  • maybe we could adjust ``ReorderIndividualTableso that the Alive individuals come in the correct order, just not the firstn` rows? We've got the `Alive` flag, so we know which ones are alive. This would be possible as long as it's always the case that the position in SLiM's internal state of a parent is always earlier than that of any of their alive children. Is that true?

  • or - mabye better - does the ordering in SLiM's internal state always match the SLiM ID, which we could pull out of metadata? Then we wouldn't have to mess around with reordering the individual table at all!

No, I don't think either of these things is true in SLiM. The internal order of individuals is contingent upon past operations, and cannot be reconstructed from other state, nor does it reflect that other state. (For example, when an individual dies it gets removed from the vector of individuals, so to avoid having to shift all other individuals down by one position in the vector, the individual at the end of the vector gets moved back to fill the empty position – thereby breaking any sort of relationship between the ordering and anything else.)

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Pondering this, I see only two alternatives. Either tskit does not require any specific ordering of the individuals table, allowing us to write out the table in whatever order we wish; or it does require a specific ordering, in which case we have to add a "slim_index" field to the individual metadata, so that we can sort the individuals table to satisfy tskit when we write, and restore the original order when we read the .trees back in. Since there is no relationship between SLiM index and any other property, I see no middle ground here. Adding such a field to the metadata is not the end of the world, of course; it just makes the metadata that much bigger (it's presently 40 bytes per individual, this would make it 44). It's also kind of weird because that metadata field would only be used/meaningful for individuals that are still alive at the save point; for all other remembered individuals, it would be a pure waste of space. Also, I really don't relish trying to rewrite the individuals table read/write code to work this new way, while also preserving the ability to read in old files that work the old way – ugh. I think if we did shift over to this new scheme we might want to just declare a break with backward compatibility: pre-3.7 .trees files cannot be read with SLiM 3.7.

I think my vote would be for tskit to remove this requirement; it seems like if a given algorithm really needs the table to be ordered in this fashion, it can quite easily sort the table as a first step, no? Seems vastly easier than dealing with the fallout on the SLiM side.

@petrelharp
Copy link
Collaborator

OK. Out of curiosity: why is it useful? Makes implementation of some algorithms simpler, or something?

Right - although I'm not sure if we have any such algorithms yet (although maybe Jerome's new simulation in pedigrees uses it).

@petrelharp
Copy link
Collaborator

Here's another way around this: put a vector of SLiM IDs in the top-level metadata that says what order they're in. This would actually take less memory than adding it to metadata (since we wouldn't have the empty slot for non-alive individuals), and would still be backwards-compatible.

Recall also that this ordering thing doesn't affect correctness of the simulation, it just gives us the nice property that when we reload a file we get back to exactly the same state as before (in the sense of random-seed-equivalent); if the alive individuals are out of order then it'd be an equivalent state.

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Here's another way around this: put a vector of SLiM IDs in the top-level metadata that says what order they're in. This would actually take less memory than adding it to metadata (since we wouldn't have the empty slot for non-alive individuals), and would still be backwards-compatible.

I like that idea, I think. So, inside SLiMSim::WriteTreeSequenceMetadata() I'd add a new key, metadata["SLiM"]["individual_order"], and its value would be a vector of integers? Can you sketch out what that would look like in terms of modifying the schema, etc.? It could be considered optional (SLiM would always write it, but pyslim could leave it out, etc.). Do we have an example somewhere in the code already of a metadata key whose value is a vector of integers, that I could crib from? I'm not really sure how to write it out and read it in. I think I know how to reorder the table on read-in to match the metadata spec, and how to generate the correct ordering info on write. To reorder the individuals table on write in a way that makes tskit happy, is there a standard tskit function to do that?

Recall also that this ordering thing doesn't affect correctness of the simulation, it just gives us the nice property that when we reload a file we get back to exactly the same state as before (in the sense of random-seed-equivalent); if the alive individuals are out of order then it'd be an equivalent state.

That is technically true, although I have a whole ton of tests that rely on this fact to work, so it's not really just a frill. I think preserving this reproducibility ought to be regarded as an absolute stake; it is far too useful to give up on.

OK, so. If this is the path forward (I'm still unconvinced that simply removing this constraint from tskit is not the best option), then there's no point in you reviewing these diffs right now. If you confirm that you think this is really better than the tskit change, and give me a few pointers on the questions above, I'll go back to the drawing board. :->

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

Ah, I now see the thread on Slack, though. So maybe we can just ditch this requirement.

@bhaller
Copy link
Contributor Author

bhaller commented Oct 8, 2021

I have just pushed a commit that optimistically removes the tskit check, since it seems quite unlikely that anybody will object to that change. (Note that GitHub Actions tests will continue to fail if they check the tables in Python, since the tskit being used at that point is not patched.) I'll resume running my test suite on this PR now. @petrelharp, if you want to review the diffs now I think that would be useful, but if you want to wait and see what happens with the tskit issue that's understandable. :->

@petrelharp
Copy link
Collaborator

I'll wait a minute to see. =)

@bhaller
Copy link
Contributor Author

bhaller commented Oct 11, 2021

I'll wait a minute to see. =)

Looks like they will indeed remove the individuals table ordering requirement, so I think this PR is now ready to review! :->

@@ -7978,6 +8038,31 @@ void SLiMSim::CrosscheckTreeSeqIntegrity(void)
if (ret != 0) handle_error("CrosscheckTreeSeqIntegrity tsk_table_collection_free()", ret);
free(tables_copy);
}

// check that tabled_individuals_hash_ is the right size and has all the right entries
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh - the hash isn't being maintained at all times, is it? And if not, I don't think we want to check it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, it is? But it's been a while since I looked at this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not rebuilt after simplification, I think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But even if it happens to be, I think it's built when we need it, so we shouldn't assume it is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'll check on this tomorrow, I need to review the diffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so. tabled_individuals_hash_ is kept correct at all times, and so it's correct that crosscheck verifies that it matches the individuals table; it should always match, and this crosscheck passes with every model in my test suite. tabled_individuals_hash_ is rebuilt after simplification with the call BuildTabledIndividualsHash(&tables_, &tabled_individuals_hash_); at line 5448. In general, we don't build it when we need it, we keep it up to date; for models that frequently remember individuals, that is much much faster since we need a valid hash table every time we add new individuals to be table. (We used to build a new table whenever we added individuals to the table, on demand, but now we keep it up to date, resulting in more than an order of magnitude speedup for models that do a lot of remembering.)

We do call BuildTabledIndividualsHash() to build a table on demand when we output, because we reorder the individuals table when we output; but that hash table is just made locally and thrown away when we're done saving. That is separate from tabled_individuals_hash_, although BuildTabledIndividualsHash() is used for both purposes just to share code.

@petrelharp
Copy link
Collaborator

One minor query - otherwise, no issues spotted!

@bhaller
Copy link
Contributor Author

bhaller commented Oct 19, 2021

@petrelharp replied above. If that clarifies things and you're happy, give a thumbs up and I'll merge. However, this PR branch is now several commits behind master, and I'm not sure what to do about that...? @rdinnager was having no end of trouble dealing with that with his Windows PR; he tried to merge in changes from master, but then all of the diffs for those changes from master showed up as diffs in his PR too, which didn't seem right at all, so he ended up just creating a new branch and copying his changes into it to make git happy. I recall Boyana's student, whose name I'm blocking on, had endless troubles with that too. If you could illuminate the right way to deal with this situation, that would be very helpful. :->

@rdinnager
Copy link
Contributor

Yes it was a headache. The underlying issue is discussed here: isaacs/github#750 . There doesn't appear to be any plan to fix it. I think doing a full rebase eventually worked for me, though I tend to be a bit scared of rebasing so that is why I didn't try it initially, instead preferring to merge upstream changes. But merging leads to the issue mentioned in that link, e.g. diffs don't get updated when the base branch of a PR changes. Some people suggested changing the base branch to a random branch, then changing it back, but this actually didn't work for me either. Would love to hear if you happen to know how to deal with this too @petrelharp .

@petrelharp
Copy link
Collaborator

I'll give a rebase a try!

@petrelharp
Copy link
Collaborator

All I did was

git fetch origin
git checkout -b parents_column origin/parents_column
git rebase -i origin/master
# <fix up conflict in VERSIONS>
git add VERSIONS
git rebase --continue

... and the result is this branch that seems just fine - i.e., the rebase went cleanly. I don't think I have push access to this PR; so @bhaller I recommend just doing what I did above and let me know if you have issues? If you really want to get my branch do

git remote add peter git@github.com:petrelharp/SLiM
git fetch peter
git checkout -b new_parents_column peter/parents_column
# <investigate my code if you want>
git checkout parents_column
# save a backup
git checkout -b old_parents_column
git checkout parents_column
git reset --hard new_parents_column
# send the changes back to github
git push -f

@bhaller
Copy link
Contributor Author

bhaller commented Oct 20, 2021

OK, I tried that. Here's the situation before and after the first rebase command:

bhaller@lanois SLiM % git status
On branch parents_column
Your branch is up to date with 'origin/parents_column'.

nothing to commit, working tree clean
bhaller@lanois SLiM % git rebase -i origin/master
Note: switching to '14b7763400c2a8727dd5ac9fc9e9e10daf65ce67'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 14b77634 add Image(numeric matrix) constructor

This repository is configured for Git LFS but 'git-lfs' was not found on your path. If you no longer wish to use Git LFS, remove this hook by deleting .git/hooks/post-checkout.

error: could not detach HEAD
bhaller@lanois SLiM % git status
HEAD detached from refs/heads/parents_column
nothing to commit, working tree clean

I don't know what it means by This repository is configured for Git LFS but 'git-lfs' was not found on your path..., but the main thing that worries me is error: could not detach HEAD. And yet git status seems to indicate that it did get detached. Do you know what's going on here?

@bhaller
Copy link
Contributor Author

bhaller commented Oct 20, 2021

None of the changes from the parents_column branch appear to be present in my repo now. No idea what state my repo is in or why this happened. There is no conflict in VERSIONS to resolve, since none of the branch changes are there. Aargh. This is why I dislike git. Every... single... time... that I try to do something more advanced than simply committing a change on the master branch, I get tied up in knots. :-<

@bhaller
Copy link
Contributor Author

bhaller commented Oct 20, 2021

I'm merging in the diffs from this PR by hand into master, will commit soon. Sigh.

@bhaller
Copy link
Contributor Author

bhaller commented Oct 20, 2021

OK, diffs merged into master by hand (e0abc33); closing this PR since it will go unmerged.

@bhaller bhaller closed this Oct 20, 2021
@bhaller bhaller deleted the parents_column branch October 20, 2021 01:47
@petrelharp
Copy link
Collaborator

The problem probably was the post-checkout hook mentioned in the error message:

This repository is configured for Git LFS but 'git-lfs' was not found on your path. If you no longer wish to use Git LFS, remove this hook by deleting .git/hooks/post-checkout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should write our pedigree data to .trees files
3 participants