Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ordering requirement for the individuals table #1774

Closed
bhaller opened this issue Oct 8, 2021 · 16 comments
Closed

Remove ordering requirement for the individuals table #1774

bhaller opened this issue Oct 8, 2021 · 16 comments
Assignees
Labels
C API Issue is about the C API
Milestone

Comments

@bhaller
Copy link

bhaller commented Oct 8, 2021

In #1138 it was decided to require an ordering for the individuals table: children must come after their individual parents. This has turned out to be a problem for SLiM, because it has its own ordering requirements for the individuals table which conflict with this policy. This was not noticed until now because SLiM wasn't using the parents column; but now that it is putting pedigree-based info in the parents column, now we have a conflict between the two ordering policies. Further discussion of that can be found at MesserLab/SLiM#231.

Discussion on Slack indicates that nobody is actually using this ordering requirement for anything, and that the best solution might be to simply remove tskit's ordering requirement. Quoting @jeromekelleher throughout:

Short answer, "no" we're not doing anything with it, and actually, I was wondering if we had got it wrong too. It actually feels quite unnatural when you start working with it, especially if you were to simulate a pedigree in the correct time direction (i.e., backwards 😉). You'd end up having to reverse the individual table, or else buffer all your individuals and write them in afterwards.

Are you talking about removing the sortedness requirement entirely, or reversing to children-before-parents?

I'm open to either.

...

To be honest I haven't found any advantage to having the sort order so far, it's basically just gotten in the way. Any time you try to rely on it, you end up just sorting by time in any case because weird corner cases crop up with ordering of edges otherwise.

Given that the number of individuals will typically be relatively small, that sort of thing isn't a big deal, so I'd not be particularly upset if we ditched the ordering requirement.

So. Any objections to simply removing this sorting requirement? @petrelharp @benjeffery not sure who else to tag...

@gregorgorjanc
Copy link
Member

When building a pedigree-based relatedness matrix, outwith tskit, we want parents before progeny so that we can process pedigree records forward in time and accumulate relatedness coefficients appropriately. But one can do topological sort of the pedigree graph (DAG) to achieve this if needed.

@petrelharp
Copy link
Contributor

When building a pedigree-based relatedness matrix, outwith tskit, we want parents before progeny so that we can process pedigree records forward in time and accumulate relatedness coefficients appropriately. But one can do topological sort of the pedigree graph (DAG) to achieve this if needed.

Well, it'd be a shame to put @benjeffery's nice code that does the sorting to waste, so we could still provide the functionality (e.g. tables.sort(individual_parent_sort=False)).

@petrelharp
Copy link
Contributor

But I think if it's getting in the way we should definitely remove it.

@bhaller
Copy link
Author

bhaller commented Oct 9, 2021

Hey campers. FYI, it would be good to know the decision on this one way or the other; some work on the SLiM side is waiting to know which way this will go. Implementation is no rush, but a thumbs-up/thumbs-down on the proposal would be helpful. :-> Thanks!

@jeromekelleher
Copy link
Member

On balance I'm in favour of removing it. Here's my proposal:

  • Remove the individual table sorting entirely from TableCollection.sort(), and create a new method IndividualTable.topological_sort(). My experience with the topological sort was that it's too weak in practise for some algorithms and what you really want is a sort by time. This requires a link to the node table and some policy about what the time of an individual is. So, perhaps one day we'll implement TableCollection.sort_individuals(), which does this time-sort but for now, we don't need it. I think we can accept that most tree sequences won't need any sortedness requirements on the individuals, and we'll just live with the consequences later if it turns out there are operations that do need it.
  • Pull the logic for checking individual table sortedness into a method IndividualTable.is_topologically_sorted() (or a better name?)

So we basically let go of the idea that individual sortedness of any kind is part of the tree sequence requirements, and pull the current functionality out into methods of the individual table.

Any objections @benjeffery @molpopgen?

@bhaller
Copy link
Author

bhaller commented Oct 10, 2021

Note that @molpopgen wrote on Slack: "We don't use this ordering requirement. When parent tracking is happening, it is buffered internally and the export to tskit is dealt with at the very end." So he seems fine with it.

@benjeffery
Copy link
Member

Thanks for the discussion all, seems that removing the requirement is the right thing to do. Putting the sorting in a separate function as @jeromekelleher and @petrelharp suggests would be good, there's no point throwing away that code.

@benjeffery benjeffery changed the title remove ordering requirement for the individuals table Remove ordering requirement for the individuals table Oct 11, 2021
@benjeffery benjeffery added the C API Issue is about the C API label Oct 11, 2021
@benjeffery benjeffery added this to the C API 1.0.0 milestone Oct 11, 2021
@benjeffery
Copy link
Member

Hey campers. FYI, it would be good to know the decision on this one way or the other

This was posted at 8pm Friday my time! Give us a chance! 😂

@bhaller
Copy link
Author

bhaller commented Oct 11, 2021

Hey campers. FYI, it would be good to know the decision on this one way or the other

This was posted at 8pm Friday my time! Give us a chance! 😂

Ah, yes, this "weekend" thing. I've heard of that. :->

No worries, just trying to keep the ball rolling. :-> Thanks!

@molpopgen
Copy link
Member

The sorting requirement should be removed.

I don't check github notifications on weekends...

@petrelharp
Copy link
Contributor

Sounds like we have a decision!

@benjeffery
Copy link
Member

Cool, I can make these changes seeing it was my code in the first place.

@benjeffery benjeffery self-assigned this Oct 13, 2021
@benjeffery
Copy link
Member

I'm assuming we still want to use this ordering for tsk_table_collection_canonicalise?

@jeromekelleher
Copy link
Member

Yes, I think so. Although it's probably too weak an ordering to actually be canonical, it's a good start and there's no good motivation to change the definition of tsk_table_collection_canonicalise here.

@jeromekelleher
Copy link
Member

Can we close this now @benjeffery?

@benjeffery
Copy link
Member

Sorry, so used to the github auto-close!

hyanwong added a commit to hyanwong/tskit that referenced this issue Sep 5, 2022
We decided in tskit-dev#1774 to remove the requirement, but the docs aren't all updated.
mergify bot pushed a commit that referenced this issue Sep 5, 2022
We decided in #1774 to remove the requirement, but the docs aren't all updated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API
Projects
None yet
Development

No branches or pull requests

6 participants