Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Virtual root in trees #1704

Merged
merged 8 commits into from
Oct 18, 2021
Merged

Virtual root in trees #1704

merged 8 commits into from
Oct 18, 2021

Conversation

jeromekelleher
Copy link
Member

See #1691

Still a WIP, but it definitely works so I think we should do it. The code is so much cleaner this way.

@codecov
Copy link

codecov bot commented Sep 15, 2021

Codecov Report

Merging #1704 (a384c3e) into main (c916b2d) will increase coverage by 0.01%.
The diff coverage is 97.93%.

❗ Current head a384c3e differs from pull request most recent head a882a5d. Consider uploading reports for the commit a882a5d to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1704      +/-   ##
==========================================
+ Coverage   93.38%   93.40%   +0.01%     
==========================================
  Files          27       27              
  Lines       24571    24672     +101     
  Branches     1100     1090      -10     
==========================================
+ Hits        22945    23044      +99     
- Misses       1591     1593       +2     
  Partials       35       35              
Flag Coverage Δ
c-tests 92.15% <98.15%> (+0.05%) ⬆️
lwt-tests 89.14% <ø> (ø)
python-c-tests 94.52% <97.54%> (-0.01%) ⬇️
python-tests 98.74% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/_tskitmodule.c 91.63% <94.82%> (+0.01%) ⬆️
c/tskit/trees.c 95.03% <98.15%> (+0.14%) ⬆️
python/tskit/trees.py 97.83% <100.00%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c916b2d...a882a5d. Read the comment docs.

@jeromekelleher
Copy link
Member Author

Some benchmarks:

import msprime
import collections
import time
import tskit
import sys
import pandas as pd
import numpy as np


def benchmark_tree_ops(name, ts):

    orders = [
        "preorder",
        "inorder",
        "postorder",
        "levelorder",
        "breadthfirst",
        "timeasc",
        "timedesc",
        "minlex_postorder",
    ]
    data = []

    max_trees = 100
    for ordering in orders:
        times = np.zeros(min(max_trees, ts.num_trees))
        for tree in ts.trees():
            if tree.index == max_trees:
                break
            before = time.perf_counter()
            # consume the iterator as quickly as possible
            iterator = tree.nodes(order=ordering)
            collections.deque(iterator, maxlen=0)
            times[tree.index] = time.perf_counter() - before
        data.append({"name": name, "order": ordering, "time": np.mean(times)})
        df = pd.DataFrame(data)
    print(df)


ts = msprime.sim_ancestry(100000, random_seed=32)
benchmark_tree_ops("big_tree", ts)

ts = msprime.sim_ancestry(1000, sequence_length=1e11, recombination_rate=1e-8,
        random_seed=42)
# print(ts.num_trees)
benchmark_tree_ops("many_trees", ts)

With current head of git we get:

      name             order      time
0  big_tree          preorder  0.110706
1  big_tree           inorder  0.475789
2  big_tree         postorder  0.202509
3  big_tree        levelorder  0.101380
4  big_tree      breadthfirst  0.096691
5  big_tree           timeasc  0.547786
6  big_tree          timedesc  0.547738
7  big_tree  minlex_postorder  0.710262
         name             order      time
0  many_trees          preorder  0.001068
1  many_trees           inorder  0.003650
2  many_trees         postorder  0.001928
3  many_trees        levelorder  0.000914
4  many_trees      breadthfirst  0.000900
5  many_trees           timeasc  0.002935
6  many_trees          timedesc  0.002906
7  many_trees  minlex_postorder  0.006282

with this branch we have

       name             order      time
0  big_tree          preorder  0.024969
1  big_tree           inorder  0.466075
2  big_tree         postorder  0.026851
3  big_tree        levelorder  0.092523
4  big_tree      breadthfirst  0.093192
5  big_tree           timeasc  0.057754
6  big_tree          timedesc  0.060604
7  big_tree  minlex_postorder  0.591150
         name             order      time
0  many_trees          preorder  0.000240
1  many_trees           inorder  0.003583
2  many_trees         postorder  0.000237
3  many_trees        levelorder  0.000847
4  many_trees      breadthfirst  0.000829
5  many_trees           timeasc  0.001475
6  many_trees          timedesc  0.001706
7  many_trees  minlex_postorder  0.005356

So we have roughly an order magnitude traversal perf improvement for preorder, postorder and timeasc and timedesc. We could try to measure the extra memory usage involved in creating the numpy arrays, but I'd be astonished if it was significant.

@jeromekelleher
Copy link
Member Author

Notes:

  • We're currently defining the time of the virtual root as +inf because this is handy for defining its position in the timeasc (and timedesc) traversal orders. However, this is sort-of inconsistent with our definition of the branch length over a root being zero. I guess the alternative would be to make the time of the virtual root equal to nextafter(max(tree.time(root) for root in tree.roots))). This is messy and potentially expensive though, and much less easy to state. I guess we can resolve this by saying the virtual root isn't a parent of the root nodes (it's not, parent[root] == NULL), and so it's logical enough to say that the branch length of any node that doesn't have a parent is 0.
  • The total branch length can be computed more efficiently now: see Implement total_branch_length using numpy ops? #1794

@jeromekelleher jeromekelleher force-pushed the virtual-root branch 2 times, most recently from 12e1ae5 to a2faa9a Compare October 14, 2021 11:31
@jeromekelleher jeromekelleher marked this pull request as ready for review October 14, 2021 11:31
@jeromekelleher
Copy link
Member Author

This is ready for review, and has some big changes so it would be good to get some eyes on it. Hopefully the doc updates will explain what's going on. There's still some stuff that needs tidying up on the C side before that gets documented, but I'd like to merge this much first so it doesn't get any bigger.

Pinging @benjeffery @petrelharp @molpopgen

@benjeffery
Copy link
Member

Some benchmarks:

Oh wow. Awesome!

I'm in the weeds with vargen, will take a look over this later today.

@molpopgen
Copy link
Member

I am writing quiz questions on bacterial genetics today--back tomorrow! Ping me if I seem to have disappeared...

@jeromekelleher
Copy link
Member Author

Bumping this one - @benjeffery any chance you could take a look through please? Good to get this diff pushed through.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! This is amazing. Sorry this took a while to review and check I understand what is going on. Annoyingly I couldn't find anything to comment on.

@benjeffery benjeffery added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Oct 18, 2021
Closes #1670

---

Minimal C changes to implement virtual root support

Includes the minimal changes to get the Python tests passing as well.

Skipping viz tests for now, as the rendering is arbitrary.

Add virtual roots to the quintuply linked tree arrays.

Test properties of the virtual_root
Change timeasc and timedesc to keep the tree ordering within a
timeslice rather than sorting by node ID (potentially breaking)
Closes tskit-dev#1776

Closes tskit-dev#1725
Keep a track of the number of edges that are used to build the topology
of the tree and document in Python and C interfaces.
@mergify mergify bot merged commit da1ff4d into tskit-dev:main Oct 18, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Oct 18, 2021
@jeromekelleher jeromekelleher deleted the virtual-root branch October 18, 2021 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants