Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low level pair coalescence counts #2932

Merged
merged 1 commit into from
Aug 6, 2024

Conversation

nspope
Copy link
Contributor

@nspope nspope commented Apr 18, 2024

Low level extension of #2915

Copy link

codecov bot commented Apr 18, 2024

Codecov Report

Attention: Patch coverage is 95.28302% with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.65%. Comparing base (beafeba) to head (ca06cff).

Files Patch % Lines
c/tskit/trees.c 93.50% 5 Missing and 8 partials ⚠️
python/_tskitmodule.c 98.09% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2932      +/-   ##
==========================================
+ Coverage   89.62%   89.65%   +0.02%     
==========================================
  Files          29       29              
  Lines       30170    30393     +223     
  Branches     5867     5901      +34     
==========================================
+ Hits        27041    27249     +208     
- Misses       1790     1796       +6     
- Partials     1339     1348       +9     
Flag Coverage Δ
c-tests 86.31% <93.68%> (+0.10%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 88.86% <98.09%> (+0.13%) ⬆️
python-tests 99.00% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
c/tskit/core.c 95.63% <100.00%> (+0.03%) ⬆️
c/tskit/core.h 100.00% <ø> (ø)
python/tskit/trees.py 98.75% <100.00%> (-0.05%) ⬇️
python/_tskitmodule.c 88.86% <98.09%> (+0.13%) ⬆️
c/tskit/trees.c 90.55% <93.50%> (+0.13%) ⬆️

@nspope
Copy link
Contributor Author

nspope commented Apr 18, 2024

I'd like to generalize this algorithm slightly:

  • Currently, the output is either a num_windows by num_nodes array (which is very large), or a num_windows by num_time_windows array where the counts are summed within time windows.
  • Conceptually, the "nodes" output gives the empirical distribution of pair coalescence times in windows across the genome. That is, for each window we have a vector of RVs (node times) and a vector of weights (pair coalescence counts).
  • From this distributional viewpoint, there's lots of useful things that may be calculated: the empirical CDF, quantiles, moments, coalescence rates, etc. (of which the "sum in time windows" option of the current implementation is one special case)

So, I think it'd be useful to take the current algorithm, and have it apply a summary function at the end of each window. This would let us calculate any useful summary statistic without having to create a potentially humongous array (windows by nodes by indexes) as an intermediate.

The API would stay the same-- later on, we could add named methods for various summary statistics, and potentially eventually expose a "general summary stat" interface, like is done for the other statistics.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - do you want to add the summary func stuff now before we start porting to C? Probably a good idea, if you want to do this in the short term.

python/tests/test_coalrate.py Outdated Show resolved Hide resolved
@nspope nspope force-pushed the pair-coalescence-extend branch from ff871bd to 7839a94 Compare July 25, 2024 00:54
@nspope nspope marked this pull request as ready for review July 25, 2024 01:24
@nspope
Copy link
Contributor Author

nspope commented Jul 25, 2024

The docs build is failing with numpy 2 issues on import msprime, after a rebase-- is there something I can do to fix this?

@jeromekelleher
Copy link
Member

This is probably a package cache issue @benjeffery?

@nspope
Copy link
Contributor Author

nspope commented Jul 25, 2024

I bumped the cache version to no avail-- however, it looks like requirements/CI-docs/requirements.txt has msprime pinned to 1.3.1?

@nspope
Copy link
Contributor Author

nspope commented Jul 25, 2024

Yup, pinning msprime to the latest version in requirements/* seems to have done the trick.

@nspope
Copy link
Contributor Author

nspope commented Jul 26, 2024

Alright, I think this is ready for a look @petrelharp and @jeromekelleher. The prototypes for TreeSequence methods are prefixed by proto in the tests. Currently, the workhorse is _pair_coalescence_stat which applies a reduction function summary_func(nodes_weight, nodes_time, **summary_func_kwargs) at the end of each window.

I think the next steps are:

  1. Implement _pair_coalescence_stat in C, and allow it to take either (A) a custom reduction function that is a python callback; (B) a C function that also takes in a struct with extra arguments (for the kwargs).

  2. Implement the reduction functions needed to get coalescence rates in C

  3. Make a python API+tests for coalescence rates / counts (already mostly done)

  4. Make a python API for custom reduction functions (lower priority)

Does this seem about right? Alternatively I suppose we could have the reduction functions for (2) be python callbacks? But there might a performance penalty with lots of windows.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me @nspope - the algorithm seems very clean.

Regarding Python callbacks for C code, although there's some existing infrastructure for doing this, I'd leave this bit until last. The Python C API stuff is very tricky and there are footguns everywhere. Also, performance will be terrible - there's no way you can use this approach for anything other than prototyping. The summary func gets called in the deepest inner loops, and getting the Python interpreter involved there just doesn't work performance-wise.

@nspope nspope force-pushed the pair-coalescence-extend branch from ec95c97 to 722a3e4 Compare July 30, 2024 01:07
@nspope
Copy link
Contributor Author

nspope commented Jul 30, 2024

Hmm, tests are passing on Ubuntu but evidently not on MacOS or Windows. ssh'ing onto the runner, I can get

malloc: *** error for object 0x7fe5ea402a90: pointer being freed was not allocated

on MacOS, happening in one of the tsk_safe_free calls.

@jeromekelleher
Copy link
Member

I would strongly advise writing some simple C tests that exercise all reachable code paths, and running it through valgrind. This is a much more effective way of tracking down these kinds of bugs. I can help with this, if it's not clear how to start?

@nspope
Copy link
Contributor Author

nspope commented Jul 30, 2024

Makes sense @jeromekelleher, I can take a crack at it. Just so I'm understanding the scope, is the goal to recreate the entire test suite from python in c/tests (correctness tests, etc)? Or something more minimal?

@jeromekelleher
Copy link
Member

Something minimal. You're just looking to exercise all code paths so that valgrind can see them.

@nspope nspope force-pushed the pair-coalescence-extend branch from 5ad6e09 to 1aab9b5 Compare July 30, 2024 22:32
@nspope
Copy link
Contributor Author

nspope commented Jul 30, 2024

That did the trick ... is there a good way to trigger OOM errors after the mallocs, to get patterns like

if (whatever == NULL) {
  ret = TSK_ERR_NO_MEMORY;
  goto out;
}

covered? It looks like these aren't covered in the existing tests, so presumably not?

@nspope
Copy link
Contributor Author

nspope commented Jul 31, 2024

The only substantial coverage gap is in _tskitmodule.c: https://app.codecov.io/gh/tskit-dev/tskit/pull/2932/blob/python/_tskitmodule.c#L9924 where it looks like all execution paths ends in the check,

    if (TreeSequence_check_state(self) != 0) {
        goto out;
    }

however, this can't be the only execution path, because otherwise ll_tree_sequence.pair_coalescence_counts wouldn't return anything and none of the python test suite would pass.

Is there something else needed to get the CPython parts covered?

@nspope
Copy link
Contributor Author

nspope commented Jul 31, 2024

Ah, nevermind-- I see that coverage for CPython is getting read off of test_lowlevel.py

@jeromekelleher
Copy link
Member

is there a good way to trigger OOM errors after the mallocs, to get patterns like

We've been ignoring these here, as they are very tricky to trigger. We did it in the kastore C tests by running with a patched malloc which fails after n allocs, but it's a lot of work for a pretty marginal benefit.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! There's a lot going on in there, and I've made a few comments style-wise to make things a bit more idiomatic (mostly just change loop conditions to <). There's a few bugs in there that'll only show up in error conditions though, so we do need some more testing of those.

I think a separate C test case where you take a simple case, and then try to provoke all the input errors around time windows etc would be the best approach.

For the Python C API, we have to be super thorough here and make sure that any non-memory allocation failure cases are covered in test_low_level.py. I know it's a lot of boiler plate, but this is what's involved I'm afraid.

We also want to make sure that the reference counting stuff is working properly. The best way I've found of doing this is to run the code in a tight loop, and watch top.

Can you do something like this and check please?

ts = # some intermediate size example, so that loops are quick
for _ in range(1000000):
      ts.pair_coalescence_counts(...)

The memory usage should stabilise after 30 seconds or so, and then remain constant. Then, try commenting out some of the XDECREFs, and see what happens to memory usage.

c/tests/test_stats.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
python/_tskitmodule.c Outdated Show resolved Hide resolved
python/_tskitmodule.c Outdated Show resolved Hide resolved
python/_tskitmodule.c Outdated Show resolved Hide resolved
python/tests/test_lowlevel.py Outdated Show resolved Hide resolved
@nspope
Copy link
Contributor Author

nspope commented Jul 31, 2024

Thanks @jeromekelleher! All is clear to me except how to handle the special case where time windows are nodes (which is an important one). I left some comments above describing the issue. Briefly, I'm detecting the "nodes" case via a zero-length time windows array. This can't be replaced by an array of unique times in the tree sequence, because we want to preserve node identity even if its age is not unique, and also the order of nodes in the output array; and both these get lost when we sort nodes into (necessarily unique) time windows.

I suppose an alternative (to using a zero-length time windows array) is to use options to store a flag which tells us we are in nodes mode (like is done for the general stats). What do you think is the cleanest option here?

@nspope nspope force-pushed the pair-coalescence-extend branch 2 times, most recently from 7a0e1b8 to 06661b1 Compare August 1, 2024 04:21
@nspope
Copy link
Contributor Author

nspope commented Aug 1, 2024

Getting another random windows crash in the CI, but valgrind isn't showing any issues. I traced it back to commit 6287b84 which switched from using int to double, and switched to using the row access macro.

@jeromekelleher
Copy link
Member

I really would prefer if we could back out of doing the three different options here and focus on the simpler case of just a fixed set of required time windows first. There's a lot of complexity here already, and I want to make sure we catch all of the boring C level stuff before we have to think about that extra level of trickiness.

So, can we leave that bit for a follow up, and focus on getting 100% test coverage on the bits that should be covered? That means more tests in C, and breaking up the Python C tests like I suggested.

@nspope
Copy link
Contributor Author

nspope commented Aug 1, 2024

So I should pytest.mark.skip any tests using the "nodes" output for the time being? These are the bulk of the tests, which is why I tried to support that mode.

I'm working on extending the test_lowlevel case to get the CPython covered, but I think all the C code paths are now covered aside from OOM.

@nspope nspope force-pushed the pair-coalescence-extend branch from 338f3f6 to 6318fad Compare August 1, 2024 17:56
@nspope
Copy link
Contributor Author

nspope commented Aug 1, 2024

Here's memory consumption over time:

looks to me like it's stabilizing -- is this what you were after @jeromekelleher ?

Code here:

monitor mem usage with ps
#!/usr/bin/env bash

echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
  1000, sequence_length=1e5, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.array([0, np.max(ts.nodes_time)/2, np.inf])
for _ in range(100000):
  ts.pair_coalescence_counts(time_windows=time_windows)
' >del_memory_monitor.py

rm -f del_memory_monitor.log
python del_memory_monitor.py &
for i in {1..1000}; do
  ps -eo cmd,%mem,vsz,rsz | grep "python del_memory_monitor" >> del_memory_monitor.log
  sleep 0.0001
done 
rm del_memory_monitor.py

echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("del_memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.0001
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage / max(memory usage)")
plt.xlabel("Wall time (relative)")
plt.xscale("log")
plt.legend()
plt.savefig("del_memory_monitor.png")
' | python
rm del_memory_monitor.log

@jeromekelleher
Copy link
Member

Nice - did you try it with some decrefs commented out?

@nspope
Copy link
Contributor Author

nspope commented Aug 1, 2024

this is with results_array XDECREF commented out. Looks qualitatively similar ... ?

@jeromekelleher
Copy link
Member

Hmm, you need bigger examples then. You should be leaking memory heavily there.

@nspope
Copy link
Contributor Author

nspope commented Aug 1, 2024

Ah, the output array was too small with only 2 time windows.

Here's with XDECREF commented out:

Here's with XDECREF:

looks sane to me. I'm waiting a bit before starting monitoring, here, to skip the initial overhead associated with imports, etc.

memory monitoring code for real
#!/usr/bin/env bash

echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
  100, sequence_length=1e6, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.append(np.unique(ts.nodes_time), np.inf)
windows = np.linspace(0, ts.sequence_length, 100)
sample_sets = list(ts.samples())
sample_set_sizes = [ts.num_samples]
indexes = [(0, 0)]
for _ in range(100000):
    ts.ll_tree_sequence.pair_coalescence_counts(
        time_windows=time_windows, 
        windows=windows, 
        indexes=indexes, 
        sample_sets=sample_sets,
        sample_set_sizes=sample_set_sizes,
        span_normalise=False,
    )
' >memory_monitor.py

python memory_monitor.py &
sleep 1
for i in {1..1000}; do
  ps -eo cmd,%mem,vsz,rsz | grep "python memory_monitor"
  sleep 0.1
done >memory_monitor.log

echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.1
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage (divided by max)")
plt.xlabel("Wall time (relative to start)")
plt.legend()
plt.savefig("memory_monitor.png")
' | python

@nspope
Copy link
Contributor Author

nspope commented Aug 2, 2024

Maybe the thing to do with time windows is change the input in ts.ll_tree_sequence.pair_coalescence_counts from time_windows to integer vector node_bin, that maps nodes to output bins (which could be time bins, or any other delineation). Then ts.pair_coalescence_counts can internally do the binning for a given set of time breakpoints, which allows all three cases of time windows (nodes, unique time points, time intervals) to be passed to the low-level routine.

@nspope nspope force-pushed the pair-coalescence-extend branch from 93f6feb to 1a01087 Compare August 2, 2024 05:51
@jeromekelleher
Copy link
Member

Nice, that sounds like a good idea. Do you want to push forward with that, or merge this much and iterate? (I haven't reviewed again, will need to take another good look)

@nspope
Copy link
Contributor Author

nspope commented Aug 2, 2024

I gave it a shot, but it looks like I'm getting stochastic crashes on Windows. Totally perplexed as to why, because the semantics are very similar to the last commit, which worked.

(It could be because I had skipped a bunch of tests in the last commit?)

@nspope nspope force-pushed the pair-coalescence-extend branch 3 times, most recently from fde58bf to 9355d19 Compare August 2, 2024 17:37
@nspope
Copy link
Contributor Author

nspope commented Aug 3, 2024

Working now-- bad malloc (d'oh). This is ready for another look whenever you have time @jeromekelleher .

Here's mem usage once more, with the time_windows -> nodes_output_map change:

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just needs a couple of more tests on the low-level input, and we're good to merge I think

c/tskit/trees.c Show resolved Hide resolved
python/_tskitmodule.c Show resolved Hide resolved
python/_tskitmodule.c Show resolved Hide resolved
python/_tskitmodule.c Outdated Show resolved Hide resolved
@nspope nspope force-pushed the pair-coalescence-extend branch from 31cdd29 to ca06cff Compare August 5, 2024 22:40
@nspope
Copy link
Contributor Author

nspope commented Aug 5, 2024

Done!

It occurs to me that the order of dimensions here -- (windows, time_windows, indexes) -- should be consistent with the (forthcoming) time windowed stats. @petrelharp what order did you guys settle on?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superb stuff @nspope - there's a lot of things to get up to speed on here and you've done a great job 👍

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 6, 2024
@mergify mergify bot merged commit 1f28c11 into tskit-dev:main Aug 6, 2024
21 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 6, 2024
@petrelharp
Copy link
Contributor

(windows, time_windows, indexes)

Same!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants