Low level pair coalescence counts #2932

nspope · 2024-04-18T00:44:36Z

Low level extension of #2915

codecov · 2024-04-18T01:21:24Z

Codecov Report

Attention: Patch coverage is 95.28302% with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.65%. Comparing base (beafeba) to head (ca06cff).

Files	Patch %	Lines
c/tskit/trees.c	93.50%	5 Missing and 8 partials ⚠️
python/_tskitmodule.c	98.09%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2932      +/-   ##
==========================================
+ Coverage   89.62%   89.65%   +0.02%     
==========================================
  Files          29       29              
  Lines       30170    30393     +223     
  Branches     5867     5901      +34     
==========================================
+ Hits        27041    27249     +208     
- Misses       1790     1796       +6     
- Partials     1339     1348       +9

Flag	Coverage Δ
c-tests	`86.31% <93.68%> (+0.10%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`88.86% <98.09%> (+0.13%)`	⬆️
python-tests	`99.00% <100.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
c/tskit/core.c	`95.63% <100.00%> (+0.03%)`	⬆️
c/tskit/core.h	`100.00% <ø> (ø)`
python/tskit/trees.py	`98.75% <100.00%> (-0.05%)`	⬇️
python/_tskitmodule.c	`88.86% <98.09%> (+0.13%)`	⬆️
c/tskit/trees.c	`90.55% <93.50%> (+0.13%)`	⬆️

nspope · 2024-04-18T03:24:37Z

I'd like to generalize this algorithm slightly:

Currently, the output is either a num_windows by num_nodes array (which is very large), or a num_windows by num_time_windows array where the counts are summed within time windows.
Conceptually, the "nodes" output gives the empirical distribution of pair coalescence times in windows across the genome. That is, for each window we have a vector of RVs (node times) and a vector of weights (pair coalescence counts).
From this distributional viewpoint, there's lots of useful things that may be calculated: the empirical CDF, quantiles, moments, coalescence rates, etc. (of which the "sum in time windows" option of the current implementation is one special case)

So, I think it'd be useful to take the current algorithm, and have it apply a summary function at the end of each window. This would let us calculate any useful summary statistic without having to create a potentially humongous array (windows by nodes by indexes) as an intermediate.

The API would stay the same-- later on, we could add named methods for various summary statistics, and potentially eventually expose a "general summary stat" interface, like is done for the other statistics.

jeromekelleher

Looks good to me - do you want to add the summary func stuff now before we start porting to C? Probably a good idea, if you want to do this in the short term.

python/tests/test_coalrate.py

nspope · 2024-07-25T01:29:35Z

The docs build is failing with numpy 2 issues on import msprime, after a rebase-- is there something I can do to fix this?

jeromekelleher · 2024-07-25T08:22:00Z

This is probably a package cache issue @benjeffery?

nspope · 2024-07-25T18:59:20Z

I bumped the cache version to no avail-- however, it looks like requirements/CI-docs/requirements.txt has msprime pinned to 1.3.1?

nspope · 2024-07-25T19:05:55Z

Yup, pinning msprime to the latest version in requirements/* seems to have done the trick.

nspope · 2024-07-26T01:34:43Z

Alright, I think this is ready for a look @petrelharp and @jeromekelleher. The prototypes for TreeSequence methods are prefixed by proto in the tests. Currently, the workhorse is _pair_coalescence_stat which applies a reduction function summary_func(nodes_weight, nodes_time, **summary_func_kwargs) at the end of each window.

I think the next steps are:

Implement _pair_coalescence_stat in C, and allow it to take either (A) a custom reduction function that is a python callback; (B) a C function that also takes in a struct with extra arguments (for the kwargs).
Implement the reduction functions needed to get coalescence rates in C
Make a python API+tests for coalescence rates / counts (already mostly done)
Make a python API for custom reduction functions (lower priority)

Does this seem about right? Alternatively I suppose we could have the reduction functions for (2) be python callbacks? But there might a performance penalty with lots of windows.

jeromekelleher

This all looks great to me @nspope - the algorithm seems very clean.

Regarding Python callbacks for C code, although there's some existing infrastructure for doing this, I'd leave this bit until last. The Python C API stuff is very tricky and there are footguns everywhere. Also, performance will be terrible - there's no way you can use this approach for anything other than prototyping. The summary func gets called in the deepest inner loops, and getting the Python interpreter involved there just doesn't work performance-wise.

nspope · 2024-07-30T02:15:41Z

Hmm, tests are passing on Ubuntu but evidently not on MacOS or Windows. ssh'ing onto the runner, I can get

malloc: *** error for object 0x7fe5ea402a90: pointer being freed was not allocated

on MacOS, happening in one of the tsk_safe_free calls.

jeromekelleher · 2024-07-30T08:14:45Z

I would strongly advise writing some simple C tests that exercise all reachable code paths, and running it through valgrind. This is a much more effective way of tracking down these kinds of bugs. I can help with this, if it's not clear how to start?

nspope · 2024-07-30T17:08:27Z

Makes sense @jeromekelleher, I can take a crack at it. Just so I'm understanding the scope, is the goal to recreate the entire test suite from python in c/tests (correctness tests, etc)? Or something more minimal?

jeromekelleher · 2024-07-30T18:25:05Z

Something minimal. You're just looking to exercise all code paths so that valgrind can see them.

nspope · 2024-07-30T23:02:56Z

That did the trick ... is there a good way to trigger OOM errors after the mallocs, to get patterns like

if (whatever == NULL) {
  ret = TSK_ERR_NO_MEMORY;
  goto out;
}

covered? It looks like these aren't covered in the existing tests, so presumably not?

nspope · 2024-07-31T01:09:39Z

The only substantial coverage gap is in _tskitmodule.c: https://app.codecov.io/gh/tskit-dev/tskit/pull/2932/blob/python/_tskitmodule.c#L9924 where it looks like all execution paths ends in the check,

    if (TreeSequence_check_state(self) != 0) {
        goto out;
    }

however, this can't be the only execution path, because otherwise ll_tree_sequence.pair_coalescence_counts wouldn't return anything and none of the python test suite would pass.

Is there something else needed to get the CPython parts covered?

nspope · 2024-07-31T04:15:23Z

Ah, nevermind-- I see that coverage for CPython is getting read off of test_lowlevel.py

jeromekelleher · 2024-07-31T10:55:43Z

is there a good way to trigger OOM errors after the mallocs, to get patterns like

We've been ignoring these here, as they are very tricky to trigger. We did it in the kastore C tests by running with a patched malloc which fails after n allocs, but it's a lot of work for a pretty marginal benefit.

jeromekelleher

Looks great! There's a lot going on in there, and I've made a few comments style-wise to make things a bit more idiomatic (mostly just change loop conditions to <). There's a few bugs in there that'll only show up in error conditions though, so we do need some more testing of those.

I think a separate C test case where you take a simple case, and then try to provoke all the input errors around time windows etc would be the best approach.

For the Python C API, we have to be super thorough here and make sure that any non-memory allocation failure cases are covered in test_low_level.py. I know it's a lot of boiler plate, but this is what's involved I'm afraid.

We also want to make sure that the reference counting stuff is working properly. The best way I've found of doing this is to run the code in a tight loop, and watch top.

Can you do something like this and check please?

ts = # some intermediate size example, so that loops are quick
for _ in range(1000000):
      ts.pair_coalescence_counts(...)

The memory usage should stabilise after 30 seconds or so, and then remain constant. Then, try commenting out some of the XDECREFs, and see what happens to memory usage.

c/tests/test_stats.c

c/tskit/trees.c

python/_tskitmodule.c

python/tests/test_lowlevel.py

nspope · 2024-07-31T17:36:41Z

Thanks @jeromekelleher! All is clear to me except how to handle the special case where time windows are nodes (which is an important one). I left some comments above describing the issue. Briefly, I'm detecting the "nodes" case via a zero-length time windows array. This can't be replaced by an array of unique times in the tree sequence, because we want to preserve node identity even if its age is not unique, and also the order of nodes in the output array; and both these get lost when we sort nodes into (necessarily unique) time windows.

I suppose an alternative (to using a zero-length time windows array) is to use options to store a flag which tells us we are in nodes mode (like is done for the general stats). What do you think is the cleanest option here?

nspope · 2024-08-01T04:27:42Z

Getting another random windows crash in the CI, but valgrind isn't showing any issues. I traced it back to commit 6287b84 which switched from using int to double, and switched to using the row access macro.

jeromekelleher · 2024-08-01T08:22:48Z

I really would prefer if we could back out of doing the three different options here and focus on the simpler case of just a fixed set of required time windows first. There's a lot of complexity here already, and I want to make sure we catch all of the boring C level stuff before we have to think about that extra level of trickiness.

So, can we leave that bit for a follow up, and focus on getting 100% test coverage on the bits that should be covered? That means more tests in C, and breaking up the Python C tests like I suggested.

nspope · 2024-08-01T16:11:07Z

So I should pytest.mark.skip any tests using the "nodes" output for the time being? These are the bulk of the tests, which is why I tried to support that mode.

I'm working on extending the test_lowlevel case to get the CPython covered, but I think all the C code paths are now covered aside from OOM.

nspope · 2024-08-01T19:17:34Z

Here's memory consumption over time:

looks to me like it's stabilizing -- is this what you were after @jeromekelleher ?

Code here:

monitor mem usage with ps

#!/usr/bin/env bash

echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
  1000, sequence_length=1e5, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.array([0, np.max(ts.nodes_time)/2, np.inf])
for _ in range(100000):
  ts.pair_coalescence_counts(time_windows=time_windows)
' >del_memory_monitor.py

rm -f del_memory_monitor.log
python del_memory_monitor.py &
for i in {1..1000}; do
  ps -eo cmd,%mem,vsz,rsz | grep "python del_memory_monitor" >> del_memory_monitor.log
  sleep 0.0001
done 
rm del_memory_monitor.py

echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("del_memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.0001
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage / max(memory usage)")
plt.xlabel("Wall time (relative)")
plt.xscale("log")
plt.legend()
plt.savefig("del_memory_monitor.png")
' | python
rm del_memory_monitor.log

jeromekelleher · 2024-08-01T19:19:30Z

Nice - did you try it with some decrefs commented out?

nspope · 2024-08-01T19:24:54Z

this is with results_array XDECREF commented out. Looks qualitatively similar ... ?

jeromekelleher · 2024-08-01T20:25:04Z

Hmm, you need bigger examples then. You should be leaking memory heavily there.

nspope · 2024-08-01T21:53:52Z

Ah, the output array was too small with only 2 time windows.

Here's with XDECREF commented out:

Here's with XDECREF:

looks sane to me. I'm waiting a bit before starting monitoring, here, to skip the initial overhead associated with imports, etc.

memory monitoring code for real

#!/usr/bin/env bash

echo '
import tskit
import numpy as np
import msprime
ts = msprime.sim_ancestry(
  100, sequence_length=1e6, recombination_rate=1e-8, population_size=1e4
)
time_windows = np.append(np.unique(ts.nodes_time), np.inf)
windows = np.linspace(0, ts.sequence_length, 100)
sample_sets = list(ts.samples())
sample_set_sizes = [ts.num_samples]
indexes = [(0, 0)]
for _ in range(100000):
    ts.ll_tree_sequence.pair_coalescence_counts(
        time_windows=time_windows, 
        windows=windows, 
        indexes=indexes, 
        sample_sets=sample_sets,
        sample_set_sizes=sample_set_sizes,
        span_normalise=False,
    )
' >memory_monitor.py

python memory_monitor.py &
sleep 1
for i in {1..1000}; do
  ps -eo cmd,%mem,vsz,rsz | grep "python memory_monitor"
  sleep 0.1
done >memory_monitor.log

echo '
import numpy as np
import matplotlib.pyplot as plt
vsz, rsz = np.loadtxt("memory_monitor.log", usecols=[2,3]).T
step = np.arange(vsz.size) * 0.1
plt.plot(step, vsz / np.max(vsz), "-", label="VSZ")
plt.plot(step, rsz / np.max(rsz), "-", label="RSZ")
plt.ylabel("Memory usage (divided by max)")
plt.xlabel("Wall time (relative to start)")
plt.legend()
plt.savefig("memory_monitor.png")
' | python

nspope · 2024-08-02T00:20:20Z

Maybe the thing to do with time windows is change the input in ts.ll_tree_sequence.pair_coalescence_counts from time_windows to integer vector node_bin, that maps nodes to output bins (which could be time bins, or any other delineation). Then ts.pair_coalescence_counts can internally do the binning for a given set of time breakpoints, which allows all three cases of time windows (nodes, unique time points, time intervals) to be passed to the low-level routine.

jeromekelleher · 2024-08-02T07:55:04Z

Nice, that sounds like a good idea. Do you want to push forward with that, or merge this much and iterate? (I haven't reviewed again, will need to take another good look)

nspope · 2024-08-02T16:33:14Z

I gave it a shot, but it looks like I'm getting stochastic crashes on Windows. Totally perplexed as to why, because the semantics are very similar to the last commit, which worked.

(It could be because I had skipped a bunch of tests in the last commit?)

nspope · 2024-08-03T04:29:09Z

Working now-- bad malloc (d'oh). This is ready for another look whenever you have time @jeromekelleher .

Here's mem usage once more, with the time_windows -> nodes_output_map change:

c/tskit/trees.c

jeromekelleher

LGTM! Just needs a couple of more tests on the low-level input, and we're good to merge I think

c/tskit/trees.c

python/_tskitmodule.c

nspope · 2024-08-05T23:54:34Z

Done!

It occurs to me that the order of dimensions here -- (windows, time_windows, indexes) -- should be consistent with the (forthcoming) time windowed stats. @petrelharp what order did you guys settle on?

jeromekelleher

Superb stuff @nspope - there's a lot of things to get up to speed on here and you've done a great job 👍

petrelharp · 2024-08-06T23:36:48Z

(windows, time_windows, indexes)

Same!

jeromekelleher reviewed Apr 22, 2024

View reviewed changes

python/tests/test_coalrate.py Outdated Show resolved Hide resolved

nspope force-pushed the pair-coalescence-extend branch from ff871bd to 7839a94 Compare July 25, 2024 00:54

nspope marked this pull request as ready for review July 25, 2024 01:24

jeromekelleher approved these changes Jul 26, 2024

View reviewed changes

nspope force-pushed the pair-coalescence-extend branch from ec95c97 to 722a3e4 Compare July 30, 2024 01:07

nspope force-pushed the pair-coalescence-extend branch from 5ad6e09 to 1aab9b5 Compare July 30, 2024 22:32

jeromekelleher reviewed Jul 31, 2024

View reviewed changes

nspope force-pushed the pair-coalescence-extend branch 2 times, most recently from 7a0e1b8 to 06661b1 Compare August 1, 2024 04:21

nspope force-pushed the pair-coalescence-extend branch from 338f3f6 to 6318fad Compare August 1, 2024 17:56

nspope force-pushed the pair-coalescence-extend branch from 93f6feb to 1a01087 Compare August 2, 2024 05:51

nspope force-pushed the pair-coalescence-extend branch 3 times, most recently from fde58bf to 9355d19 Compare August 2, 2024 17:37

petrelharp reviewed Aug 4, 2024

View reviewed changes

c/tskit/trees.c Show resolved Hide resolved

jeromekelleher reviewed Aug 5, 2024

View reviewed changes

c/tskit/trees.c Show resolved Hide resolved

python/_tskitmodule.c Show resolved Hide resolved

python/_tskitmodule.c Show resolved Hide resolved

python/_tskitmodule.c Outdated Show resolved Hide resolved

Initial C implementation of pair_coalescence_counts

ca06cff

nspope force-pushed the pair-coalescence-extend branch from 31cdd29 to ca06cff Compare August 5, 2024 22:40

jeromekelleher approved these changes Aug 6, 2024

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 6, 2024

mergify bot merged commit 1f28c11 into tskit-dev:main Aug 6, 2024
21 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 6, 2024

nspope mentioned this pull request Aug 6, 2024

Use reduction function in pair_coalescence_stat #2975

Merged

nspope mentioned this pull request Sep 27, 2024

Update changelog for 0.5.9 #2994

Closed

Low level pair coalescence counts #2932

Low level pair coalescence counts #2932

Conversation

nspope commented Apr 18, 2024

codecov bot commented Apr 18, 2024 • edited Loading

Codecov Report

nspope commented Apr 18, 2024 • edited Loading

jeromekelleher left a comment

Choose a reason for hiding this comment

nspope commented Jul 25, 2024 • edited Loading

jeromekelleher commented Jul 25, 2024

nspope commented Jul 25, 2024

nspope commented Jul 25, 2024

nspope commented Jul 26, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

nspope commented Jul 30, 2024 • edited Loading

jeromekelleher commented Jul 30, 2024

nspope commented Jul 30, 2024

jeromekelleher commented Jul 30, 2024

nspope commented Jul 30, 2024 • edited Loading

nspope commented Jul 31, 2024

nspope commented Jul 31, 2024

jeromekelleher commented Jul 31, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

nspope commented Jul 31, 2024 • edited Loading

nspope commented Aug 1, 2024

jeromekelleher commented Aug 1, 2024

nspope commented Aug 1, 2024

nspope commented Aug 1, 2024

jeromekelleher commented Aug 1, 2024

nspope commented Aug 1, 2024

jeromekelleher commented Aug 1, 2024

nspope commented Aug 1, 2024 • edited Loading

nspope commented Aug 2, 2024 • edited Loading

jeromekelleher commented Aug 2, 2024

nspope commented Aug 2, 2024

nspope commented Aug 3, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

nspope commented Aug 5, 2024

jeromekelleher left a comment

Choose a reason for hiding this comment

petrelharp commented Aug 6, 2024

codecov bot commented Apr 18, 2024 •

edited

Loading

nspope commented Apr 18, 2024 •

edited

Loading

nspope commented Jul 25, 2024 •

edited

Loading

nspope commented Jul 30, 2024 •

edited

Loading

nspope commented Jul 30, 2024 •

edited

Loading

nspope commented Jul 31, 2024 •

edited

Loading

nspope commented Aug 1, 2024 •

edited

Loading

nspope commented Aug 2, 2024 •

edited

Loading