Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement allele counting methods #2393

Merged
merged 1 commit into from
Jul 20, 2022
Merged

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Jul 8, 2022

Description

Add variant.num_missing, variant.counts(), and variant.frequencies(). Ping @szhan - is this useful for #2384 ?

Fixes #2390

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 8, 2022

No tests yet. Is the interface right @benjeffery ?

@szhan
Copy link
Member

szhan commented Jul 8, 2022

Ah, yes, they are useful for #2384.

@szhan
Copy link
Member

szhan commented Jul 8, 2022

They are also useful for some information-theoretic imputation quality metrics that I want to implement soon.

@codecov
Copy link

codecov bot commented Jul 8, 2022

Codecov Report

Merging #2393 (812af67) into main (f41eddc) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 812af67 differs from pull request most recent head 22c10aa. Consider uploading reports for the commit 22c10aa to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2393      +/-   ##
==========================================
- Coverage   93.34%   93.34%   -0.01%     
==========================================
  Files          28       28              
  Lines       26988    26982       -6     
  Branches     1246     1245       -1     
==========================================
- Hits        25193    25187       -6     
  Misses       1761     1761              
  Partials       34       34              
Flag Coverage Δ
c-tests 92.26% <0.00%> (ø)
lwt-tests 89.05% <0.00%> (ø)
python-c-tests 71.01% <0.00%> (-0.02%) ⬇️
python-tests 98.96% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/combinatorics.py 99.36% <0.00%> (-0.01%) ⬇️
python/tskit/trees.py 98.66% <0.00%> (ø)
python/tskit/__init__.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f41eddc...22c10aa. Read the comment docs.

@benjeffery
Copy link
Member

No tests yet. Is the interface right @benjeffery ?

Interface looks good to me, although I'm not sure about raising an error on there being zero samples, what about an empty dict as the return in that case? That's what counts would do.

It would also be nice to reuse counts in frequencies.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 8, 2022

No tests yet. Is the interface right @benjeffery ?

Interface looks good to me, although I'm not sure about raising an error on there being zero samples, what about an empty dict as the return in that case?

Yeah, I wasn't sure about that. Another possibility would be to return NaN?

That's what counts would do.

Would it. That seems a mistake to me. I think it should return all the alleles with a count of zero?

It would also be nice to reuse counts in frequencies.

I could do that. It seemed a bit more work, but probably about the same.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 8, 2022

Would it. That seems a mistake to me. I think it should return all the alleles with a count of zero?

Oh yes, I think I coded it wrongly. I suspect it should return a value for everything in variant.alleles?

@hyanwong hyanwong force-pushed the variant-counts branch 2 times, most recently from 6d67bde to ecae7cf Compare July 8, 2022 14:00
@hyanwong
Copy link
Member Author

hyanwong commented Jul 8, 2022

I coded this to raise a warning if frequencies are calculated on variant with all missing samples or no samples, but output NaN (which is mathematically the correct thing, I think). Is this right, and should I create a separate logger instance in this case @benjeffery ?

@hyanwong hyanwong force-pushed the variant-counts branch 3 times, most recently from 177a999 to b5c301c Compare July 8, 2022 15:52
@hyanwong hyanwong requested a review from benjeffery July 9, 2022 13:10
@hyanwong hyanwong marked this pull request as ready for review July 9, 2022 13:10
@petrelharp
Copy link
Contributor

This looks very nice. Note: this overlaps with #504.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a mistake in the denominator for frequencies, I think!

python/tskit/genotypes.py Outdated Show resolved Hide resolved
python/tskit/genotypes.py Show resolved Hide resolved
@hyanwong
Copy link
Member Author

hyanwong commented Jul 12, 2022

Hmm, an exciting segfault here. WTF?

Edit - failing due to #2400

@benjeffery
Copy link
Member

@hyanwong Did you want to get this in for 0.5.1?

@hyanwong
Copy link
Member Author

hyanwong commented Jul 14, 2022

Maybe release 5.1 first, as we might want to think about how it interacts with @petrelharp's suggestions in #504, and nothing is waiting on this PR.

If @szhan wants to use it for #2384, he can simply wait until we decide on the right API.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, implementation comment. Should also consider options in the stdlib for representating frequency mapping

python/tskit/genotypes.py Outdated Show resolved Hide resolved
python/tskit/genotypes.py Outdated Show resolved Hide resolved
@hyanwong hyanwong force-pushed the variant-counts branch 2 times, most recently from 4d9aec2 to 1dcad55 Compare July 15, 2022 13:08
@hyanwong hyanwong requested a review from benjeffery July 15, 2022 14:18
@hyanwong
Copy link
Member Author

Should also consider options in the stdlib for representating frequency mapping

I had a hunt around and couldn't see anything terribly obvious.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Needs a few explicit tests, as it's quite hard to see what's really being tested when checking against the ts_fixture.

python/CHANGELOG.rst Outdated Show resolved Hide resolved
python/tests/test_genotypes.py Show resolved Hide resolved
python/tskit/genotypes.py Outdated Show resolved Hide resolved
python/tskit/genotypes.py Show resolved Hide resolved
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hyanwong
Copy link
Member Author

Can we merge this now? It would be useful for @szhan I think (and @benjeffery is on holiday)

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 20, 2022
@jeromekelleher
Copy link
Member

Merging - I'm assuming the changes Ben requested have been made? (Not obvious what they were here, I came in late to the conversation)

@jeromekelleher jeromekelleher dismissed benjeffery’s stale review July 20, 2022 08:56

To make PR mergable

@mergify mergify bot merged commit 33df3ca into tskit-dev:main Jul 20, 2022
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 20, 2022
Comment on lines +259 to +261
bincounts = np.bincount(self.genotypes)
for i, allele in enumerate(self.alleles):
counts[allele] = bincounts[i] if i < len(bincounts) else 0
Copy link
Contributor

@petrelharp petrelharp Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late here, but I think you could just do this?

            bincounts = np.bincount(self.genotypes, minlength=self.num_alleles)
            for i, allele in enumerate(self.alleles):
                counts[allele] = bincounts[i]

@petrelharp
Copy link
Contributor

This is excellent!

@petrelharp petrelharp mentioned this pull request Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allele counting methods for Variant?
5 participants