Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __str__ and _repr_html_ to Variant #2384

Merged
merged 1 commit into from
Jul 28, 2022
Merged

Conversation

szhan
Copy link
Member

@szhan szhan commented Jul 5, 2022

Fixes #2367

@jeromekelleher
Copy link
Member

Cool! Can you paste in what these look like please so we can have a look?

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, needs some more detail like the site position and the allele counts.

python/tskit/genotypes.py Outdated Show resolved Hide resolved
python/tskit/util.py Outdated Show resolved Hide resolved
python/tskit/util.py Outdated Show resolved Hide resolved
@benjeffery
Copy link
Member

@szhan Let me know if you want some pointers here.

@hyanwong hyanwong mentioned this pull request Jul 8, 2022
3 tasks
@szhan szhan force-pushed the display_variant branch from b00c2c4 to bc4bf91 Compare July 20, 2022 15:20
@codecov
Copy link

codecov bot commented Jul 20, 2022

Codecov Report

Merging #2384 (2688ab2) into main (12812b0) will decrease coverage by 12.34%.
The diff coverage is 12.50%.

❗ Current head 2688ab2 differs from pull request most recent head 024ef7a. Consider uploading reports for the commit 024ef7a to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2384       +/-   ##
===========================================
- Coverage   93.43%   81.08%   -12.35%     
===========================================
  Files          28       28               
  Lines       27337    27110      -227     
  Branches     1247     1245        -2     
===========================================
- Hits        25541    21981     -3560     
- Misses       1762     5055     +3293     
- Partials       34       74       +40     
Flag Coverage Δ
c-tests 92.26% <ø> (ø)
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.26% <12.50%> (-0.12%) ⬇️
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/util.py 55.33% <5.26%> (-44.67%) ⬇️
python/tskit/genotypes.py 32.25% <23.07%> (-66.64%) ⬇️
python/tskit/cli.py 0.00% <0.00%> (-96.19%) ⬇️
python/tskit/vcf.py 7.75% <0.00%> (-90.52%) ⬇️
python/tskit/text_formats.py 9.87% <0.00%> (-90.13%) ⬇️
python/tskit/drawing.py 11.02% <0.00%> (-87.97%) ⬇️
python/tskit/combinatorics.py 14.26% <0.00%> (-85.11%) ⬇️
python/tskit/stats.py 35.13% <0.00%> (-64.87%) ⬇️
python/tskit/trees.py 46.07% <0.00%> (-52.65%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12812b0...024ef7a. Read the comment docs.

@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

Example output of __str__().
Screen Shot 2022-07-20 at 5 32 20 PM

@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

Example output of _repr_html_().
Screen Shot 2022-07-20 at 5 59 01 PM

@hyanwong
Copy link
Member

You could put count (freq) - so they are are on the same line?

@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

Should we print out the alleles in some (lexicographically) sorted order?

@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

The HTML text has really long lines, so it is failing the flake8 test. What should we do?

@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

I'm not sure why I can't get re.match to work... I'm expecting re.match to return a Match object not None, which is what I'm getting with the code snippet below...

import tests.tsutil as tsutil
import re

ts = tsutil.all_fields_ts()
v = next(ts.variants())
str(v)

m = re.match(
    r"allele",
    str(v),
)
print(m)

EDIT: I had the regexp pattern wrong!

@szhan szhan requested a review from hyanwong July 20, 2022 18:48
@szhan szhan self-assigned this Jul 20, 2022
@szhan szhan added enhancement New feature or request Python API Issue is about the Python API labels Jul 20, 2022
@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

Cool! Can you paste in what these look like please so we can have a look?

I've attached a couple of images showing some examples.

@szhan szhan marked this pull request as ready for review July 20, 2022 18:54
@szhan szhan requested a review from jeromekelleher July 20, 2022 18:54
@szhan
Copy link
Member Author

szhan commented Jul 20, 2022

Should we also show the ancestral allele?

@jeromekelleher
Copy link
Member

The HTML text has really long lines, so it is failing the flake8 test. What should we do?

Looks like you can split some of the longest easily enough (no reason to have lines > 100), but there's also a noqa you can put at the end to tell flake8 to ignore it (see further down in the same function).

@jeromekelleher
Copy link
Member

Should we also show the ancestral allele?

Let's show them in the order they're returned by variant.alleles. You can do something like

counts = self.counts()
freqs = self.frequencies()
for allele in self.alleles:
       print(f"Allele {allele}: {count[allele]} ({freqs[allele] * 100: .2g}%)")

@szhan
Copy link
Member Author

szhan commented Jul 21, 2022

Here is what the output looks like.

__str__()
Screen Shot 2022-07-21 at 1 04 50 PM

_repr_html_()
Screen Shot 2022-07-21 at 12 56 43 PM

@hyanwong
Copy link
Member

hyanwong commented Jul 21, 2022

Looking good, thanks @szhan

  1. Might it be worth saying "Number of alleles" and "Number of samples" rather than just "allele" and "samples"?
  2. Do we need the length of genotypes in there (isn't that always the number of samples)?
  3. Should we put quotes around the allele string, in case any of the alleles are the empty string (e.g. for an indel). So Allele 'A' rather than Allele A?
  4. Instead of Allele None should we say e.g. "Allele missing"?
  5. "Missing data" might be crueller as "Has missing data"?
  6. Is isolated as missing meant to be a boolean rather than numeric?

@hyanwong
Copy link
Member

hyanwong commented Jul 21, 2022

  1. Might it be worth saying "Number of alleles" and "Number of samples" rather than just "allele" and "samples"?

Actually, I forgot that (a) it might not be all the sample nodes in the ts, if a subset of samples is specified and (b) it is possible to specify non samples to be counted too.

I'm therefore not sure of the best phrasing instead of "samples", but we could say Number of samples used?

@hyanwong
Copy link
Member

hyanwong commented Jul 21, 2022

Looking good, thanks @szhan

  1. Might it be worth saying "Number of alleles" and "Number of samples" rather than just "allele" and "samples"? Or # samples and # alleles if you want a shorter abbreviation.
  2. Do we actually need the length of genotypes in there (isn't that always the number of samples)?
  3. Should we put quotes around the allele string, in case any of the alleles are the empty string (e.g. for an indel). So Allele 'A' rather than Allele A?
  4. Instead of Allele None should we say e.g. "Allele missing"?
  5. "Missing data" might be clearer as "Has missing data"?
  6. Is isolated as missing meant to be a boolean rather than numeric?

@szhan
Copy link
Member Author

szhan commented Jul 21, 2022

The HTML text has really long lines, so it is failing the flake8 test. What should we do?

Looks like you can split some of the longest easily enough (no reason to have lines > 100), but there's also a noqa you can put at the end to tell flake8 to ignore it (see further down in the same function).

There is already # noqa: B950 at the end of variant_html() in util.py, but that flake8 warning still pops up.

EDIT: Ah! Figured it out!

@szhan
Copy link
Member Author

szhan commented Jul 21, 2022

This?
Screen Shot 2022-07-21 at 2 11 16 PM

@hyanwong
Copy link
Member

This?
Screen Shot 2022-07-21 at 2 11 16 PM

Yeah, looks good. Maybe put "Allele missing" instead of Allele 'None'?

@hyanwong
Copy link
Member

Is it clear what the "3" in 3 (2.6%) means? Should we say "Samples with allele 'A'", or is that too wordy?

Also there's a space between the open bracket and the percentage amount, I think?

@szhan
Copy link
Member Author

szhan commented Jul 21, 2022

All the tests have passed! Updating the change log.

@szhan szhan force-pushed the display_variant branch from 5fc76ee to 1dc3e71 Compare July 21, 2022 17:46
@hyanwong
Copy link
Member

hyanwong commented Jul 21, 2022

Yuck. This is why we've been moving away from using simulations as the basis for tests, super annoying dealing with these inconsistencies.

It's sometime helpful to create a new TS using e.g. tskit.Tree.generate_balanced(4).tree_sequence, but that won't have mutations. It is worth having a deterministic routine in tskit (or maybe just in the test suite) that adds a site and some mutations to an existing tree sequence in one pass? E.g.

def add_variation(ts, position, ancestral_state, derived_states, mutation_nodes=ts.samples(), mutation_times=None):
    """
    Return a new tree sequence with a new site added at ``position`` with the given ancestral state.
    and for each element $i$ in the list of ``derived states``, add a mutation above mutation_nodes[i]
    at time mutation_times[i] (or TIME_UNKNOWN if not specified)
    """

Then making a test case is as simple as

ts = tskit.Tree.generate_balanced(4).tree_sequence
ts = tskit.add_variation(ts, 0, "A", ["T", "C"])

useful?

@szhan
Copy link
Member Author

szhan commented Jul 22, 2022

Yuck. This is why we've been moving away from using simulations as the basis for tests, super annoying dealing with these inconsistencies.

It's sometime helpful to create a new TS using e.g. tskit.Tree.generate_balanced(4).tree_sequence, but that won't have mutations. It is worth having a deterministic routine in tskit (or maybe just in the test suite) that adds a site and some mutations to an existing tree sequence in one pass? E.g.

def add_variation(ts, position, ancestral_state, derived_states, mutation_nodes=ts.samples(), mutation_times=None):
    """
    Return a new tree sequence with a new site added at ``position`` with the given ancestral state.
    and for each element $i$ in the list of ``derived states``, add a mutation above mutation_nodes[i]
    at time mutation_times[i] (or TIME_UNKNOWN if not specified)
    """

Then making a test case is as simple as

ts = tskit.Tree.generate_balanced(4).tree_sequence
ts = tskit.add_variation(ts, 0, "A", ["T", "C"])

useful?

Should we add this in separate PR?

@jeromekelleher
Copy link
Member

Should we add this in separate PR?

We should split it off to a separate issue at least.

@szhan
Copy link
Member Author

szhan commented Jul 22, 2022

Should we add this in separate PR?

We should split it off to a separate issue at least.

Should we do that before finishing this PR?

@jeromekelleher
Copy link
Member

Let's not bother with it, it's just a utility for testing. @hyanwong can create an to discuss as it's his idea.

@szhan
Copy link
Member Author

szhan commented Jul 23, 2022

When there are no sample nodes selected to get the genotypes, the outputs look like this.

__str__()
Screen Shot 2022-07-23 at 6 59 31 PM

_repr_html_()
Screen Shot 2022-07-23 at 6 57 03 PM

Also, it gives this warning:
WARNING:root:No non-missing samples at this site, frequencies undefined

@szhan
Copy link
Member Author

szhan commented Jul 23, 2022

When trying to call _str_() and _repr_html_() on an un-decoded Variant object, we get this error:

ValueError: This variant has not yet been decoded at a specific site,call Variant.decode to set the site.

This should be the behaviour we want, right?

@benjeffery
Copy link
Member

When trying to call _str_() and _repr_html_() on an un-decoded Variant object, we get this error:

ValueError: This variant has not yet been decoded at a specific site,call Variant.decode to set the site.

This should be the behaviour we want, right?

I don't think so, it should be a one-line table with "This variant has not yet been decoded at a specific site,call Variant.decode to set the site" as the content IMO.

@szhan
Copy link
Member Author

szhan commented Jul 25, 2022

Something like this?

╔═══════════════════════════════════════════════════════════╗
║ This variant has not yet been decoded at a specific site, call Variant.decode to set the site.║
╚═══════════════════════════════════════════════════════════╝

@jeromekelleher
Copy link
Member

Yes, __str__ etc should always succeed even if the variant hasn't been decoded yet.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

python/tskit/util.py Outdated Show resolved Hide resolved
python/tests/test_genotypes.py Show resolved Hide resolved
@szhan
Copy link
Member Author

szhan commented Jul 27, 2022

When calling __str__() and _repr_html_() on an un-decoded Variant object, the outputs look like this.

Screen Shot 2022-07-27 at 3 12 11 PM

Screen Shot 2022-07-27 at 3 11 54 PM

@szhan szhan force-pushed the display_variant branch from bffe447 to ff8ad95 Compare July 27, 2022 15:40
@szhan szhan requested a review from benjeffery July 27, 2022 15:44
@szhan szhan force-pushed the display_variant branch from 5e9996f to d0235c6 Compare July 27, 2022 17:07
@szhan szhan requested a review from hyanwong July 27, 2022 17:39
Copy link
Member

@hyanwong hyanwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Just a couple of nits.

python/tskit/genotypes.py Outdated Show resolved Hide resolved
python/tskit/util.py Outdated Show resolved Hide resolved
python/tskit/util.py Outdated Show resolved Hide resolved
python/tskit/util.py Show resolved Hide resolved
python/tskit/util.py Outdated Show resolved Hide resolved
@benjeffery benjeffery added this to the Python 0.5.2 milestone Jul 28, 2022
@szhan szhan force-pushed the display_variant branch from 2688ab2 to 024ef7a Compare July 28, 2022 14:09
@szhan szhan added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 28, 2022
@mergify mergify bot merged commit 228a363 into tskit-dev:main Jul 28, 2022
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 28, 2022
@szhan szhan deleted the display_variant branch July 28, 2022 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python API Issue is about the Python API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Display methods for Variant
4 participants