Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Generate outlier thresholds #513

Merged

Conversation

e-t-k
Copy link
Contributor

@e-t-k e-t-k commented Feb 5, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

  • Generate gene outlier thresholds for ribodeplete samples.
  • List outlier genes for each ribodeplete sample.

That is--for each sample, which genes are very highly expressed (or have very low expression) as compared to the dataset as a whole?

What was your approach?

A modified Tukey outlier method was used to generate thresholds for each gene representing values beyond which the gene's expression is an outlier. (See script for details.)

Then, these outlier thresholds were applied to each sample to generate outlier genes for that sample.

In addition, sample+gene entries were also coded as in the top5% of expression for a sample or dropped by the expression + variance filters calculated in step 01. (A QC fail code was also available for samples with less than ~2000 genes with any expression, but no sample failed this metric.)

What GitHub issue does your pull request address?

#229

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

I'm most concerned with whether my explanation of the threshold modification is understandable.

Also - whether there is a better way to represent the outlier results. Current file is a genes x samples matrix where each entry is a string encoding the results, eg "TU" in the Top5% and Up outlier, etc. I tried to strike a balance between concise, human-readable, and useful for downstream analyses but I imagine it could be improved.

Which areas should receive a particularly close look?

The rationale for the modified Tukey outlier thresholds (and the code implementing such). This algorithm is meant to replicate the outliers in the original Treehouse CARE code, which calculates the threshold on a per-sample basis; it's been identical on all tested samples but I want to confirm that it's rigorous in general.

Is there anything that you want to discuss further?

Nothing in particular.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

N/A -- no figure / tables generated.

I'll be making another pull request soon that will filter out non-tumor and QC fail samples as well as a couple cleanup items for step 01. At that point the outliers generated will be considered meaningful and we'll move forward on the next scientific questions.

Results

What types of results are included (e.g., table, figure)?

A data file listing outlier genes for all samples is generated at results/rsem-tpm-stranded-gene_expression_outliers.rds.

I would expect this to end up as a supplementary data file that readers could download. I haven't checked it into the repo as it's 450 Mb. Is 'results' the correct place for it or should it remain in scratch until a more concise table is generated?

What is your summary of the results?

None yet -- @GeoffLyle will look at the outliers once filtering (mentioned above) is put in.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile. (No additional dependencies in this pull.)
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

e-t-k added 4 commits February 4, 2020 22:07
utils.py - write_rds accidentally modified its input dataframe; now makes
a copy before changing it. oops.
adds candidate final draft for step 2 - generate outlier thresholds and outliers.
also adds it to continuous integration
TODO: check in a markdown viewer for typoes
@jashapiro
Copy link
Member

@e-t-k Thanks for this! At a first glance it looks good and will be a pleasure to review, thanks to all the work you did on documentation!

I will more formally review soon, but I wanted to give a quick suggestion for one of the questions you raised. The results/rsem-tpm-stranded-gene_expression_outliers.rds file seems like it should be pretty compressible, if it is a matrix of not very many different text values. Unfortunately, pyreadr can't write compressed .rds files, but it seems like in this case the best course of action would be to write out a .tsv.gz file, which pandas can do directly For downstream consumption, this would probably be easier anyway for people not using R. I expect the compressed file will easily fit in github's limits, so putting it in results/ should be fine.


U: Up outlier
D: Down outlier
T: in Top five percent of genes for this sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this means top 5% in expression measure, not some other metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - updated doc.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! The code is well arranged and easy to follow.

My main suggestions are these:

  • Move output file name/path specifications to main() to simplify internal functions, and so that all output filenames are fully described in a single place.
  • Consider moving away from dependence on .rds as an intermediate format where it might not be needed. I suggested feather which for uncompressed data is faster while retaining interoperability with R.
  • For text output tables that might be consumed by other users, consider .tsv.gz, which in this case should compress quite well, as I mentioned in my earlier comment.
  • Consider changing the output format table to cells containing fixed width strings, with each position in teh string indicating a single component of the outlier analysis. This has the potential to simplify downstream code a bit, though it can introduce bugs if the format of that string changes. The current solution seems reasonable to me as well, so I think this might be a matter of preference. I'm happy to hear your thoughts.

Lastly, if you wouldn't mind checking the "Allow edits from maintainers" box, that will allow us to help fix little problems that might come up in CI and save you a bit of work. Since your module passed CI with flying colors, that might not be an issue, but the other thing it allows is for us to merge in any changes from the master branch once the PR is approved, so we can merge the PR back to the master branch more efficiently.

Comment on lines +29 to +30
for each gene. This allows outlier results to be concordant whether a focus
sample was originally present in the dataset or whether it is a new sample.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and the explanation below) seem quite reasonable. My only question is really a kind of edge case: how many samples can be added before a set of thresholds might need to be recalculated? If I calculate thresholds and then add 50% more samples, presumably the threshold would need to be calculated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of what's mathematically necessary, adding or removing a single sample to the dataset does materially affect the thresholds (by a very small magnitude - but with 60k genes there are almost always a few that gain or lose outlier status due to this.)
If you add 50% more samples, the thresholds would definitely need to be recalculated.

The goal of all of this threshold-finagling is to gain any given sample's contribution when focusing on any other sample; but when that sample is the focus, to not include itself in the comparison.
Because there's a pretty large time gap between Treehouse compendia releases we are often using them to analyze samples that won't be themselves included in the background comparison for months, so when we analyze samples that are in the comparison we want to give them the same treatment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you wouldn't mind checking the "Allow edits from maintainers" box

Unfortunately I'm a bit stymied here - that checkbox isn't appearing anywhere in the PR, even after a relog (see screenshot). I'll keep an eye out in case it suddenly appears.

Screen Shot 2020-02-05 at 3 46 55 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @e-t-k, my best guess is that it has something to do with the UCSC-Treehouse permissions because this is coming from an organization’s fork/branch and not from a fork under your individual account.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaclyn-taroni thanks, that seems reasonable. You folks were able to add commits to my previous PR, #323, though; and I notice that that one was originally filed as a draft PR. So I might try making my next one a draft initially and seeing if that mysteriously fixes it.


print_v("Calculating outlier thresholds.")
thresholds = normalized_samples.apply(modified_tukey_thresholds, args=[iqr_multiplier], axis=1)
thresholds_path = os.path.join(scratch_dir, "{}threshold-expression-values.rds".format(prefix))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be more general to take the full path as an argument here, rather than generating it internally?

I am also not sure whether .rds is the best format here. I assume you want to preserve precision, so .tsv may not be the best format. But something less language-specific might be nice, for which I might recommend feather which is built into pandas and super fast (not a concern here, but nice).

Of course, now, having written all that, I am wondering if the theshold values file is actually used anywhere? If not, and you are just using it for debugging or other checks, maybe the file write can be made optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the thresholds file was for debugging so I've removed it.
Feather in general looks really promising --as part of my next pull I'll look into switching to it but I'd prefer not to deal with the extra dependencies at this exact moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s fine; the switch to tsv for the main output file was more important. I don't think it would actually add a dependency, as pandas supports it, but the advantages are not important here.

Comment on lines 142 to 143
if qc_fail:
result += "F" # Mark every gene as QC fail for this sample
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If QC fails, should the full result string be "F"? Or is there are case where we would still want the remaining information? I guess I can imagine one, so probably worth keeping the full string. For convenience though, I might make "F" the first character for easy scanning.

Alternatively, and I think this may be a good idea overall, perhaps these strings should be all the same length? As an possible example: The first character could be P/F, Pass/Fail; the second indicate outlier status: U/D/N, Up/Down/Not Outlier; and the third T/L, Top5%/Lower95%. Then parsing the string later can be a bit simpler, as you always know which position to look at for your question.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading a bit further, I see that you would need one more character for the dropped/retained code, maybe D/R (if position matters, redundant letters are available, but maybe we would still want to avoid them?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like your fixed-length string idea and here's the implementation I have in mind. Note that in the next pull we'll be adding the dropped reason (either Expression filter or Variance filter), so for now E will indicate any dropped genes and later V will also be a possible value.

QC Outlier 95th percentile? Dropped status
P = pass U = up T = top5% R = retained
F = fail D = down L = lower 95% E = dropped by expression filter
- N = not outlier - V = dropped by variance filter (NYI)


def normalize_expression(input_path, scratch_dir, prefix="", verbose=False):
"""Convert samples to log2(TPM) + 1 and save in scratch.
Note: This is duplicate work from step 01 because I forgot in that step
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can feel free to modify step 01 to do this in this PR, if you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

e-t-k added 2 commits February 6, 2020 15:08
Addresses @jashapiro's comments in AlexsLemonade#513 .
In particular:

Step 01: save normalized_expression to file which I forgot to do earlier.

Step 02:
- ingests normalized expression file from step 1 instead of original expression file
- parameters updated to reflect this
- normalize_expression() no longer needed

- file path handling all moved to main()

- Change output file format - each cell now contains a 4-character string in a fixed order which encodes the analysis results.
- update comments at top to reflect this

- Output file is now a .tsv.gz instead of .rds and is included in this pull.

- no longer saves thresholds file as it's not ingested anyhwere.
@e-t-k
Copy link
Contributor Author

e-t-k commented Feb 6, 2020

@jashapiro I've updated with a new commit that addresses your comments; thank you!

@e-t-k
Copy link
Contributor Author

e-t-k commented Feb 6, 2020

Looks like the CI failed in the tp53_nf1_score module but it happened to complete my analyses before it did so.

@jashapiro
Copy link
Member

Hi @e-t-k! Thanks for your updates! We have a few things breaking CI at the moment as we prepare for a new data release, so don’t worry about that. Hopefully we will get things sorted out shortly. I’ll ping you if we need you to update your branch when it is fully sorted.

@jaclyn-taroni
Copy link
Member

Hi @e-t-k - I just merged #523 which addressed the breaking changes. Looks like there are some conflicts in .circleci/config.yml - the tp53_nf1_score was moved up commented out for now (#527) and all of the molecular subtyping modules are now grouped together so that's what I would keep an eye out for when resolving those. Let us know if you have any questions, thanks for your patience!

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Great!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants