Generate outlier thresholds #513

e-t-k · 2020-02-05T07:14:04Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Generate gene outlier thresholds for ribodeplete samples.
List outlier genes for each ribodeplete sample.

That is--for each sample, which genes are very highly expressed (or have very low expression) as compared to the dataset as a whole?

What was your approach?

A modified Tukey outlier method was used to generate thresholds for each gene representing values beyond which the gene's expression is an outlier. (See script for details.)

Then, these outlier thresholds were applied to each sample to generate outlier genes for that sample.

In addition, sample+gene entries were also coded as in the top5% of expression for a sample or dropped by the expression + variance filters calculated in step 01. (A QC fail code was also available for samples with less than ~2000 genes with any expression, but no sample failed this metric.)

What GitHub issue does your pull request address?

#229

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

I'm most concerned with whether my explanation of the threshold modification is understandable.

Also - whether there is a better way to represent the outlier results. Current file is a genes x samples matrix where each entry is a string encoding the results, eg "TU" in the Top5% and Up outlier, etc. I tried to strike a balance between concise, human-readable, and useful for downstream analyses but I imagine it could be improved.

Which areas should receive a particularly close look?

The rationale for the modified Tukey outlier thresholds (and the code implementing such). This algorithm is meant to replicate the outliers in the original Treehouse CARE code, which calculates the threshold on a per-sample basis; it's been identical on all tested samples but I want to confirm that it's rigorous in general.

Is there anything that you want to discuss further?

Nothing in particular.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

N/A -- no figure / tables generated.

I'll be making another pull request soon that will filter out non-tumor and QC fail samples as well as a couple cleanup items for step 01. At that point the outliers generated will be considered meaningful and we'll move forward on the next scientific questions.

Results

What types of results are included (e.g., table, figure)?

A data file listing outlier genes for all samples is generated at results/rsem-tpm-stranded-gene_expression_outliers.rds.

I would expect this to end up as a supplementary data file that readers could download. I haven't checked it into the repo as it's 450 Mb. Is 'results' the correct place for it or should it remain in scratch until a more concise table is generated?

What is your summary of the results?

None yet -- @GeoffLyle will look at the outliers once filtering (mentioned above) is put in.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile. (No additional dependencies in this pull.)
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops.

adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration

TODO: check in a markdown viewer for typoes

jashapiro · 2020-02-05T14:38:17Z

@e-t-k Thanks for this! At a first glance it looks good and will be a pleasure to review, thanks to all the work you did on documentation!

I will more formally review soon, but I wanted to give a quick suggestion for one of the questions you raised. The results/rsem-tpm-stranded-gene_expression_outliers.rds file seems like it should be pretty compressible, if it is a matrix of not very many different text values. Unfortunately, pyreadr can't write compressed .rds files, but it seems like in this case the best course of action would be to write out a .tsv.gz file, which pandas can do directly For downstream consumption, this would probably be easier anyway for people not using R. I expect the compressed file will easily fit in github's limits, so putting it in results/ should be fine.

jashapiro · 2020-02-05T14:47:25Z

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

+
+U: Up outlier
+D: Down outlier
+T: in Top five percent of genes for this sample


I assume this means top 5% in expression measure, not some other metric?

Correct - updated doc.

jashapiro

This looks great! The code is well arranged and easy to follow.

My main suggestions are these:

Move output file name/path specifications to main() to simplify internal functions, and so that all output filenames are fully described in a single place.
Consider moving away from dependence on .rds as an intermediate format where it might not be needed. I suggested feather which for uncompressed data is faster while retaining interoperability with R.
For text output tables that might be consumed by other users, consider .tsv.gz, which in this case should compress quite well, as I mentioned in my earlier comment.
Consider changing the output format table to cells containing fixed width strings, with each position in teh string indicating a single component of the outlier analysis. This has the potential to simplify downstream code a bit, though it can introduce bugs if the format of that string changes. The current solution seems reasonable to me as well, so I think this might be a matter of preference. I'm happy to hear your thoughts.

Lastly, if you wouldn't mind checking the "Allow edits from maintainers" box, that will allow us to help fix little problems that might come up in CI and save you a bit of work. Since your module passed CI with flying colors, that might not be an issue, but the other thing it allows is for us to merge in any changes from the master branch once the PR is approved, so we can merge the PR back to the master branch more efficiently.

jashapiro · 2020-02-05T15:13:48Z

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

+for each gene. This allows outlier results to be concordant whether a focus
+sample was originally present in the dataset or whether it is a new sample.


This (and the explanation below) seem quite reasonable. My only question is really a kind of edge case: how many samples can be added before a set of thresholds might need to be recalculated? If I calculate thresholds and then add 50% more samples, presumably the threshold would need to be calculated.

In terms of what's mathematically necessary, adding or removing a single sample to the dataset does materially affect the thresholds (by a very small magnitude - but with 60k genes there are almost always a few that gain or lose outlier status due to this.)
If you add 50% more samples, the thresholds would definitely need to be recalculated.

The goal of all of this threshold-finagling is to gain any given sample's contribution when focusing on any other sample; but when that sample is the focus, to not include itself in the comparison.
Because there's a pretty large time gap between Treehouse compendia releases we are often using them to analyze samples that won't be themselves included in the background comparison for months, so when we analyze samples that are in the comparison we want to give them the same treatment.

if you wouldn't mind checking the "Allow edits from maintainers" box

Unfortunately I'm a bit stymied here - that checkbox isn't appearing anywhere in the PR, even after a relog (see screenshot). I'll keep an eye out in case it suddenly appears.

Hi @e-t-k, my best guess is that it has something to do with the UCSC-Treehouse permissions because this is coming from an organization’s fork/branch and not from a fork under your individual account.

@jaclyn-taroni thanks, that seems reasonable. You folks were able to add commits to my previous PR, #323, though; and I notice that that one was originally filed as a draft PR. So I might try making my next one a draft initially and seeing if that mysteriously fixes it.

jashapiro · 2020-02-05T15:38:40Z

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

+
+    print_v("Calculating outlier thresholds.")
+    thresholds = normalized_samples.apply(modified_tukey_thresholds, args=[iqr_multiplier], axis=1)
+    thresholds_path = os.path.join(scratch_dir, "{}threshold-expression-values.rds".format(prefix))


Might it be more general to take the full path as an argument here, rather than generating it internally?

I am also not sure whether .rds is the best format here. I assume you want to preserve precision, so .tsv may not be the best format. But something less language-specific might be nice, for which I might recommend feather which is built into pandas and super fast (not a concern here, but nice).

Of course, now, having written all that, I am wondering if the theshold values file is actually used anywhere? If not, and you are just using it for debugging or other checks, maybe the file write can be made optional.

You're right, the thresholds file was for debugging so I've removed it.
Feather in general looks really promising --as part of my next pull I'll look into switching to it but I'd prefer not to deal with the extra dependencies at this exact moment.

That’s fine; the switch to tsv for the main output file was more important. I don't think it would actually add a dependency, as pandas supports it, but the advantages are not important here.

jashapiro · 2020-02-05T15:50:53Z

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

+    if qc_fail:
+        result += "F" # Mark every gene as QC fail for this sample


If QC fails, should the full result string be "F"? Or is there are case where we would still want the remaining information? I guess I can imagine one, so probably worth keeping the full string. For convenience though, I might make "F" the first character for easy scanning.

Alternatively, and I think this may be a good idea overall, perhaps these strings should be all the same length? As an possible example: The first character could be P/F, Pass/Fail; the second indicate outlier status: U/D/N, Up/Down/Not Outlier; and the third T/L, Top5%/Lower95%. Then parsing the string later can be a bit simpler, as you always know which position to look at for your question.

Reading a bit further, I see that you would need one more character for the dropped/retained code, maybe D/R (if position matters, redundant letters are available, but maybe we would still want to avoid them?)

I really like your fixed-length string idea and here's the implementation I have in mind. Note that in the next pull we'll be adding the dropped reason (either Expression filter or Variance filter), so for now E will indicate any dropped genes and later V will also be a possible value.

QC Outlier 95th percentile? Dropped status

P = pass U = up T = top5% R = retained

F = fail D = down L = lower 95% E = dropped by expression filter

- N = not outlier - V = dropped by variance filter (NYI)

jashapiro · 2020-02-05T15:54:45Z

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

+
+def normalize_expression(input_path, scratch_dir, prefix="", verbose=False):
+    """Convert samples to log2(TPM) + 1 and save in scratch.
+    Note: This is duplicate work from step 01 because I forgot in that step


You can feel free to modify step 01 to do this in this PR, if you like.

analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py

@jashapiro

Addresses @jashapiro's comments in AlexsLemonade#513 . In particular: Step 01: save normalized_expression to file which I forgot to do earlier. Step 02: - ingests normalized expression file from step 1 instead of original expression file - parameters updated to reflect this - normalize_expression() no longer needed - file path handling all moved to main() - Change output file format - each cell now contains a 4-character string in a fixed order which encodes the analysis results. - update comments at top to reflect this - Output file is now a .tsv.gz instead of .rds and is included in this pull. - no longer saves thresholds file as it's not ingested anyhwere.

e-t-k · 2020-02-06T23:16:55Z

@jashapiro I've updated with a new commit that addresses your comments; thank you!

e-t-k · 2020-02-06T23:52:03Z

Looks like the CI failed in the tp53_nf1_score module but it happened to complete my analyses before it did so.

jashapiro · 2020-02-06T23:59:38Z

Hi @e-t-k! Thanks for your updates! We have a few things breaking CI at the moment as we prepare for a new data release, so don’t worry about that. Hopefully we will get things sorted out shortly. I’ll ping you if we need you to update your branch when it is fully sorted.

jaclyn-taroni · 2020-02-07T00:27:54Z

Hi @e-t-k - I just merged #523 which addressed the breaking changes. Looks like there are some conflicts in .circleci/config.yml - the tp53_nf1_score was moved up commented out for now (#527) and all of the molecular subtyping modules are now grouped together so that's what I would keep an eye out for when resolving those. Let us know if you have any questions, thanks for your patience!

…thresholds Conflicts: .circleci/config.yml

jashapiro

Looks Great!

e-t-k added 4 commits February 4, 2020 22:07

[AlexsLemonade#229] comparative rnaseq - utils bugfix

1104b61

utils.py - write_rds accidentally modified its input dataframe; now makes a copy before changing it. oops.

[AlexsLemonade#229] comparative rna-seq

75532bf

adds candidate final draft for step 2 - generate outlier thresholds and outliers. also adds it to continuous integration

[AlexsLemonade#229] comparative-rnaseq-analysis: add README.md

ec77eb1

TODO: check in a markdown viewer for typoes

[AlexsLemonade#229] comparative-rnaseq-analysis: README cleanup

00337b0

jaclyn-taroni requested a review from jashapiro February 5, 2020 11:25

jashapiro reviewed Feb 5, 2020

View reviewed changes

e-t-k added 2 commits February 6, 2020 15:08

Merge branch 'master' into generate-outlier-thresholds

a3a041f

Merge remote-tracking branch 'upstream/master' into generate-outlier-…

44c3416

…thresholds Conflicts: .circleci/config.yml

jaclyn-taroni requested a review from jashapiro February 7, 2020 13:07

jashapiro approved these changes Feb 7, 2020

View reviewed changes

jaclyn-taroni merged commit cdb59bb into AlexsLemonade:master Feb 7, 2020

e-t-k mentioned this pull request Feb 19, 2020

(Comparative RNA-Seq Analysis) Apply sample filters (QC + tumor status); convert scratch files to .feather #544

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate outlier thresholds #513

Generate outlier thresholds #513

e-t-k commented Feb 5, 2020

jashapiro commented Feb 5, 2020

jashapiro Feb 5, 2020

e-t-k Feb 6, 2020

jashapiro left a comment

jashapiro Feb 5, 2020

e-t-k Feb 5, 2020

e-t-k Feb 5, 2020

jaclyn-taroni Feb 6, 2020

e-t-k Feb 6, 2020

jashapiro Feb 5, 2020

e-t-k Feb 6, 2020

jashapiro Feb 6, 2020

jashapiro Feb 5, 2020

jashapiro Feb 5, 2020

e-t-k Feb 6, 2020

jashapiro Feb 5, 2020

e-t-k Feb 6, 2020

e-t-k commented Feb 6, 2020

e-t-k commented Feb 6, 2020

jashapiro commented Feb 6, 2020

jaclyn-taroni commented Feb 7, 2020

jashapiro left a comment

		for each gene. This allows outlier results to be concordant whether a focus
		sample was originally present in the dataset or whether it is a new sample.

		if qc_fail:
		result += "F" # Mark every gene as QC fail for this sample

QC	Outlier	95th percentile?	Dropped status
P = pass	U = up	T = top5%	R = retained
F = fail	D = down	L = lower 95%	E = dropped by expression filter
-	N = not outlier	-	V = dropped by variance filter (NYI)

Generate outlier thresholds #513

Generate outlier thresholds #513

Conversation

e-t-k commented Feb 5, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

jashapiro commented Feb 5, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

e-t-k commented Feb 6, 2020

e-t-k commented Feb 6, 2020

jashapiro commented Feb 6, 2020

jaclyn-taroni commented Feb 7, 2020

jashapiro left a comment

Choose a reason for hiding this comment