This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add shell script to generate analysis files that are included in data releases #1421

Merged
23 commits
3b532e3
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni Mar 24, 2022
348b277
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni Apr 7, 2022
3decf7c
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni Apr 12, 2022
86aabd0
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni May 10, 2022
0c89713
Merge remote-tracking branch 'origin/master'
jaclyn-taroni May 11, 2022
bd56016
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni May 11, 2022
300450e
Remove `short_histology` from TMB count function
jaclyn-taroni May 12, 2022
14a329f
Merge remote-tracking branch 'upstream/master'
jaclyn-taroni May 18, 2022
827fff4
Rework logic and naming for release in focal CN
jaclyn-taroni May 11, 2022
2351eba
Rework logic and naming for fusion filtering
jaclyn-taroni May 11, 2022
a154520
Rework logic and naming for independent samples
jaclyn-taroni May 11, 2022
dc7cd77
change paths
runjin326 May 16, 2022
a15f12b
modify path more
runjin326 May 17, 2022
4e7ed59
Remove Rscripts reliance on analysis directory for inputs
jaclyn-taroni May 19, 2022
3b40a91
Merge branch 'jaclyn-taroni/rm-short-hist-tmb' into jaclyn-taroni/139…
jaclyn-taroni May 19, 2022
4e53522
Add option for running with base file for release
jaclyn-taroni May 19, 2022
1fdccb2
Merge branch 'jaclyn-taroni/1399-snv-callers' into jaclyn-taroni/1399…
jaclyn-taroni May 19, 2022
c38158c
Add a shell script for generating analysis files for release
jaclyn-taroni May 11, 2022
0d27baa
additional bugs fixed
runjin326 May 18, 2022
248ac8e
Use base file in PBTA SNV callers step
jaclyn-taroni May 19, 2022
dfa9ffe
Add a step where we create a checksum file
jaclyn-taroni May 19, 2022
ffd13cf
Move compiled directory creation
jaclyn-taroni May 20, 2022
c7753c2
Merge remote-tracking branch 'upstream/master' into jaclyn-taroni/139…
jaclyn-taroni May 24, 2022

123 changes: 123 additions & 0 deletions scripts/generate-analysis-files-for-release.sh
@@ -0,0 +1,123 @@
#!/bin/bash

set -e
set -o pipefail

# Set the working directory to the directory of this file
cd "$(dirname "${BASH_SOURCE[0]}")"

# If RUN_LOCAL is used, the time-intensive steps are skipped because they cannot
Member

and memory-intensive!

# be run on a local computer -- the idea is that setting RUN_LOCAL=1 will allow for
# local running/testing of all other steps
RUN_LOCAL=${RUN_LOCAL:-0}

# Get base directory of project
cd ..
BASEDIR="$(pwd)"
cd -

analyses_dir="$BASEDIR/analyses"
data_dir="$BASEDIR/data"
scratch_dir="$BASEDIR/scratch"

# Compile all the files that need to be included in the release in one place
# in the scratch directory
compiled_dir=${scratch_dir}/analysis_files_for_release
mkdir -p ${compiled_dir}

# Collapsed RNA-seq files
echo "Create collapsed RSEM files"
bash ${analyses_dir}/collapse-rnaseq/run-collapse-rnaseq.sh

# Create the independent sample list using the *BASE* histology file
echo "Create independent sample list"
OPENPBTA_BASE_RELEASE=1 bash ${analyses_dir}/independent-samples/run-independent-samples.sh

# Fusion filtering
echo "Create fusion filtered list"
OPENPBTA_BASE_RELEASE=1 bash ${analyses_dir}/fusion_filtering/run_fusion_merged.sh

# Fusion summary
echo "Run fusion summary for subtypes"
bash ${analyses_dir}/fusion-summary/run-new-analysis.sh

# Copy over collapsed RNA-seq files
cp ${analyses_dir}/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds ${compiled_dir}
cp ${analyses_dir}/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds ${compiled_dir}

# Copy over independent specimen lists
cp ${analyses_dir}/independent-samples/results/independent-specimens.* ${compiled_dir}

# Copy over fusions lists
cp ${analyses_dir}/fusion_filtering/results/pbta-fusion-putative-oncogenic.tsv ${compiled_dir}
cp ${analyses_dir}/fusion_filtering/results/pbta-fusion-recurrently-fused-genes-* ${compiled_dir}

# Copy over fusion summary
cp ${analyses_dir}/fusion-summary/results/* ${compiled_dir}

# Run modules that cannot be run locally due to memory requirements
if [ "$RUN_LOCAL" -lt "1" ]; then

# Run SNV consensus & TMB step for PBTA data
echo "Run SNV callers module for PBTA data"
OPENPBTA_BASE_RELEASE=1 bash ${analyses_dir}/snv-callers/run_caller_consensus_analysis-pbta.sh

# Run SNV consensus & TMB step for TCGA data
echo "Run SNV callers module for TCGA data"
bash ${analyses_dir}/snv-callers/run_caller_consensus_analysis-tcga.sh

# Copy over SNV callers
## PBTA
cp ${analyses_dir}/snv-callers/results/consensus/pbta-snv-consensus-mutation.maf.tsv.gz ${compiled_dir}
cp ${analyses_dir}/snv-callers/results/consensus/pbta-snv-mutation-tmb-coding.tsv ${compiled_dir}
cp ${analyses_dir}/snv-callers/results/consensus/pbta-snv-mutation-tmb-all.tsv ${compiled_dir}
## TCGA
cp ${analyses_dir}/snv-callers/results/consensus/tcga-snv-consensus-snv.maf.tsv.gz ${compiled_dir}
cp ${analyses_dir}/snv-callers/results/consensus/tcga-snv-mutation-tmb-coding.tsv ${compiled_dir}
cp ${analyses_dir}/snv-callers/results/consensus/tcga-snv-mutation-tmb-all.tsv ${compiled_dir}

# Run hotspot detection
echo "Run hotspots detection"
bash ${analyses_dir}/hotspots-detection/run_overlaps_hotspots.sh
Comment on lines +79 to +81
Member

Noting that this module (and output file pbta-snv-scavenged-hotspots.maf.tsv.gz) appears to be missing from here but it should be there, right?

Member Author

Yes they should, but maybe that should get fixed on #1424 and not here?

Member

sjspielman May 20, 2022

I'll open an issue. Edit: #1428


# Copy over hotspots detection
cp ${analyses_dir}/hotspots-detection/results/pbta-snv-scavenged-hotspots.maf.tsv.gz ${compiled_dir}

# Run consensus CN caller step
echo "Run CNV consensus"
bash ${analyses_dir}/copy_number_consensus_call/run_consensus_call.sh

# Copy over CNV consensus
cp ${analyses_dir}/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz ${compiled_dir}

# Run GISTIC step -- only the part that generates ZIP file
echo "Run GISTIC"
# Run a step that subs ploidy for NA to allow GISTIC to run
Member

Just a question about whether the CI `if` lines might be needed here.

# If this is CI, run the example included with GISTIC
# The sample size for the subset files are too small otherwise
IS_CI=${OPENPBTA_CI:-0}
if [[ "$IS_CI" -gt "0" ]]
then
# Environmental variables for MCR
ORIG_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mcr/v83/runtime/glnxa64:/opt/mcr/v83/bin/glnxa64:/opt/mcr/v83/sys/os/glnxa64
export XAPPLRESDIR=/opt/mcr/v83/X11/app-defaults
# We want this to fail if the GISTIC example fails only -- because we have
# some instances of running GISTIC that do not complete but do save some
# output
set -e
set -o pipefail
# Run the example that comes with GISTIC - that allows us to
cd /home/rstudio/gistic_install && ./run_gistic_example
# 'Undo' environmental variables for MCR
export LD_LIBRARY_PATH=$ORIG_LD_LIBRARY_PATH
unset XAPPLRESDIR

Member Author

I don't follow. I don't necessarily expect this script (scripts/generate-analysis-files-for-release.sh) to get run in CI.

We don't use analyses/run-gistic/run-gistic-module.sh at all here.

Member

sjspielman May 20, 2022

> I don't necessarily expect this script (scripts/generate-analysis-files-for-release.sh) to get run in CI.

Just looking for a confirmation of this, and whether it might ever change! 👍

Rscript ${analyses_dir}/run-gistic/scripts/prepare_seg_for_gistic.R \
--in_consensus ${analyses_dir}/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz \
--out_consensus ${analyses_dir}/run-gistic/results/pbta-cnv-consensus-gistic-only.seg.gz \
--histology ${data_dir}/pbta-histologies-base.tsv

# This will use the file that just got generated above
bash ${analyses_dir}/run-gistic/scripts/run-gistic-openpbta.sh

# Copy over GISTIC
cp ${analyses_dir}/run-gistic/results/pbta-cnv-consensus-gistic.zip ${compiled_dir}

# Run step that generates "most focal CN" files (annotation) using the *BASE* histology file
echo "Run focal CN file preparation"
OPENPBTA_BASE_RELEASE=1 bash ${analyses_dir}/focal-cn-file-preparation/run-prepare-cn.sh

# Copy over focal CN
cp ${analyses_dir}/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz ${compiled_dir}
cp ${analyses_dir}/focal-cn-file-preparation/results/consensus_seg_annotated_cn_x_and_y.tsv.gz ${compiled_dir}

fi

# Create an md5sum file for all the files in the directory where the analysis
# files are compiled
cd ${compiled_dir}
# Remove old file if it exists
rm -f analysis_files_md5sum.txt
# Create a new md5sum.txt file
md5sum * > analysis_files_md5sum.txt
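
Anyone who later receives the compiled directory could check its contents against `analysis_files_md5sum.txt` before using the files. The following is a minimal sketch of that check and is not part of this PR: the helper name `verify_release_checksums` and the dummy files are hypothetical, and it assumes GNU coreutils `md5sum` (for the `--check` and `--quiet` options).

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not part of the PR): verify every file listed in a
# checksum file that sits in the same directory as the files it describes.
verify_release_checksums() {
  local dir="$1"
  local checksum_file="${2:-analysis_files_md5sum.txt}"
  # Use a subshell so the caller's working directory is left unchanged
  (
    cd "$dir"
    # --quiet suppresses the per-file OK lines; the exit status is non-zero
    # if any listed file is missing or does not match its checksum
    md5sum --check --quiet "$checksum_file"
  )
}

# Self-contained demonstration with a throwaway directory and dummy files
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
printf 'a\n' > "$demo_dir/file-a.tsv"
printf 'b\n' > "$demo_dir/file-b.tsv"

# Same pattern as the script above: remove the old checksum file first so it
# is not matched by the glob, then checksum everything in the directory
(
  cd "$demo_dir"
  rm -f analysis_files_md5sum.txt
  md5sum * > analysis_files_md5sum.txt
)

verify_release_checksums "$demo_dir" && echo "checksums OK"
```

Against a real run, the call would presumably be `verify_release_checksums scratch/analysis_files_for_release` from the project root; a non-zero exit status would indicate at least one missing or altered file.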