This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Plan: Road to v22 #1426

Closed
11 of 12 tasks
jaclyn-taroni opened this issue May 19, 2022 · 8 comments

Comments

@jaclyn-taroni
Member

jaclyn-taroni commented May 19, 2022

The purpose of this issue is to outline the order in which things need to be reviewed and run prior to the release of v22.

1. Reviews before release

Pull requests that are marked as blocking release and review before release should be reviewed & ideally merged in the following order. cc: @sjspielman

- Analysis files generation
- Subtyping modules need to use data/
- Subtype script & docs

2. Running subtyping

Once #1422 is in, we can use the PBTA data and analysis files from v22 to run scripts/run-for-subtyping.sh. We’ll need to create a branch to commit back any changes to files in the repository (akin to what’s happening with #1412).

Then we’ll upload pbta-histologies.tsv to S3 (related: #1424).

3. Creating CI files

Once all the v22 files are on S3, we’re ready to generate the CI files (#1369). Here’s what we need to do:

  1. Create a branch to update the create-subset-files module as required to create the CI files. I’m going to call this v22-ci for illustrative purposes.

Specifically, we need to update this line to release-v22-20220505 to reflect the new release:

RELEASE=${RELEASE:-release-v21-20210820}
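One way to make that bump without opening an editor is a one-line sed edit (assuming GNU sed's `-i`). The sketch below demos it on a scratch copy of the line; in the real repo, point `FILE` at `analyses/create-subset-files/create_subset_files.sh` instead.

```shell
# Demo of the one-line release bump on a scratch file; in the repo,
# set FILE to analyses/create-subset-files/create_subset_files.sh.
FILE=$(mktemp)
echo 'RELEASE=${RELEASE:-release-v21-20210820}' > "$FILE"
sed -i 's/release-v21-20210820/release-v22-20220505/' "$FILE"
grep 'RELEASE=' "$FILE"   # RELEASE=${RELEASE:-release-v22-20220505}
```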

Then we should check to make sure it runs locally without issue with:

RUN_LOCAL=1 ./analyses/create-subset-files/create_subset_files.sh
  2. Start an AWS instance with at least 128 GB RAM
  3. Clone AlexsLemonade/OpenPBTA-analysis and check out v22-release (#1424).
  4. Run bash download-data.sh to obtain the v22 data while on that branch.
  5. Add the relevant remote (whichever fork v22-ci was created on) and check out v22-ci, which contains the required changes to generate the v22 files for CI.
  6. Run ./analyses/create-subset-files/create_subset_files.sh.
  7. Snag the files in data/testing/release-v22-20220505; they are what we need for CI.
  8. Commit the changes to analyses/create-subset-files/biospecimen_ids_for_subset.RDS to v22-ci and push.
  9. Upload the CI files to https://s3.amazonaws.com/d3b-openaccess-us-east-1-prd-pbta/data/testing
  10. File a PR with the changes in v22-ci
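The sequence above can be sketched as a dry run; every command is echoed rather than executed, the branch names and S3 bucket are taken from this issue, and the fork remote is a placeholder. Drop the echo prefixes to run for real (on the 128 GB instance).

```shell
# Dry-run sketch of the CI-file generation steps; nothing is executed.
RELEASE=release-v22-20220505
echo "git clone https://github.com/AlexsLemonade/OpenPBTA-analysis.git"
echo "cd OpenPBTA-analysis && git checkout v22-release"
echo "bash download-data.sh"
echo "git remote add fork <fork-with-v22-ci> && git fetch fork"   # placeholder remote
echo "git checkout v22-ci"
echo "./analyses/create-subset-files/create_subset_files.sh"
echo "aws s3 cp --recursive data/testing/$RELEASE s3://d3b-openaccess-us-east-1-prd-pbta/data/testing/"
```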

We’ll use this PR to make sure there are no issues with the CI files. If there are issues, we may need to create a series of branches to fix them. If there are no issues, we can merge this.

4. Cutting a release

#1424 gets updated to include the required documentation & merged

@jharenza
Collaborator

jharenza commented May 19, 2022

@zhangb1 will work on #3 above when it's time. Cc @yuankunzhu

@jaclyn-taroni
Member Author

Wanted to note that we had some issues with medulloPackage not being installed on some Docker containers if the layer was cached (example linked over on #1425). When I went to look into that, it took almost 3 hours to build the image alone. That means we would definitely run into timeouts in CI, which could be a problem as we try to get a bunch of stuff in! So what I did in 43fa160 was consolidate all of the R package installs from Bioconductor and CRAN. We can see if that speeds up building the image. If it does, we might want that to be the first thing to go in. #1425 is a draft for now.

@jaclyn-taroni
Member Author

jaclyn-taroni commented May 21, 2022

The changes related to the analysis files generation and subtyping can go in independently. Since some of the PRs ensuring that subtyping modules use the data download are already approved & ready to merge, I'm going to start merging them.

@jaclyn-taroni
Member Author

Updated the initial comment, but to summarize what comes next:

  1. Group R package installation to try to speed up build time #1425 should be reviewed and go in to fix the deploy step in CI
  2. Update snv-callers to be able to be run from places besides the root of repo #1420 needs to be re-reviewed, approved, and merged
  3. Add shell script to generate analysis files that are included in data releases #1421 needs to be merged (currently approved & comments addressed!)
  4. We should probably rerun things with the code changes (Rerun changes added #1412) tbh
  5. Make sure HGG subtyping uses data download #1414 needs to be re-reviewed, approved, and merged
  6. Revise shell script run for subtyping such that analysis file generation is removed #1422 needs to be merged (currently approved and comments addressed)
  7. We can start on everything else (e.g., running subtyping)

@migbro
Contributor

migbro commented Jun 1, 2022

In progress - just a few notes as I work on this

  • It seems step one is really a summary, not an actual step; doing that before the rest is probably not gonna pan out
  • With a fresh EC2 instance as a starting point, def gotta download the data first before running that test
  • This script create_subset_files.sh seems to look for a structure relative to itself, so gotta cd to its location first before running
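That last note is a common pattern. A self-contained sketch of making a script immune to the caller's working directory (the temp file here is only for the demo; the real script would just carry the `cd` line near its top):

```shell
# Demo: a script that cd's into its own directory before doing anything,
# so relative paths like ../../data/ resolve against the script's
# location, not wherever the caller happens to be.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/sh
cd "$(dirname "$0")" || exit 1
pwd   # prints the script's own directory, regardless of caller's cwd
EOF
chmod +x "$script"
"$script"
```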

After making those adjustments, I got the error:

Reading in ../../data/release-v22-20220505/pbta-sv-manta.tsv.gz ...
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    619873   33.2   11776407   629.0    9048557   483.3
Vcells 269327097 2054.9 2323008920 17723.2 2753752554 21009.5
Error in sample.int(length(x), size, replace, prob) :
  invalid first argument
Calls: sample -> sample.int
Execution halted

@jharenza
Collaborator

jharenza commented Jun 1, 2022

Ok, this may be coming from not having consensus_seg_with_status.tsv in the subset, but odd that this is a manta error. Updating the code to capture the new file and testing locally now.

@jharenza
Collaborator

jharenza commented Jun 1, 2022

After updating the code to add the additional CNV file, I ran using

RUN_LOCAL=1 ./analyses/create-subset-files/create_subset_files.sh

but it gets killed:

create_subset_files.sh: line 51:  7389 Killed                  Rscript --vanilla 01-get_biospecimen_identifiers.R --data_directory $FULL_DIRECTORY --output_file $BIOSPECIMEN_FILE --num_matched $NUM_MATCHED --local $RUN_LOCAL

If I try to just copy and not subset, I also get an error about directory creation, which @migbro also mentioned he didn't have in his run.

root@334c2bfa1312:/home/rstudio/kitematic/analyses/create-subset-files# bash create_subset_files.sh 
cp: target '../../data/testing/release-v22-20220505' is not a directory

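The "target ... is not a directory" error from cp usually means the destination directory doesn't exist yet, so one possible workaround (an assumption here, since the script may be expected to create it itself) is a `mkdir -p` before the copy-only run. Demo with a scratch tree standing in for the repo layout:

```shell
# cp fails with "target ... is not a directory" when the destination
# directory is missing; creating it first fixes the copy-only run.
base=$(mktemp -d)                              # stands in for the repo root
dest="$base/data/testing/release-v22-20220505"
mkdir -p "$dest"                               # the missing step
touch "$base/pbta-histologies.tsv"             # stand-in data file
cp "$base/pbta-histologies.tsv" "$dest"/       # now succeeds
ls "$dest"
```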

@migbro
Contributor

migbro commented Jun 2, 2022

I think I am ready to push, however, I don't have that kind of access to this project. Can I be granted such access please? Thanks!
