Store analysis output-files from Balsamic in Housekeeper #475

Closed
henrikstranneheim opened this issue Nov 15, 2019 · 17 comments · Fixed by #687
Labels
Balsamic Issued related to the Balsamic workflow MIP issues related to the MIP pipeline

Comments

@henrikstranneheim
Contributor

As a TA member, I would like to have files from Balsamic runs added to HK so that they are easy to find, deliver, etc.

Problem: The Balsamic pipeline is now being set up. Its output files should be stored in Housekeeper, just like the files from MIP.

Information:
Directory structure for Balsamic:
fastqs: /home/proj/production/cancer/analysis/case-id/fastq/
results: /home/proj/production/cancer/analysis/case-id/analysis/
job logs: /home/proj/production/cancer/analysis/case-id/logs/
job scripts: /home/proj/production/cancer/analysis/case-id/scripts/
internal BALSAMIC log: /home/proj/production/cancer/analysis/case-id/BALSAMIC_run

How to know when an analysis is finished:
I asked Hassan to make Balsamic create a file when an analysis is completed, for example 'analysis_finish' in the case directory (we need to know whether this is at sample or case level).
He has created an issue on GitHub:
Clinical-Genomics/BALSAMIC#143
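
A minimal sketch of how such a marker could be checked, assuming the file is named 'analysis_finish' as in the example above and is written directly into the case directory. The marker name, its location, and both function names are assumptions for illustration, not an agreed interface.

```python
from pathlib import Path

# Hypothetical marker name, taken from the example in this issue.
FINISH_MARKER = "analysis_finish"


def is_analysis_finished(case_dir: Path, marker: str = FINISH_MARKER) -> bool:
    """Return True if the marker file exists directly in the case directory."""
    return (case_dir / marker).is_file()


def finished_cases(root_dir: Path, marker: str = FINISH_MARKER) -> list:
    """Return the sorted names of case directories that contain the marker."""
    return sorted(
        case_dir.name
        for case_dir in root_dir.iterdir()
        if case_dir.is_dir() and is_analysis_finished(case_dir, marker)
    )
```

If the marker ends up being written per sample rather than per case, the same check would have to run one directory level further down.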

Suggested solution:

Use the Balsamic root directory in the cg config file.
Let 'store-completed.sh' also store Balsamic files in HK.
Let cg store analysis use Analysis Type to figure out what to store.
Add an extra tag to the files in HK to differentiate between pipelines, for example mip/balsamic. Apply the tags to old files as well.
Make sure the 'cg store' crontab is working.
Parse the .hk file for things to store in Housekeeper. Example: /home/proj/production/cancer/cases/rapidghost/analysis/delivery_report/rapidghost.hk
Check existing code for interactions with bundles to make sure that we don't introduce bugs when using the latest bundle version. What happens if there are two bundle versions with the same date? Solve this by having the order portal create two cases if Balsamic and MIP are ordered for the same sample, and add data analysis at case level (new user story, @ingkebil).
Add a move to trash of the analysis folder at the end of storage.
Document in the Balsamic method.
Make a US that adds cg store completed.
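
The pipeline tag from the list above could be applied along these lines. This is only a sketch: the function name is hypothetical, the tag values are the 'mip' and 'balsamic' examples from this issue, and how the analysis type is actually spelled in cg is an assumption.

```python
# Tag values suggested in this issue; anything else is rejected.
KNOWN_PIPELINES = ("mip", "balsamic")


def with_pipeline_tag(tags: list, analysis_type: str) -> list:
    """Return a copy of tags with the pipeline tag appended exactly once."""
    pipeline = analysis_type.strip().lower()
    if pipeline not in KNOWN_PIPELINES:
        raise ValueError(f"unknown analysis type: {analysis_type!r}")
    result = list(tags)  # copy, so the caller's list is left untouched
    if pipeline not in result:
        result.append(pipeline)
    return result
```

Because the function is idempotent, it could also be run over old files that may or may not already carry a pipeline tag.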

There will be a delivery_report directory in /home/proj/production/cancer/analysis//analysis/ with a file .hk which is a yaml-file with keys:

vcf
bam
multiqc
scout
other
    pdfs

Check if tests exist in cg: other US will be made by KB

Q: What do we need to store?
A: We should ask HFA before starting the US
Q2: is the .hk already created?
A2:

DoD:
Files from Balsamic are stored in HK after finished analysis
A pipeline-tag has been added to the files in HK
Verified

How to demo:
Show
$ housekeeper get balsamic case bundle

@emiliaol emiliaol added the Balsamic Issued related to the Balsamic workflow label Dec 11, 2019
@hassanfa
Contributor

hassanfa commented Dec 17, 2019

@henrikstranneheim This is the structure I have for BALSAMIC. The files section is what Housekeeper should store, and the key names will be the directory names, to keep them organized and tidy.

How do you have it in MIP?

Notes: Add keys under bam to have sampleIDs (tumor, normal, etc)

files:
  bam:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/bam/tumor.sorted.mrkdup.ralgn.bsrcl.merged.bam
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/bam/tumor.merged.bam
  cnv:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/cnv/tumor.merged-scatter.pdf
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/cnv/tumor.merged-diagram.pdf
  qc:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/qc/multiqc_data
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/qc/multiqc_report.html
  scout:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/scout/benchmark_filename_revamp_single.scout.yaml
  vcf:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vcf/SNV.germline.S1_R.haplotypecaller.vcf.gz
  vep:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.germline.S1_R.haplotypecaller.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SV.germline.S1_R.manta_germline.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.mutect.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vcfmerge.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vardict.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.germline.S1_R.strelka_germline.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SV.somatic.benchmark_filename_revamp_single.manta.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vcfmerge.balsamic_stat
meta:
  key_1:
  key_2:
  key_3:
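
For illustration, a small sketch of how a consumer could turn this directory-keyed layout into one record per file. It assumes the YAML has already been loaded into a plain dict (for example with a YAML parser), and the function name and the 'tag'/'path' record keys are hypothetical.

```python
def flatten_files(document: dict) -> list:
    """Turn {'files': {key: [paths]}} into a list of {'tag', 'path'} records."""
    entries = []
    # An empty or missing 'files' section simply yields no records.
    for key, paths in (document.get("files") or {}).items():
        for path in paths or []:
            entries.append({"tag": key, "path": path})
    return entries
```

The meta section is ignored here, since its keys are still placeholders.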

@henrikstranneheim
Contributor Author

henrikstranneheim commented Jan 21, 2020

Here is my suggestion:
In file <case>_deliverables.yaml:

files:
  - format: <fastq|bam|cram|bcf|vcf|meta|"whatever_makes_sense"> "#String|required"
    id: <sample_id|case_id|project_id> "#String|required"
    path: /a/absolute/path/to/analysis_file "#String|required"
    path_index: /a/path/to/analysis_file.index_suffix "#String|optional"
    step: gatk_baserecalibration "#String|required"
    tag: <file_tag> "#String|optional"
  - format: bam
    id: sample_1
    path: /mip_analysis/sample_1/bwa_mem/sample_1.bam
    path_index: /mip_analysis/sample_1/bwa_mem/sample_1.bam.bai
    step: gatk_baserecalibration
    tag: igv_bam

The point here is that the file produced by the workflows can list from 1 to n files, where each file has a set of mandatory keys and values (if we like) but can also have other keys describing it whenever appropriate. I think this will be more maintainable and extensible while still bringing consistency across workflows and the code that operates on this file.
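
A rough sketch of how the required/optional split could be checked on the consuming side. It assumes each entry has already been loaded from YAML into a dict; the function name and the error messages are made up for illustration, and extra keys are deliberately allowed.

```python
# Keys as marked "#String|required" and "#String|optional" in the suggestion.
REQUIRED_KEYS = ("format", "id", "path", "step")
OPTIONAL_KEYS = ("path_index", "tag")


def validate_entry(entry: dict) -> list:
    """Return a list of problems for one file entry; an empty list means valid."""
    problems = []
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty required key: {key}")
    for key in OPTIONAL_KEYS:
        if key in entry and not isinstance(entry[key], str):
            problems.append(f"optional key is not a string: {key}")
    path = entry.get("path")
    if isinstance(path, str) and path and not path.startswith("/"):
        problems.append("path is not absolute")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a deliverables file in one go.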

@hassanfa @jemten @barrystokman @patrikgrenfeldt @sylvinite What do you think of this?

@hassanfa
Contributor

Looks good. I like this! It is quite informative.

Two comments:

  • I'm not sure about path_index: not all files have indexes (only bam, vcf, etc. have those). I would remove path_index, and add bam_index|vcf_index or just index to format.

  • Another vague part is level. Would it make sense to change the word to something a bit more meaningful? level sounds more like depth of directory. I don't have any suggestion for a rename, though.

@henrikstranneheim
Contributor Author

The path_index would most likely not be mandatory, but it is probably a good idea to have it in the same hash entry as the path it indexes. That way Housekeeper can pick it up if it expects it, and skip it otherwise.
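
Roughly, the "pick it up if expected, skip otherwise" logic could look like the sketch below. The set of formats that expect an index and the function name are assumptions for illustration, not existing Housekeeper behaviour.

```python
# Hypothetical: formats for which the consumer wants the index file as well.
INDEXED_FORMATS = frozenset({"bam", "cram", "vcf", "bcf"})


def paths_to_store(entry: dict) -> list:
    """Return the file path, plus its index when one is expected and present."""
    paths = [entry["path"]]
    index = entry.get("path_index")
    if entry.get("format") in INDEXED_FORMATS and index:
        paths.append(index)
    return paths
```

Since path and path_index sit in the same entry, nothing has to be matched up by file name afterwards.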

Yeah, I know the level is a bit vague, but I had a hard time coming up with a good name. Maybe processing_level and another key, processing_ids. We have to discuss.

@hassanfa hassanfa added the MIP issues related to the MIP pipeline label Jan 22, 2020
@hassanfa
Contributor

So we leave the logic to housekeeper to handle path_index, etc.

level doesn't say whether it is case level or sample level. How about calling it accession, name, or id instead? sample, case, and project can then be other entries that reflect whether this is at case, sample, or project level.

@henrikstranneheim
Contributor Author

Sounds good! I updated the suggestion

@hassanfa
Contributor

👍 I'll start working on balsamic's side.

@henrikstranneheim
Contributor Author

@patrikgrenfeldt @barrystokman Do you have any comments before we start working on the code to produce the file?

@barrystokman
Contributor

barrystokman commented Jan 27, 2020

Discussed last Thursday, good idea. Issue for work in cg regarding MIP-RNA here.

edit: specified pipeline

@patrikgrenfeldt
Contributor

patrikgrenfeldt commented Jan 27, 2020

I suggest we call the root node (currently called store) something intuitive for its usage, maybe "output-files", "outputs", "files" or "deliverables". What do you say?

@hassanfa
Contributor

I prefer 'output-files' or 'files'.

@henrikstranneheim
Contributor Author

Let's do 'files' then.

@emiliaol
Contributor

I'd prefer deliverables. Easier for me to understand which file to use for storing results if I have to do it manually

@barrystokman
Contributor

@emiliaol you're talking about the name of the file, which has not been decided yet. You indicated to me that <case>_deliverables.yaml would work best for production, so I would like to put that suggestion forward right now.

@emiliaol
Contributor

Aha! It was for inside the file. That I won't care about :)

@henrikstranneheim
Contributor Author

Sounds good! Updated suggestion

@henrikstranneheim
Contributor Author

I added a new key to the suggestion: "tag". This is needed to distinguish between files produced in the same step, e.g. the multiqc html and json reports, or the mip_analyse config and log.
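
As a sketch of why the tag helps: with it, a consumer can pick one specific file out of several from the same step. The function name and the tag values in the usage note are hypothetical.

```python
def find_files(entries: list, step: str, tag: str = None) -> list:
    """Return entries from the given step, optionally narrowed down by tag."""
    return [
        entry
        for entry in entries
        if entry.get("step") == step and (tag is None or entry.get("tag") == tag)
    ]
```

For example, `find_files(entries, "multiqc", tag="html")` would return only the html report, while `find_files(entries, "multiqc")` would return every file from that step.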

@hassanfa hassanfa mentioned this issue Feb 13, 2020
17 tasks
@Mropat Mropat linked a pull request Jul 14, 2020 that will close this issue
23 tasks