Store analysis output-files from Balsamic in Housekeeper #475

henrikstranneheim · 2019-11-15T13:03:29Z

As a TA member, I would like to have files from Balsamic runs added to HK so that they are easy to find, deliver, etc.

Problem: the Balsamic pipeline is now being set up. The output files from this shall be stored in Housekeeper just as the files from MIP.

Information:
Dir-structure for Balsamic
fastq's: /home/proj/production/cancer/analysis/case-id/fastq/
results: /home/proj/production/cancer/analysis/case-id/analysis/
job logs: /home/proj/production/cancer/analysis/case-id/logs/
job scripts: /home/proj/production/cancer/analysis/case-id/scripts/
Internal BALSAMIC log: /home/proj/production/cancer/analysis/case-id/BALSAMIC_run

How to know when an analysis is finished:
I asked Hassan to make Balsamic create a file when an analysis is completed, for example 'analysis_finish' in case-directory (we need to know if this is on sample or case level)
He has created an issue on GitHub:
Clinical-Genomics/BALSAMIC#143
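
A minimal sketch of how cg could detect finished analyses, assuming Balsamic writes a marker file named 'analysis_finish' directly in the case directory. Both the file name and the case-level placement are still open questions in this issue, and the function name `find_finished_cases` is hypothetical.

```python
from pathlib import Path
from typing import List

# Assumption: marker name taken from the example in this issue; the final name is not decided.
FINISH_MARKER = "analysis_finish"


def find_finished_cases(root_dir: Path, marker: str = FINISH_MARKER) -> List[str]:
    """Return the names of all case directories under root_dir that contain the marker file."""
    finished = []
    for case_dir in sorted(root_dir.iterdir()):
        # Only directories count as cases; stray files in the root are skipped.
        if case_dir.is_dir() and (case_dir / marker).is_file():
            finished.append(case_dir.name)
    return finished
```

If the marker ends up on sample level instead, the same check would have to go one directory deeper.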

Suggested solution:

Use Balsamic root directory in cg config file
Let 'store-completed.sh' also store Balsamic files in HK
Let cg store analysis use Analysis Type to figure out what to store.
Add an extra tag to the files in HK to differentiate between pipelines, for example mip/balsamic - apply tags also to old files
Make sure the 'cg store' crontab is working
Parse the .hk file for things to store in housekeeper. Example /home/proj/production/cancer/cases/rapidghost/analysis/delivery_report/rapidghost.hk
Check existing code for interactions with bundles to make sure that we don't introduce bugs when using the latest bundle version. What happens if there are two bundle versions with the same date? -> Solve this by having the order portal make two cases if Balsamic and MIP are ordered for the same sample, and add data analysis on case level (new user story @ingkebil).
Move the analysis folder to trash at the end of storage.
Document in the Balsamic method.
Make a US that adds cg store completed.
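
The Analysis Type and pipeline-tag items in the list above could look roughly like this. It is only a sketch: the mapping `PIPELINE_TAGS`, the function `add_pipeline_tag` and the analysis type values are assumptions, not existing cg or Housekeeper code.

```python
from typing import Dict, List

# Hypothetical mapping from analysis type to the pipeline tag added to files in HK.
PIPELINE_TAGS: Dict[str, str] = {"mip": "mip", "balsamic": "balsamic"}


def add_pipeline_tag(file_tags: List[str], analysis_type: str) -> List[str]:
    """Return a copy of file_tags with the pipeline tag appended if it is missing."""
    try:
        pipeline_tag = PIPELINE_TAGS[analysis_type.lower()]
    except KeyError:
        raise ValueError(f"unknown analysis type: {analysis_type!r}") from None
    if pipeline_tag in file_tags:
        # Already tagged, e.g. when re-tagging old files a second time.
        return list(file_tags)
    return [*file_tags, pipeline_tag]
```

Because the function does not add a tag twice, running it over old files more than once should be harmless.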

There will be a delivery_report directory in /home/proj/production/cancer/analysis//analysis/ with a file .hk which is a yaml-file with keys:

vcf
bam
multiqc
scout
other
    pdfs

Check if tests exist in cg: other US will be made by KB

Q: What do we need to store?
A: We should ask HFA before starting the US
Q2: is the .hk already created?
A2:

DoD:
Files from Balsamic are stored in HK after finished analysis
A pipeline-tag has been added to the files in HK
Verified

How to demo:
Show
$ housekeeper get balsamic case bundle


hassanfa · 2019-12-17T15:30:13Z

@henrikstranneheim This is the structure I have for BALSAMIC. The files section is what Housekeeper should store, and the key names will be the directory names, to keep things organized and tidy.

How do you have it in MIP?

Notes: Add keys under bam to have sampleIDs (tumor, normal, etc)

files:
  bam:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/bam/tumor.sorted.mrkdup.ralgn.bsrcl.merged.bam
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/bam/tumor.merged.bam
  cnv:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/cnv/tumor.merged-scatter.pdf
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/cnv/tumor.merged-diagram.pdf
  qc:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/qc/multiqc_data
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/qc/multiqc_report.html
  scout:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/scout/benchmark_filename_revamp_single.scout.yaml
  vcf:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vcf/SNV.germline.S1_R.haplotypecaller.vcf.gz
  vep:
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.germline.S1_R.haplotypecaller.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SV.germline.S1_R.manta_germline.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.mutect.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vcfmerge.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vardict.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.germline.S1_R.strelka_germline.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SV.somatic.benchmark_filename_revamp_single.manta.vcf.gz
  - /home/hassan.foroughi/repos/BALSAMIC/run_tests/benchmark_filename_revamp_single/analysis/vep/SNV.somatic.benchmark_filename_revamp_single.vcfmerge.balsamic_stat
meta:
  key_1:
  key_2:
  key_3:

henrikstranneheim · 2020-01-21T13:15:40Z

Here is my suggestion:
In file <case>_deliverables.yaml:

files:
  - format: <fastq|bam|cram|bcf|vcf|meta|"whatever_makes_sense"> "#String|required"
    id: <sample_id|case_id|project_id> "#String|required"
    path: /a/absolute/path/to/analysis_file "#String|required"
    path_index: /a/path/to/analysis_file.index_suffix "#String|optional"
    step: gatk_baserecalibration "#String|required"
    tag: <file_tag> "#String|optional"
  - format: bam
    id: sample_1
    path: /mip_analysis/sample_1/bwa_mem/sample_1.bam
    path_index: /mip_analysis/sample_1/bwa_mem/sample_1.bam.bai
    step: gatk_baserecalibration
    tag: igv_bam

The point here is that the file produced by the workflows can list from 1 to n files, where each file entry has a set of mandatory keys and values (if we like), but also other keys describing the file whenever appropriate. I think this will be more maintainable and extensible while still bringing consistency across workflows and the code that operates on this file.
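
To illustrate the required/optional split, here is a minimal sketch of how code reading this file could validate it. It assumes the YAML has already been loaded into a dict; the names `DeliverableFile` and `parse_deliverables` are hypothetical, and ignoring unknown keys is just one possible choice.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional

REQUIRED_KEYS = ("format", "id", "path", "step")


@dataclass(frozen=True)
class DeliverableFile:
    """One entry under 'files' in <case>_deliverables.yaml."""

    format: str
    id: str
    path: str
    step: str
    path_index: Optional[str] = None
    tag: Optional[str] = None


def parse_deliverables(content: Dict[str, Any]) -> List[DeliverableFile]:
    """Validate already-loaded deliverables content and return one object per file entry."""
    known = {field.name for field in fields(DeliverableFile)}
    parsed = []
    for number, entry in enumerate(content.get("files") or [], start=1):
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            raise ValueError(f"file entry {number} lacks required keys: {missing}")
        # Extra descriptive keys are allowed by the suggestion; this sketch drops them.
        parsed.append(DeliverableFile(**{k: v for k, v in entry.items() if k in known}))
    return parsed
```

A workflow that forgets a mandatory key then fails early with a message naming the entry, rather than later when the file is stored.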

@hassanfa @jemten @barrystokman @patrikgrenfeldt @sylvinite What do you think of this?

hassanfa · 2020-01-21T14:01:09Z

Looks good. I like this! It is quite informative.

Two comments:

I'm not sure about path_index: not all files have indexes (only bam, vcf, etc. have those). I would remove path_index, and add bam_index|vcf_index or just index to format.
Another vague part is level. Would it make sense to change the word to something a bit more meaningful? level sounds more like depth of directory. I don't have any suggestion for a rename, though.

henrikstranneheim · 2020-01-21T14:23:25Z

The path_index would most likely not be mandatory, but is probably a good idea to have it in the same hash entry as the path which it is indexing for. Meaning that housekeeper could pick it up if it expects it, but skip it otherwise.
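
Roughly what I mean, as a sketch. The set `FORMATS_WITH_INDEX` and the function `paths_to_store` are made up for illustration; which formats the consumer actually expects an index for is an assumption here.

```python
from typing import Dict, List, Optional

# Assumption: formats for which the consumer expects an index file.
FORMATS_WITH_INDEX = {"bam", "cram", "vcf", "bcf"}


def paths_to_store(entry: Dict[str, Optional[str]]) -> List[str]:
    """Return the file path, followed by its index path when one is given and expected."""
    paths = [entry["path"]]
    index = entry.get("path_index")
    # Pick the index up only if it is present and the format is one we expect an index for.
    if index and entry["format"] in FORMATS_WITH_INDEX:
        paths.append(index)
    return paths
```

Since the index sits in the same entry as the path it belongs to, no file name matching is needed to pair them.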

Yeah, I know the level is a bit vague, but I had a hard time coming up with a good name. Maybe processing_level and another key, processing_ids. We have to discuss.

hassanfa · 2020-01-22T14:42:11Z

So we leave the logic to housekeeper to handle path_index, etc.

level doesn't say if it is case level or sample level. How about, say, we call it accession OR name OR id instead? sample, case, project can be other entries to reflect if this is a case, sample or project level.

henrikstranneheim · 2020-01-22T14:44:34Z

Sounds good! I updated the suggestion

hassanfa · 2020-01-22T14:47:39Z

👍 I'll start working on balsamic's side.

henrikstranneheim · 2020-01-22T14:52:34Z

@patrikgrenfeldt @barrystokman Do you have any comments before we start working on the code to produce the file?

barrystokman · 2020-01-27T07:58:35Z

Discussed last Thursday, good idea. Issue for work in cg regarding MIP-RNA here.

edit: specified pipeline

patrikgrenfeldt · 2020-01-27T12:54:23Z

I suggest we call the root node (currently called store) something intuitive for the usage, maybe "output-files", "outputs", "files" or "deliverables". What do you say?

hassanfa · 2020-01-27T13:01:47Z

I prefer 'output-files' or 'files'.

henrikstranneheim · 2020-01-27T15:28:41Z

Let's do 'files' then.

emiliaol · 2020-01-28T10:03:38Z

I'd prefer deliverables. Easier for me to understand which file to use for storing results if I have to do it manually

barrystokman · 2020-01-28T10:07:35Z

@emiliaol you're talking about the name of the file, which has not been decided yet. You indicated to me that <case>_deliverables.yaml would work best for production, so I would like to put that suggestion forward right now.

emiliaol · 2020-01-28T10:08:48Z

Aha! It was for inside the file. That I won't care about :)

henrikstranneheim · 2020-01-28T12:42:08Z

Sounds good! Updated suggestion

henrikstranneheim · 2020-01-29T08:05:58Z

I added a new key to the suggestion "tag". This is needed to distinguish between files produced in the same step e.g. multiqc html and json reports or mip_analyse config or log.
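
A small sketch of why the tag helps when looking up a file. The function `find_file` and the example paths in the usage note are hypothetical; the entries are assumed to be already-loaded dicts from the files list.

```python
from typing import Dict, List, Optional


def find_file(files: List[Dict[str, str]], step: str, tag: Optional[str] = None) -> str:
    """Return the path of the single file entry matching step and, if given, tag."""
    matches = [
        entry["path"]
        for entry in files
        if entry["step"] == step and (tag is None or entry.get("tag") == tag)
    ]
    if len(matches) != 1:
        # Zero or several hits: the step alone (or step plus tag) is not unique.
        raise LookupError(f"expected one file for step={step!r}, tag={tag!r}, found {len(matches)}")
    return matches[0]
```

With two multiqc entries tagged html and json, `find_file(files, "multiqc", "html")` returns the html report, while `find_file(files, "multiqc")` raises because the step alone is ambiguous.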

emiliaol added the Balsamic label (issues related to the Balsamic workflow) Dec 11, 2019

hassanfa added the MIP label (issues related to the MIP pipeline) Jan 22, 2020

henrikstranneheim mentioned this issue Jan 29, 2020

Feature/store refactored Clinical-Genomics/MIP#1323

Merged

3 tasks

hassanfa mentioned this issue Feb 13, 2020

Store balsamic #551

Merged

17 tasks

Mropat linked a pull request Jul 14, 2020 that will close this issue

feat/refactor cg workflow balsamic #687

Merged

23 tasks

moonso closed this as completed in #687 Sep 2, 2020
