
File streaming problems from s3? #744

Closed
rronen opened this issue Feb 3, 2015 · 23 comments



rronen commented Feb 3, 2015

Hi Brad/Rory/John,

I've been trying out bcbio_vm on AWS (tumor/normal calling on a couple BAMs from s3 bucket). I was able to set up and start a cluster, and ssh into the head-node to start the run. However, I am getting different errors in the alignment step. First it was a paired read missing ("samblaster: Can't find first and/or second of pair in sam block of length 1 for id: X:XXX:XX:X:XX samblaster: Are you sure the input is sorted by read ids?"); after deleting the 'work' dir and restarting the run I got a different error ("pysam: SEQ and QUAL are of different lengths").

Running the same input BAMs & config on a local bcbio-nextgen full install (not bcbio_vm) I don't get any of these errors. I suspect something may be going wrong related to streaming the input files from s3 (perhaps related to gof3r?), although this is a total guess.

Just wanted to report this and see if you've run into anything similar, and if so what you'd recommend.

(Unrelated question: is there a simple way to directly re-run certain commands with the docker setup? This would be pretty straightforward on a full/local install since all the tools & data are accessible, wondering if there's an equivalent "debug procedure" with docker -- apologies in advance if this is already answered somewhere in the docs)


roryk commented Feb 3, 2015

Hi Roy,

Thanks for posting about the issue. We’ve been working out the kinks running with bcbio_vm and it looks like you ran into one. Just to start, by any chance, are you running out of space on your cluster on AWS? It sounds like the files might be getting truncated.
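A quick way to check this on the node, as a minimal sketch (the path and the 10G threshold are illustrative, not from the thread):

```shell
# Minimal free-space check for the work area (illustrative path and threshold)
workdir=/tmp
avail_kb=$(df -Pk "$workdir" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt 10485760 ]; then
    echo "low space in $workdir"
else
    echo "space ok in $workdir"
fi
```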

Best,

Rory



rronen commented Feb 3, 2015

Hi Rory,

Thanks for the quick reply! Nope, space doesn't seem to be a problem (120G still available).

Just to add some info, I am running on a "single-node cluster" configuration so the head-node is doing all the processing (instance type is c3.2xlarge with 200G "general purpose SSD" boot drive).

Thanks!
Roy


rronen commented Feb 3, 2015

Hi guys,

FYI, another restart and it is back to the first error. From bcbio-nextgen.log:

[M::mem_pestat] skip orientation RR as there are not enough pairs
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: H9213ADXX:2:2112:14772:75370
samblaster: Are you sure the input is sorted by read ids?

' returned non-zero exit status 1
Uncaught exception occurred
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "build/bdist.linux-x86_64/egg/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'docker attach --no-stdin 331f20dab05ba940e34245e4727fb3b95d01b7a19bd412becfdb04d26c0f730a

Edit: I initially attached an incorrect error message (a different error from my local install of bcbio-nextgen); it is now correct. Running the same config on the same input data with a locally installed bcbio-nextgen also errors out, but with a (seemingly) different error:

[2015-02-03T19:02Z] Aligning lane 1_2015-02-02_p3_1_B1_tumor_normal with bwa aligner
[2015-02-03T21:58Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/home/ubuntu/bcbio_datadir/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/ubuntu/bcbio_datadir/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /home/ubuntu/bcbio_tooldir/bin/samtools sort -n -o -l 1 -@ 8 -m 682M /home/ubuntu/p3_analysis/input/03-1-B1.merged.sorted.bam /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/align/p3_T_1B1/tx/tmpzyXvXM/1_2015-02-02_p3_1_B1_tumor_normal-sort-in1 | /home/ubuntu/bcbio_tooldir/bin/bedtools bamtofastq -i /dev/stdin -fq /dev/stdout -fq2 /dev/stdout | /home/ubuntu/bcbio_tooldir/bin/bwa mem -p -M -t 8  -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-02_p3_1_B1_tumor_normal\tSM:p3_T_1B1' -v 1 /home/ubuntu/bcbio_datadir/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa - | /home/ubuntu/bcbio_tooldir/bin/samblaster --splitterFile >(/home/ubuntu/bcbio_tooldir/bin/samtools view -S -u /dev/stdin | /home/ubuntu/bcbio_tooldir/bin/sambamba sort -t 8 -m 1G --tmpdir /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/tx/tmpK3_BqI/spl -o /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/align/p3_T_1B1/tx/tmpvYO0Sm/1_2015-02-02_p3_1_B1_tumor_normal-sort-sr.bam /dev/stdin) --discordantFile >(/home/ubuntu/bcbio_tooldir/bin/samtools view -S -u /dev/stdin | /home/ubuntu/bcbio_tooldir/bin/sambamba sort -t 8 -m 1G --tmpdir /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/tx/tmpK3_BqI/disc -o /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/align/p3_T_1B1/tx/tmpfETJOP/1_2015-02-02_p3_1_B1_tumor_normal-sort-disc.bam /dev/stdin) | /home/ubuntu/bcbio_tooldir/bin/samtools view -S -u /dev/stdin | /home/ubuntu/bcbio_tooldir/bin/sambamba sort -t 8 -m 1G --tmpdir /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/tx/tmpK3_BqI/full -o /home/ubuntu/p3_analysis/p3_1_B1_tumor_normal_fb_vs/work_test/align/p3_T_1B1/tx/tmpzyXvXM/1_2015-02-02_p3_1_B1_tumor_normal-sort.bam /dev/stdin
...
...
...
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
sambamba-sort: Unable to write to stream
samtools: writing to standard output failed: Broken pipe
samtools: error closing standard output: -1
samblaster: Unable to write to output file.

[fputs] Broken pipe
sambamba-sort: unable to write to stream
sambamba-sort: unable to write to stream
' returned non-zero exit status 1


lpantano commented Feb 4, 2015

Hi,

I would check /var/log/messages or other system logs; maybe there are errors there that can help.


chapmanb commented Feb 4, 2015

Roy;
Sorry about the issues. I'll try to work to reproduce to see if we can isolate an underlying cause. Would you be able to share your input configuration (no need to have the exact files, just the configuration) so I can try to replicate?

For alignment, we first pull the entire fastq file locally to avoid any gof3r streaming issues. I had hopes that we could stream the entire thing from S3 but found intermittent failures when doing that so shelved the idea. I can try to identify if there's something else going on.
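For reference, a hedged sketch of what pre-pulling looks like, versus the streaming form in the commands log later in this thread (bucket and key are placeholders matching the config below; the echo keeps this a dry run, remove it to actually download):

```shell
# Sketch: fetch the whole BAM up front rather than streaming it into bamtofastq.
# Bucket/key are placeholders; flags mirror the gof3r calls quoted in this thread.
bucket=my-bucket
key=my-folder/tumor.bam
echo "gof3r get --no-md5 -b $bucket -k $key > tumor.bam"
```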

Debugging exact runs with docker tools is trickier, as you can't immediately re-run commands. This is something where we'll have to work on additional tooling but don't have it right now.

Thanks much for trying this all out.


rronen commented Feb 4, 2015

Hi Brad,

Thanks, I appreciate the help in tracking this down. The configuration is below.

Perhaps because the inputs are BAM it still tries to stream using gof3r?

details:
- algorithm:
    aligner: bwa
    coverage_interval: exome
    indelcaller: false
    mark_duplicates: true
    phasing: false
    platform: illumina
    quality_format: Standard
    realign: true
    recalibrate: true
    variantcaller:
    - varscan
    - freebayes
  analysis: variant2
  description: p3_T_1B1
  files:
  - s3://my-bucket/my-folder/tumor.bam
  genome_build: GRCh37
  metadata:
    batch: batch1
    phenotype: tumor
    sex: female
- algorithm:
    aligner: bwa
    coverage_interval: exome
    indelcaller: false
    mark_duplicates: true
    phasing: false
    platform: illumina
    quality_format: Standard
    realign: true
    recalibrate: true
    variantcaller:
    - varscan
    - freebayes
  analysis: variant2
  description: p3_N
  files:
  - s3://my-bucket/my-folder/normal.bam
  genome_build: GRCh37
  metadata:
    batch: batch1
    phenotype: normal
    sex: female
fc_date: '2015-01-31'
fc_name: p3_1_B1_tumor_normal
upload:
  bucket: my-bucket
  dir: /home/ubuntu/run/bcbio-vm-test1/final  
  folder: my-folder/final
  method: s3


rronen commented Feb 5, 2015

update: I've restarted bcbio_vm a couple more times (each time deleting the contents of 'work') and it consistently fails* during alignment.

Before doing this, to rule out truncated files on s3, I re-copied the input BAMs to my s3 bucket and from there copied them over to a different AWS instance, where they successfully ran through bcbio_nextgen with (almost) the same configuration.

The reason I suspected s3-streaming/gof3r issue is that it seems the first two commands to run (from log/bcbio-nextgen-commands.log) are:

[2015-02-04T23:43Z] /usr/local/bin/bamtofastq filename=<(gof3r get --no-md5 -k roy-bcbio-test/03-1-B1.merged.sorted.bam -b d4c-work ) T=/mnt/work/align_prep/tx/tmp863NX8/03-1-B1.merged.sorted-1.fq-sort F=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/tx/tmp863NX8/03-1-B1.merged.sorted-1.fq.gz) F2=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/03-1-B1.merged.sorted-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=8589934592
[2015-02-05T00:23Z] /usr/local/bin/bamtofastq filename=<(gof3r get --no-md5 -k roy-bcbio-test/03-1-PBMC.merged.sorted.bam -b d4c-work ) T=/mnt/work/align_prep/tx/tmpsBye04/03-1-PBMC.merged.sorted-1.fq-sort F=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/tx/tmpsBye04/03-1-PBMC.merged.sorted-1.fq.gz) F2=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/03-1-PBMC.merged.sorted-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=8589934592

Would be great to figure this out and get bcbio_vm running. Thanks for all the help so far.

* it errors with one of

samblaster: Can't find first and/or second of pair in sam block of length 1 for id: XX:X:XX:XX:XX
samblaster: Are you sure the input is sorted by read ids?

or

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line XXXXX
[main_samview] truncated file.

chapmanb added a commit that referenced this issue Feb 5, 2015

chapmanb commented Feb 5, 2015

Roy;
Thanks for all the help debugging. It was trying to stream into the fastq preparation step, prior to alignment, but it looks like that is not going to fly. I pushed fixes to bcbio to pre-download these to avoid that problem. I'll test and work on rolling out a new Docker image. I've also got encrypted NFS working so will test that at the same time and should hopefully be able to give you both to work with at the same time. Thanks again.


rronen commented Feb 5, 2015

Hi Brad,

Thanks for fixing this issue, and for moving so fast on the encrypted NFS for bcbio_vm!

I'll give bcbio_vm another spin as soon as you roll out the new Docker image. Btw, there's nothing special needed for using the new image when it's up? (bcbio_vm automatically gets the latest?)

Cheers,
Roy


chapmanb commented Feb 8, 2015

Roy;
Thanks for all the patience on this. If you start a new cluster, it will now have the fix for S3, include encrypted NFS storage, and also have the updated sorting code from #747. I hope this will work cleanly out of the box for you. You don't need to change anything, as bcbio_vm will automatically get the latest code with the fixes. Prior to starting, run:

bcbio_vm.py aws config edit

to set the size of the encrypted NFS filesystem. It will then be available to all nodes at /encrypted, so run your analyses there and it should proceed as before. I'll also update the documentation to reflect this.

Hope this gets everything running for you. Please feel free to reopen if you run into any issues.
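Before kicking off a run, one way to sanity-check that the encrypted volume is actually there, as a minimal sketch (/encrypted is the path named above; the check itself is illustrative):

```shell
# Confirm the encrypted NFS volume is mounted before running analyses on it
if df -P /encrypted >/dev/null 2>&1; then
    echo "/encrypted is available"
else
    echo "/encrypted not mounted; re-run bcbio_vm.py aws config edit and restart the cluster"
fi
```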

chapmanb closed this as completed Feb 8, 2015

rronen commented Feb 8, 2015

Hi Brad,

Great news! Just gave this a first spin-up. A few things I've come across, from most to least important (to the best of my judgment):

  1. I set up (via bcbio_vm.py aws config edit) a 500G encrypted NFS. After starting the cluster, it looks like the 500G drive is mounted at /:
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      493G  2.4G  470G   1% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G   12K  7.4G   1% /dev
tmpfs           1.5G  348K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb        80G  4.0G   75G   6% /mnt

In my AWS console, under encryption, the drive shows up as 'Not Encrypted'. Is there any way it is actually encrypted but shows up differently in the AWS console?

  2. Running bcbio_vm.py aws cluster start starts the cluster okay, but then gives:
INFO:gc3.elasticluster:Cluster correctly configured.
Your cluster is ready!

Cluster name:     bcbio
Cluster template: bcbio
Default ssh to node: frontend001
- frontend nodes: 1

To login on the frontend node, run the command:

    elasticluster ssh bcbio

To upload or download files to the cluster, use the command:

    elasticluster sftp bcbio


PLAY [Docker installation] **************************************************** 

GATHERING FACTS *************************************************************** 

paramiko: The authenticity of host 'xx.x.xxx.xxx' can't be established.
The ssh-rsa key fingerprint is xxxxxxxxxxxxx.
Are you sure you want to continue connecting (yes/no)?
yes
fatal: [frontend001] => {'msg': 'FAILED: No existing session', 'failed': True}

TASK: [Install docker dependencies] ******************************************* 
FATAL: no hosts matched or all hosts have already failed -- aborting

Unreachable hosts: frontend001

Then, running bcbio_vm.py aws cluster bootstrap prompts the same ssh-rsa key question, but once I enter yes the installation runs cleanly.

(this would be great to figure out, as then things could be started programmatically/without intervention).

  3. After config edit, running bcbio_vm.py aws info gives:
bcbio@master:~/bcbio_runs$ bcbio_vm.py aws info
Available clusters: bcbio

Configuration for cluster 'bcbio':
 Frontend: c3.2xlarge with 500Gb NFS storage

AWS setup:
 OK: expected IAM user 'bcbio' exists.
 OK: expected security group 'bcbio_cluster_sg' exists.
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_vm.py", line 4, in <module>
    __import__('pkg_resources').run_script('bcbio-nextgen-vm==0.1.0a0', 'bcbio_vm.py')
  File "/home/bcbio/install/bcbio-vm/anaconda/lib/python2.7/site-packages/setuptools-12.0.5-py2.7.egg/pkg_resources/__init__.py", line 696, in run_script

  File "/home/bcbio/install/bcbio-vm/anaconda/lib/python2.7/site-packages/setuptools-12.0.5-py2.7.egg/pkg_resources/__init__.py", line 1621, in run_script

  File "/home/bcbio/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio_nextgen_vm-0.1.0a0-py2.7.egg/EGG-INFO/scripts/bcbio_vm.py", line 290, in <module>

  File "/home/bcbio/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio_nextgen_vm-0.1.0a0-py2.7.egg/bcbiovm/aws/info.py", line 28, in print_info
  File "/home/bcbio/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio_nextgen_vm-0.1.0a0-py2.7.egg/bcbiovm/aws/info.py", line 87, in _vpc_info
KeyError: 'Name'

This doesn't seem serious, as I'm able to run bcbio_vm.py aws cluster start after and it (sort of, see 2) works.


chapmanb commented Feb 9, 2015

Roy;
Ugh, sorry, this is my mistake. I was totally focused on the cluster and forgot to tell you to update your local bcbio_vm installation so it has the matching fixes. Sorry. On the machine you are starting the cluster from, run:

bcbio_vm.py upgrade --wrapper

This will fix problem 1 if you run the edit command again. You may also want to manually edit ~/.bcbio/elasticluster/config to have a smaller root_volume_size so you don't have two big EBS volumes attached. You should see the new one mounted in /encrypted and it will also say it's encrypted in the console. This upgrade also has a fix for problem 2.
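For illustration, the kind of manual edit meant here might look like the following (the section name and value are assumptions; only the root_volume_size key comes from this thread):

```ini
# Hypothetical fragment of ~/.bcbio/elasticluster/config -- section name and
# the value shown are assumptions; only root_volume_size is named in the thread
[cluster/bcbio]
root_volume_size = 30
```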

I'll look into problem 3 but I don't think it's critical to getting things running. Sorry for sending you down the wrong path and hope things work smoothly after the wrapper upgrade. Thanks again for the patience getting this going.


rronen commented Feb 9, 2015

Hi Brad,

Okay got it. I will try this and update if I come across anything else.

Thanks again!
Roy


rronen commented Feb 9, 2015

Hi Brad,

Okay, NFS encryption works! I was able to cleanly start up a cluster with encrypted NFS and almost get through a full run on a small test dataset.* The error I finally got seems to be in the annotation (post variant calling) part:

*unfortunately my test data is fastq, so I haven't yet tried out the new part dealing with bam->fastq.

[2015-02-09T07:51Z] tabix index batch1-GL000227.1_0_72126-raw-gatkann.vcf.gz
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (123, 0))

[the three lines above repeated six more times]

Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 216, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 42, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 37, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 81, in _run_toplevel
    for xs in pipeline.run(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 166, in run
    samples = genotype.parallel_variantcall_region(samples, run_parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/genotype.py", line 165, in parallel_variantcall_region
    "vrn_file", ["region", "sam_ref", "config"]))
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 32, in grouped_parallel_split_combine
    final_output = parallel_fn(parallel_name, split_args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 660, in __call__
    self.retrieve()
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 543, in retrieve
    raise exception_type(report)
TypeError: __init__() takes at least 3 arguments (2 given)
' returned non-zero exit status 1

Either way, now testing with full-sized data (from BAM).


chapmanb commented Feb 9, 2015

Roy;
Glad encryption is working now and sorry about the tabix issues. I'm unsure what is going on and didn't see any of this with my test run, so would love help debugging it so we can push a fix. This traceback doesn't give many details about the underlying issue, unfortunately. Would it be possible to re-run with a single core:

bcbio_vm.py run ../config/your_config.yaml

to see if that provides a more useful error message? It would also be helpful to know the last command that fails from log/bcbio-nextgen-commands.log. Thanks much for the help debugging.

chapmanb reopened this Feb 9, 2015

rronen commented Feb 9, 2015

Hi Brad,

Okay, here's a tail of log/bcbio-nextgen-commands.log:

ubuntu@frontend001:/encrypted/p3_fullprep_TEST/work$ tail log/bcbio-nextgen-commands.log 
[2015-02-09T07:51Z] /usr/local/bin/freebayes -f /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa --ploidy 2 --targets /mnt/work/freebayes/GL000227.1/batch1-GL000227.1_0_72126-raw-regions-nolcr.bed --min-repeat-entropy 1 --experimental-gls --min-alternate-fraction 0.1 --pooled-discrete --pooled-continuous --genotype-qualities --report-genotype-likelihood-max --allele-balance-priors-off /mnt/work/bamprep/p3_T_1B1/GL000227.1/1_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-GL000227.1_0_72126-prep.bam /mnt/work/bamprep/p3_N/GL000227.1/2_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-GL000227.1_0_72126-prep.bam | vcffilter -f 'QUAL > 5' -s | /usr/local/share/bcbio-nextgen/anaconda/bin/py -x 'bcbio.variation.freebayes.call_somatic(x)' | awk -F$'\t' -v OFS='\t' '{if ($0 !~ /^#/) gsub(/[KMRYSWBVHDX]/, "N", $4) } {print}' | vcfallelicprimitives --keep-geno | vcffixup - | vcfstreamsort | vt normalize -r /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -q - 2> /dev/null | vcfuniqalleles | bgzip -c > /mnt/work/freebayes/GL000227.1/tx/tmpJ3Kqgq/batch1-GL000227.1_0_72126-raw.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/X/tx/tmpVxVhUE/batch1-X_79828380_96396867-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/X/tx/tmpc9fOCm/batch1-X_130179104_155249817-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/GL000227.1/tx/tmpiZujcH/batch1-GL000227.1_0_72126-raw.vcf.gz
[2015-02-09T07:51Z] java -Xms454m -Xmx1590m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmptXxgFg -jar /mnt/work/inputs/jars/gatk/GenomeAnalysisTKLite.jar -T VariantAnnotator -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa --variant /mnt/work/freebayes/GL000227.1/batch1-GL000227.1_0_72126-raw.vcf.gz --out /mnt/work/freebayes/GL000227.1/tx/tmpx0a9cG/batch1-GL000227.1_0_72126-raw-gatkann.vcf.gz -L /mnt/work/freebayes/GL000227.1/batch1-GL000227.1_0_72126-raw.vcf.gz --dbsnp /mnt/work/inputs/data/genomes/GRCh37/variation/dbsnp_138.vcf.gz -I /mnt/work/bamprep/p3_N/GL000227.1/2_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-GL000227.1_0_72126-prep.bam -I /mnt/work/bamprep/p3_T_1B1/GL000227.1/1_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-GL000227.1_0_72126-prep.bam -A BaseQualityRankSumTest -A FisherStrand -A GCContent -A HaplotypeScore -A HomopolymerRun -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality --allow_potentially_misencoded_quality_scores -U ALL --read_filter BadCigar --read_filter NotPrimaryAlignment
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/X/tx/tmpa06_SR/batch1-X_155250141_155270560-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/MT/tx/tmpRocuyE/batch1-MT_0_9035-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/GL000232.1/tx/tmpdvoAZ9/batch1-GL000232.1_0_26578-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/GL000212.1/tx/tmpw1r2Tp/batch1-GL000212.1_0_2861-raw-gatkann.vcf.gz
[2015-02-09T07:51Z] /usr/local/bin/tabix -f -p vcf /mnt/work/freebayes/GL000227.1/tx/tmpqKLpS4/batch1-GL000227.1_0_72126-raw-gatkann.vcf.gz

I hope the following is more informative. After restarting the run with a single core I get:

[2015-02-09T18:22Z] tabix index batch1-freebayes-multiallelic-decompose.vcf.gz
[2015-02-09T18:22Z] Resource requests: ; memory: 1.00; cores: 1
[2015-02-09T18:22Z] Configuring 1 jobs to run, using 1 cores each with 1.00g of memory reserved for each job
[2015-02-09T18:22Z] Combine variant files
[2015-02-09T18:22Z] /usr/local/bin/gatk-framework: line 7: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
[2015-02-09T18:22Z] INFO  18:22:37,480 HelpFormatter - ---------------------------------------------------------------------------------
[2015-02-09T18:22Z] INFO  18:22:37,482 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-18-g478145d, Compiled 2014/09/19 01:24:53
[2015-02-09T18:22Z] INFO  18:22:37,482 HelpFormatter - Copyright (c) 2010 The Broad Institute
[2015-02-09T18:22Z] INFO  18:22:37,483 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
[2015-02-09T18:22Z] INFO  18:22:37,486 HelpFormatter - Program Args: -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T CombineVariants -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa --out /mnt/work/gemini/tx/tmpa29Tq8/batch1-freebayes-nomultiallelic.vcf.gz --variant:v0 /mnt/work/gemini/batch1-freebayes-biallelic.vcf.gz --variant:v1 /mnt/work/gemini/batch1-freebayes-multiallelic-decompose.vcf.gz --rod_priority_list v0,v1 --suppressCommandLineHeader --setKey null -L /mnt/work/bedprep/1_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-callable-callableblocks.bed --interval_set_rule INTERSECTION
[2015-02-09T18:22Z] INFO  18:22:37,490 HelpFormatter - Executing as ubuntu@frontend001 on Linux 3.13.0-45-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_75-b13.
[2015-02-09T18:22Z] INFO  18:22:37,491 HelpFormatter - Date/Time: 2015/02/09 18:22:37
[2015-02-09T18:22Z] INFO  18:22:37,491 HelpFormatter - ---------------------------------------------------------------------------------
[2015-02-09T18:22Z] INFO  18:22:37,491 HelpFormatter - ---------------------------------------------------------------------------------
[2015-02-09T18:22Z] INFO  18:22:37,557 GenomeAnalysisEngine - Strictness is SILENT
[2015-02-09T18:22Z] INFO  18:22:37,635 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
[2015-02-09T18:22Z] INFO  18:22:37,797 IntervalUtils - Processing 587300 bp from intervals
[2015-02-09T18:22Z] WARN  18:22:37,798 IndexDictionaryUtils - Track v0 doesn't have a sequence dictionary built in, skipping dictionary validation
[2015-02-09T18:22Z] WARN  18:22:37,798 IndexDictionaryUtils - Track v1 doesn't have a sequence dictionary built in, skipping dictionary validation
[2015-02-09T18:22Z] INFO  18:22:37,857 GenomeAnalysisEngine - Preparing for traversal
[2015-02-09T18:22Z] INFO  18:22:37,860 GenomeAnalysisEngine - Done preparing for traversal
[2015-02-09T18:22Z] INFO  18:22:37,860 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
[2015-02-09T18:22Z] INFO  18:22:37,860 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
[2015-02-09T18:22Z] INFO  18:22:37,861 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
[2015-02-09T18:22Z] ##### ERROR ------------------------------------------------------------------------------------------
[2015-02-09T18:22Z] ##### ERROR A USER ERROR has occurred (version 3.2-18-g478145d):
[2015-02-09T18:22Z] ##### ERROR
[2015-02-09T18:22Z] ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
[2015-02-09T18:22Z] ##### ERROR The error message below tells you what is the problem.
[2015-02-09T18:22Z] ##### ERROR
[2015-02-09T18:22Z] ##### ERROR If the problem is an invalid argument, please check the online documentation guide
[2015-02-09T18:22Z] ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
[2015-02-09T18:22Z] ##### ERROR
[2015-02-09T18:22Z] ##### ERROR Visit our website and forum for extensive documentation and answers to
[2015-02-09T18:22Z] ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
[2015-02-09T18:22Z] ##### ERROR
[2015-02-09T18:22Z] ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
[2015-02-09T18:22Z] ##### ERROR
[2015-02-09T18:22Z] ##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 164: The VCF specification requires a valid ID field, for input source: /mnt/work/gemini/batch1-freebayes-multiallelic-decompose.vcf.gz

With traceback:

##### ERROR ------------------------------------------------------------------------------------------
' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 216, in <module>
    main(**kwargs)
  File "/usr/local/bin/bcbio_nextgen.py", line 42, in main
    run_main(**kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 37, in run_main
    fc_dir, run_info_yaml)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 81, in _run_toplevel
    for xs in pipeline.run(config, run_info_yaml, parallel, dirs, samples):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 190, in run
    samples = population.prep_db_parallel(samples, run_parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/population.py", line 203, in prep_db_parallel
    output = parallel_fn("prep_gemini_db", to_process)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 84, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 149, in wrapper
    return apply(f, *args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 127, in prep_gemini_db
    return population.prep_gemini_db(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/population.py", line 27, in prep_gemini_db
    gemini_vcf = multiallelic.to_single(multisample_vcf, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/multiallelic.py", line 43, in to_single
    dd.get_ref_file(data), data["config"])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/variation/vcfutils.py", line 354, in combine_variant_files
    do.run(cmd, "Combine variant files")
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command '/usr/local/bin/gatk-framework -Xms500m -Xmx1750m -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -XX:+UseSerialGC -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T CombineVariants -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa --out /mnt/work/gemini/tx/tmpa29Tq8/batch1-freebayes-nomultiallelic.vcf.gz --variant:v0 /mnt/work/gemini/batch1-freebayes-biallelic.vcf.gz --variant:v1 /mnt/work/gemini/batch1-freebayes-multiallelic-decompose.vcf.gz --rod_priority_list v0,v1 --suppressCommandLineHeader --setKey null -L /mnt/work/bedprep/1_2015-02-08_p3_1_B1_tumor_normal_fullprep-sort-callable-callableblocks.bed --interval_set_rule INTERSECTION

Perhaps this is related to my adding the following to the configuration:

ensemble:
  numpass: 2

(previously I was not using this option)

EDIT 1: looking at gemini/batch1-freebayes-multiallelic-decompose.vcf.gz, line 164 is the first variant line, and the file looks like this (ignore the line-number prefixes):

163:#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  p3_N    p3_T_1B1
164:1   15274           A       G       .       .       OLD_MULTIALLELIC=1:15274:A/G/T  GT:PL:DP        1/.:100,100,100:115     1/.:100,100,100:95
165:1   15274           A       T       .       .       OLD_MULTIALLELIC=1:15274:A/G/T  GT:PL:DP        ./1:100,100,100:115     ./1:100,100,100:95
166:1   886817          C       CATTTT  .       .       OLD_MULTIALLELIC=1:886817:C/CATTTT/T    GT:PL:DP        1/.:100,100,67:19       1/.:100,100,100:12
167:1   886817          C       T       .       .       OLD_MULTIALLELIC=1:886817:C/CATTTT/T    GT:PL:DP        ./1:100,100,100:19      ./1:100,100,100:12

EDIT 2: this does not appear to be related to using the ensemble config option. I just tried an identical run with that option removed (still calling with multiple callers, but no ensemble) and it gives the same error.
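For context on the error above: the VCF spec requires a literal `.` in the ID column when there is no ID, but the decomposed output leaves the field empty, which strict parsers like GATK's CombineVariants reject. The real fix landed upstream in vt, but a minimal standalone workaround (hypothetical helper, not part of bcbio) would look like:

```python
# Sketch of a workaround (not part of bcbio): normalize empty ID fields
# in a VCF so strict parsers accept it. The VCF spec requires a literal
# "." for a missing ID; the decomposed output above leaves it empty.

def fix_vcf_id_column(line):
    """Replace an empty ID column (3rd field) with '.' on a data line."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2 and fields[2] == "":
        fields[2] = "."
    return "\t".join(fields)

# The malformed line 164 above has an empty ID column between POS and REF
bad = "1\t15274\t\tA\tG\t.\t.\tOLD_MULTIALLELIC=1:15274:A/G/T"
print(fix_vcf_id_column(bad).split("\t")[2])  # -> .
```

Applying this over the whole file (or an equivalent awk one-liner) would let CombineVariants proceed, though fixing vt itself is the proper solution.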

chapmanb added a commit to chapmanb/vt that referenced this issue Feb 10, 2015
…leic

The latest fixes to decompose to pass along the FORMAT DP field
(atks/vt@e51d17e)
cleared all of the other variant fields, losing information like IDs,
filters and quals and causing downstream errors (bcbio/bcbio-nextgen#744).
This avoids clearing these.
chapmanb added a commit to chapmanb/homebrew-cbl that referenced this issue Feb 10, 2015
…eq: provide stable download link instead of git pull (@mjafin)
@chapmanb
Member

Roy;
Apologies, this problem is due to an issue with vt. We've been working on better supporting multi-allelic variants when exporting to GEMINI, and the recent changes in vt cleared the ID column which causes this error. I sent a fix upstream and will prepare a new docker image with the fix right now. Sorry about the problem and thanks for your help debugging this.

@chapmanb
Member

Roy;
This should be good to go now. You can update the docker image on a running cluster with:

bcbio_vm.py aws cluster bootstrap

from the machine you're using to manage it. If you do that and then remove the old problematic files on the cluster (rm -rf /mnt/work/gemini/), it should hopefully work cleanly now. Sorry again about the issue and hope this fixes it for you.

@rronen
Author

rronen commented Feb 11, 2015

Hi Brad,

Great, thanks for the quick fix! I restarted the run and it finished smoothly. Now trying a full run on real-sized data -- will update if I run into anything else.

@rronen rronen closed this as completed Feb 11, 2015
@rronen rronen reopened this Feb 11, 2015
@rronen
Author

rronen commented Feb 11, 2015

Hi Brad,

Unfortunately, it looks like something fishy is still going on with BAM input. After bootstrapping to the latest docker version, I:

  1. Got through a full run on small (tumor/normal) FASTQ dataset (thanks to the fix above in vt), but
  2. Failed during alignment with a large (real life) dataset starting from BAMs.

It seems the 4 FASTQs are successfully prepped in align_prep, but during alignment I get:

[2015-02-11T02:24Z] samblaster: Unable to write to output file.
[2015-02-11T02:24Z] 
[2015-02-11T02:24Z] [fputs] Broken pipe
[2015-02-11T02:24Z] /bin/bash: line 1:   300 Exit 1                  /usr/local/bin/bwa mem -M -t 8 -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-10_p2_1_B1_tumor_normal_fullprep\tSM:p3_T_1
B1' -v 1 /mnt/work/inputs/data/genomes/GRCh37/bwa/GRCh37.fa /mnt/work/align_prep/Patient_2_T-ready-1.fq.gz /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz
[2015-02-11T02:24Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /usr/local/bin/bwa mem -M -t 8  -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-10_p2_1_B1_tumor_normal_fullprep\tSM:p3_T_1B1' -v 1 /mnt/work/inp
uts/data/genomes/GRCh37/bwa/GRCh37.fa /mnt/work/align_prep/Patient_2_T-ready-1.fq.gz /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz | /usr/local/bin/samblaster --splitterFile >(/usr/lo
cal/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-spl -o /mnt/work/align/p3_T_1B1/tx/tmpWTYhZZ/1_2015-
02-10_p2_1_B1_tumor_normal_fullprep-sort-sr.bam /dev/stdin) --discordantFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_t
umor_normal_fullprep-sort-sorttmp-disc -o /mnt/work/align/p3_T_1B1/tx/tmpPgLKaI/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-disc.bam /dev/stdin) | /usr/local/bin/samtools sort -@ 8
 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-full -o /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_norma
l_fullprep-sort.bam /dev/stdin

And also:

samblaster: Unable to write to output file.

[fputs] Broken pipe
/bin/bash: line 1:   300 Exit 1                  /usr/local/bin/bwa mem -M -t 8 -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-10_p2_1_B1_tumor_normal_fullprep\tSM:p3_T_1B1' -v 1 /mnt/work/inputs/data/genomes/GRCh37/bwa/GRCh37.fa /mnt/work/align_prep/Patient_2_T-ready-1.fq.gz /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz
       301                       | /usr/local/bin/samblaster --splitterFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-spl -o /mnt/work/align/p3_T_1B1/tx/tmpWTYhZZ/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sr.bam /dev/stdin) --discordantFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-disc -o /mnt/work/align/p3_T_1B1/tx/tmpPgLKaI/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-disc.bam /dev/stdin)
       302 Killed                  | /usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-full -o /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort.bam /dev/stdin
' returned non-zero exit status 137
Uncaught exception occurred
Traceback (most recent call last):
  File "/home/ubuntu/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.8.6a0-py2.7.egg/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks, log_stdout)
  File "/home/ubuntu/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.8.6a0-py2.7.egg/bcbio/provenance/do.py", line 95, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'docker attach --no-stdin ba105967e471cb1b7f819db56c138b95fbfc840e2717f0d8a750f122a52a957e

And

samblaster: Unable to write to output file.

[fputs] Broken pipe
/bin/bash: line 1:   300 Exit 1                  /usr/local/bin/bwa mem -M -t 8 -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-10_p2_1_B1_tumor_normal_fullprep\tSM:p3_T_1B1' -v 1 /mnt/work/inputs/data/genomes/GRCh37/bwa/GRCh37.fa /mnt/work/align_prep/Patient_2_T-ready-1.fq.gz /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz
       301                       | /usr/local/bin/samblaster --splitterFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-spl -o /mnt/work/align/p3_T_1B1/tx/tmpWTYhZZ/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sr.bam /dev/stdin) --discordantFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-disc -o /mnt/work/align/p3_T_1B1/tx/tmpPgLKaI/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-disc.bam /dev/stdin)
       302 Killed                  | /usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-full -o /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort.bam /dev/stdin
' returned non-zero exit status 137
' returned non-zero exit status 1

(guessing these are multiple threads failing on the same problem at different times)
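As an aside on reading these logs (my interpretation, not from the bcbio docs): a shell reports exit status 128 + N when a command is killed by signal N, so the 137 above corresponds to SIGKILL (9), which on a memory-starved node usually means the kernel OOM killer stepped in. A quick check:

```python
# Decoding the exit status in the logs above: shells report 128 + N
# when a process dies from signal N, so 137 means SIGKILL -- on Linux,
# typically the kernel OOM killer when memory runs out.
import signal

status = 137
print(status - 128 == signal.SIGKILL)  # -> True
```

This matches the "Killed" annotation on the samtools sort stage of the pipe, pointing at memory pressure rather than bad input.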

The contents of bcbio-nextgen-commands.log:

docker attach --no-stdin ba105967e471cb1b7f819db56c138b95fbfc840e2717f0d8a750f122a52a957e
[2015-02-11T01:03Z] gof3r get --no-md5 -k prepped/GRCh37/GRCh37-seq.tar.gz -b biodata | pigz -d -c | tar -xvp
[2015-02-11T01:05Z] gof3r get --no-md5 -k prepped/GRCh37/GRCh37-bwa.tar.gz -b biodata | pigz -d -c | tar -xvp
[2015-02-11T01:06Z] /usr/local/bin/bamtofastq filename=/mnt/work/inputs/work_data4cure/p2/Patient_2_T/Patient_2_T-ready.bam T=/mnt/work/align_prep/tx/tmpae6IBy/Patient_2_T-ready-1.fq-sort F=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/tx/tmpae6IBy/Patient_2_T-ready-1.fq.gz) F2=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=8589934592
[2015-02-11T01:35Z] /usr/local/bin/bamtofastq filename=/mnt/work/inputs/work_data4cure/p2/Patient_2_N/Patient_2_N-ready.bam T=/mnt/work/align_prep/tx/tmpBp0qLl/Patient_2_N-ready-1.fq-sort F=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/tx/tmpBp0qLl/Patient_2_N-ready-1.fq.gz) F2=>(/usr/local/bin/bgzip -c /dev/stdin > /mnt/work/align_prep/Patient_2_N-ready-2.fq.gz) S=/dev/null O=/dev/null O2=/dev/null collate=1 colsbs=8589934592
[2015-02-11T02:08Z] /usr/local/bin/bwa mem -M -t 8  -R '@RG\tID:1\tPL:illumina\tPU:1_2015-02-10_p2_1_B1_tumor_normal_fullprep\tSM:p3_T_1B1' -v 1 /mnt/work/inputs/data/genomes/GRCh37/bwa/GRCh37.fa /mnt/work/align_prep/Patient_2_T-ready-1.fq.gz /mnt/work/align_prep/Patient_2_T-ready-2.fq.gz | /usr/local/bin/samblaster --splitterFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-spl -o /mnt/work/align/p3_T_1B1/tx/tmpWTYhZZ/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sr.bam /dev/stdin) --discordantFile >(/usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-disc -o /mnt/work/align/p3_T_1B1/tx/tmpPgLKaI/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-disc.bam /dev/stdin) | /usr/local/bin/samtools sort -@ 8 -m 875M -T /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort-sorttmp-full -o /mnt/work/align/p3_T_1B1/tx/tmpzQ8dnh/1_2015-02-10_p2_1_B1_tumor_normal_fullprep-sort.bam /dev/stdin

@chapmanb
Member

Roy;
Sorry about the issue. I can reproduce similar problems and am working on a fix now. It looks like samtools sort is more aggressive with memory usage than the previous sambamba sort, so I'm working on getting a memory setting that will avoid issues with larger inputs. I'll update, push a new docker image and report here when we have it working. Thanks for the debugging info and patience.
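To put rough numbers on it (my arithmetic, not from the bcbio source): samtools sort's -m flag is per sort thread, and the alignment pipe in the logs runs three sorts concurrently (full, splitter, and discordant outputs), each with -@ 8 -m 875M, so the requested buffers alone add up quickly on one node:

```python
# Rough memory budget for the failing pipeline: samtools sort -m is
# per thread, and three sorts (full/splitter/discordant) run at once.
threads = 8            # from -@ 8 in the logged command
per_thread_mb = 875    # from -m 875M
concurrent_sorts = 3   # full, splitter and discordant output streams

per_sort_mb = threads * per_thread_mb          # buffer request per sort
total_mb = per_sort_mb * concurrent_sorts      # across the whole pipe
print(per_sort_mb, total_mb)  # -> 7000 21000
```

Roughly 21 GB of sort buffers, before counting bwa and samblaster, is consistent with the OOM-style failures above; any additional leak on top of that would push a long run over the edge.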

chapmanb added a commit that referenced this issue Feb 13, 2015
…sues seen with samtools on long running process -- the large samtools sort seems to leak memory and eventually crash or trigger shutdown from schedulers. #744
@chapmanb
Member

Roy;
I pushed an updated docker instance which I hope will resolve the issues you found during alignment. You should just need to start a new cluster, no other updates needed.

In my tests on larger input files, samtools sort seemed to have some sort of memory leak and would increase usage over time until it eventually caused either SLURM to complain or issues with the system. I tried the new samtools version (1.2) and other adjustments, but all would hit this issue after ~24 hours of churning on a big 100Gb input. I swapped part of the sorting back to sambamba sort and this resolves the issue, so I hope it will avoid problems for you. I'm going to take a look at integrating samtools rocksort to improve this more:

http://devblog.dnanexus.com/faster-bam-sorting-with-samtools-and-rocksdb/

but I hope the update will get you there. Another trick to avoid issues is to use align_split_size, which will chunk up the alignment to avoid long runs of any individual part:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#alignment
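A hedged sketch of what that might look like in the sample YAML (the surrounding keys and the value are illustrative only; see the linked docs for the authoritative layout):

```yaml
# Illustrative fragment only -- consult the linked configuration docs.
# align_split_size splits the input reads into chunks so each
# bwa + sort invocation stays short-lived.
details:
  - analysis: variant2
    algorithm:
      aligner: bwa
      align_split_size: 5000000
```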

Hope this cleans everything up for you. If you still have issues, please feel free to re-open and we'll work more on it. Thanks again for the patience and help debugging.

@rronen
Copy link
Author

rronen commented Feb 16, 2015

Hi Brad,

Great! All your help with getting this going is much appreciated. I'll give it a spin shortly and report back here if I run into anything further. (sorting with rocksdb looks promising!)

Thanks again,
Roy
