File streaming problems from s3? #744
Hi Roy, Thanks for posting about the issue. We’ve been working out the kinks running with bcbio_vm and it looks like you ran into one. Just to start, by any chance, are you running out of space on your cluster on AWS? It sounds like the files might be getting truncated. Best, Rory
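A quick, generic way to check that on the head-node (nothing bcbio-specific; the second path is a placeholder, adjust it to your setup):

```bash
# Free space on the boot volume and on wherever the 'work' directory lives.
df -h /
df -h /path/to/work
```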
Hi Rory, Thanks for the quick reply! Nope, space doesn't seem to be a problem (120G still available). Just to add some info, I am running on a "single-node cluster" configuration so the head-node is doing all the processing (instance type is c3.2xlarge with 200G "general purpose SSD" boot drive). Thanks!
Hi guys, FYI, another restart and it is back to the first error. From bcbio-nextgen.log:
Edit: I initially attached an incorrect error message (a different error from my local install of bcbio-nextgen); now it is correct. Running the same config on the same input data with locally installed bcbio-nextgen also errors out, but with a (seemingly) different error:
Hi, I would check; maybe there are some errors there that can help.
Roy; For alignment, we first pull the entire fastq file locally to avoid any gof3r streaming issues. I had hopes that we could stream the entire thing from S3 but found intermittent failures when doing that so shelved the idea. I can try to identify if there's something else going on. Debugging exact runs with docker tools is trickier, as you can't immediately re-run commands. This is something where we'll have to work on additional tooling but don't have it right now. Thanks much for trying this all out.
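Purely as an illustration of the difference (these are not the exact commands bcbio generates, and the bucket, key, and file names are hypothetical), streaming versus staging with gof3r looks roughly like this:

```bash
# Streaming: gof3r writes the object to stdout, so any transient S3 hiccup
# can truncate the stream that the downstream tool sees.
gof3r get -b my-bucket -k inputs/sample1.fastq.gz | zcat | head

# Staging: download the whole object first, then run the aligner on the
# complete local copy, avoiding intermittent streaming failures.
gof3r get -b my-bucket -k inputs/sample1.fastq.gz -p sample1.fastq.gz
bwa mem -t 8 genome.fa sample1.fastq.gz > sample1.sam
```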
Hi Brad, Thanks, I appreciate the help in tracking this down. The configuration is below. Perhaps because the inputs are BAM it still tries to stream using gof3r?
update: I've restarted bcbio_vm a couple more times (each time deleting the contents of 'work') and it consistently fails* during alignment. Before doing this, to rule out truncated files on s3, I re-copied the BAM (input) files to my s3 bucket and from there copied them over to a different AWS instance, where they successfully ran through bcbio_nextgen with (almost) the same configuration. The reason I suspected an s3-streaming/gof3r issue is that the first two commands to run (from log/bcbio-nextgen-commands.log) appear to be:
It would be great to figure this out and get bcbio_vm running. Thanks for all the help so far. * it errors with one of
or
… avoid issues with streaming to bamtofastq. #744
Roy;
Hi Brad, Thanks for fixing this issue, and for moving so fast on the encrypted NFS for bcbio_vm! I'll give bcbio_vm another spin as soon as you roll out the new Docker image. Btw, there's nothing special needed for using the new image when it's up? (bcbio_vm automatically gets the latest?) Cheers,
Roy;
to set the size of the encrypted NFS filesystem. It will then be available to all nodes. Hope this gets everything running for you. Please feel free to reopen if you run into any issues.
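For readers following along, a rough sketch of the kind of command sequence involved, assuming the bcbio_vm AWS interface from the bcbio docs (the exact subcommands, prompts, and flags may differ from what is described above):

```bash
# Hypothetical workflow sketch; see the bcbio AWS documentation for the
# authoritative commands and options.
bcbio_vm.py aws config edit     # interactive config editing, including the encrypted NFS size
bcbio_vm.py aws cluster start   # bring the cluster up with the shared encrypted filesystem
bcbio_vm.py aws cluster ssh     # log in to the head-node to launch the run
```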
Hi Brad, Great news! Just gave this a first spin-up. A few things I've come across, from most to least important (to the best of my judgment):
1. In my AWS console, under encryption, the drive shows up as 'Not Encrypted'. Is there any way it is actually encrypted but shows up differently in the AWS console?
2. Then, running (this would be great to figure out, as then things could be started programmatically/without intervention).
3. This doesn't seem serious, as I'm able to run
Roy;
This will fix problem 1 if you run the edit command again. You may also want to manually edit it. I'll look into problem 3, but I don't think it's critical to getting things running. Sorry for sending you down the wrong path; hope things work smoothly after the wrapper upgrade. Thanks again for the patience getting this going.
Hi Brad, Okay, got it. I will try this and update if I come across anything else. Thanks again!
Hi Brad, Okay, NFS encryption works! I was able to cleanly start up a cluster with encrypted NFS and almost get through a full run on a small test dataset. The error I finally got seems to be in the annotation (post variant calling) part: *Unfortunately my test data is FASTQ, so I haven't yet tried out the new part dealing with bam->fastq.
Either way, now testing with full-sized data (from BAM).
Roy;
to see if that provides a more useful error message? It would also be helpful to know the last command that fails, from the commands log.
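A small sketch of how to pull that information out of the run's log directory; the log file names are the standard bcbio ones mentioned earlier in this thread, and the `work` path is a placeholder:

```bash
# Placeholder paths; run from wherever the 'work' directory lives.
tail -n 100 work/log/bcbio-nextgen.log            # high-level progress and error messages
tail -n 20  work/log/bcbio-nextgen-commands.log   # the last command lines bcbio launched
```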
Hi Brad, Okay, here's a tail of
I hope the following is more informative. After restarting the run with a single core I get:
With traceback:
Perhaps this is related to me adding to the configuration:
(previously I was not using this option). EDIT 1: looking at
EDIT 2: this does not appear related to using the
…leic The latest fixes to decompose to pass along the FORMAT DP field (atks/vt@e51d17e) cleared all of the other variant fields, losing information like IDs, filters and quals and causing downstream errors (bcbio/bcbio-nextgen#744). This avoids clearing these.
…eq: provide stable download link instead of git pull (@mjafin)
Roy;
Roy;
from the machine you're using to manage it. If you do that, then remove the old problematic files on the cluster.
Hi Brad, Great, thanks for the quick fix! I restarted the run and it finished smoothly. Now trying a full run on real-sized data -- will update if I run into anything else.
Hi Brad, Unfortunately, it looks like something fishy is still going on with BAM input. After bootstrapping to the latest Docker version, I:
It seems the 4 FASTQs are successfully prep'ed in
And also:
And
(guessing these are multiple threads failing on the same problem at different times) The contents of
Roy;
…sues seen with samtools on long running process -- the large samtools sort seems to leak memory and eventually crash or trigger shutdown from schedulers. #744
Roy; In my tests on larger input files, samtools sort seemed to have some sort of memory leak and would increase usage over time, eventually causing either SLURM to complain or issues with the system. I tried the new samtools version (1.2) and other adjustments, but all would have this issue after ~24 hours of churning on a big 100Gb input. I swapped part of the sorting back to sambamba sort and this resolves the issue, so I hope it will avoid problems for you. I'm going to take a look at integrating samtools rocksort to improve this more: http://devblog.dnanexus.com/faster-bam-sorting-with-samtools-and-rocksdb/ but I hope the update will get you there. Another trick to avoid issues is to use the alignment options described at https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#alignment. Hope this cleans everything up for you. If you still have issues, please feel free to re-open and we'll work more on it. Thanks again for the patience and help debugging.
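As a hedged illustration of the swap described above (not the exact bcbio-generated command lines; thread counts, memory limits, and file names are placeholders), the two sort invocations look roughly like this:

```bash
# samtools sort (1.x syntax): -m caps per-thread memory, but on very large
# inputs overall usage still crept up in the tests described above.
samtools sort -@ 8 -m 2G -o sample1.sorted.bam sample1.bam

# sambamba sort: the tool bcbio switched part of the sorting back to,
# with -m limiting total memory use.
sambamba sort -t 8 -m 8GB -o sample1.sorted.bam sample1.bam
```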
Hi Brad, Great! All your help with getting this going is much appreciated. I'll give it a spin shortly and report back here if I run into anything further. (sorting with rocksdb looks promising!) Thanks again,
Hi Brad/Rory/John,
I've been trying out bcbio_vm on AWS (tumor/normal calling on a couple of BAMs from an s3 bucket). I was able to set up and start a cluster, and ssh into the head-node to start the run. However, I am getting different errors in the alignment step. First it was a missing paired read ("samblaster: Can't find first and/or second of pair in sam block of length 1 for id: X:XXX:XX:X:XX samblaster: Are you sure the input is sorted by read ids?"); after deleting the 'work' dir and restarting the run I got a different error ("pysam: SEQ and QUAL are of different lengths").
Running the same input BAMs & config on a local bcbio-nextgen full install (not bcbio_vm), I don't get any of these errors. I suspect something may be going wrong related to streaming the input files from s3 (perhaps related to gof3r?), although this is a total guess.
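As a generic sanity check (not something prescribed by bcbio; the file name is a placeholder), a quick way to rule out a truncated or corrupted local copy of an input BAM:

```bash
# quickcheck fails fast if the BAM is truncated or missing its EOF marker
# (requires a reasonably recent samtools).
samtools quickcheck -v sample1.bam && echo "BAM looks intact"

# flagstat reads the whole file, so it also surfaces mid-file corruption.
samtools flagstat sample1.bam
```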
Just wanted to report this and see if you've run into anything similar, and if so what you'd recommend.
(Unrelated question: is there a simple way to directly re-run certain commands with the docker setup? This would be pretty straightforward on a full/local install since all the tools & data are accessible, wondering if there's an equivalent "debug procedure" with docker -- apologies in advance if this is already answered somewhere in the docs)
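On the debugging question, a rough sketch of manually re-running a failing command inside the container; the image name, mount paths, and shell invocation are placeholders and may not match the exact bcbio_vm setup:

```bash
# Hypothetical example: mount the work directory and the bcbio data/tools
# into an interactive container, then re-run the failing command by hand.
docker run -it --rm \
    -v /path/to/work:/path/to/work \
    -v /usr/local/share/bcbio:/usr/local/share/bcbio \
    bcbio/bcbio /bin/bash
# inside the container, paste the failing line from log/bcbio-nextgen-commands.log
```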