Cannoli for BWA using Singularity #201
Comments
It was due to assigning too little memory with the --driver-memory and --executor-memory options. Resolved.
Hello @tahashmi, thank you for submitting this issue. Sorry for not replying last week. Curious, what is your target platform for running this analysis? Running …
Thank you @heuermh for your reply. I have to run it on a cluster with more than one node. I have a few more questions; I would appreciate it if you could kindly answer them.

Thanks.
Sorry, that is somewhat of an art, not a science, especially when Cannoli and external processes are involved. You need to reserve enough RAM for Spark to run and enough RAM for the external bwa processes.

On a cluster with HDFS shared disk, the alignment step will be much faster if the data are in Parquet format. Parquet format has the additional benefit of being much smaller on disk. You can do this by writing out Parquet fragments from cannoli-shell:

$ cannoli-shell \
$SPARK_ARGS
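# $SPARK_ARGS above is a placeholder for your Spark configuration. On a
# multi-node cluster it might look something like the following (hypothetical
# values; the master URL, memory sizes, and core counts must match your cluster):
#
#   --master spark://<master-host>:7077 \
#   --driver-memory 16G \
#   --executor-memory 32G --executor-cores 8 --total-executor-cores 32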
...
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._
scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")
fragments: org.bdgenomics.adam.rdd.fragment.FragmentDataset = RDDBoundFragmentDataset with 0 reference sequences, 0 read groups, and 0 processing steps
scala> fragments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam")

Then you can read from Parquet:
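For example, something along these lines should work (a sketch; it assumes the fragments were saved to the path above, and the REPL output is omitted):

scala> val fragments = sc.loadFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam")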
Or, if more convenient, skip writing interleaved FASTQ/fragments to disk at all:
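A rough sketch of what that might look like in cannoli-shell. The Cannoli Scala API names used here (the Cannoli._ implicits, BwaArgs, its fields, and alignWithBwa) are assumptions that may differ between Cannoli versions, so check the Cannoli README for your release:

scala> import org.bdgenomics.adam.rdd.ADAMContext._
scala> import org.bdgenomics.cannoli.Cannoli._

scala> // load the paired-end FASTQ files directly as fragments, with no interleaved FASTQ on disk
scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")

scala> // configure and run bwa through Cannoli (class and field names assumed)
scala> val args = new BwaArgs()
scala> args.indexPath = "/home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta"
scala> args.sample = "sample"
scala> val alignments = fragments.alignWithBwa(args)

scala> // write the alignments out as Parquet
scala> alignments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.alignments.adam")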
You may want to spend some time with your cluster administrator discussing the best way to run Apache Spark jobs in general first. Hopefully Spark has already been installed as a Slurm module, and they may have further advice on how to configure Spark executor nodes and how to take advantage of any shared disk the cluster might have. For either …

In the example from your original post, it appears you are running Spark in single-node mode within a Singularity container. Cannoli supports calling out to external applications (e.g. bwa) …

You are also using the …

Please keep asking questions, and I'll do my best to help!
Thanks again. First, I'll try to run it per your recommendations, and then I may ask more questions if anything comes up :-)
Hi @heuermh, I have a question regarding running … How can the whole pipeline below be run on a Spark cluster with …? I am using Slurm on my cluster, and want to use …

Thanks.
You could save it to a file, e.g. …
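As a rough sketch of that idea (the file name align.scala is just an example; cannoli-shell passes its arguments through to spark-shell, which can preload a Scala script with -i):

$ cat align.scala
import org.bdgenomics.adam.rdd.ADAMContext._
val fragments = sc.loadFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam")
// ... remaining pipeline steps ...

$ cannoli-shell \
  $SPARK_ARGS \
  -i align.scala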
Hope this helps!
Hi @heuermh, thanks.
Hi,

I am facing a problem running Cannoli for BWA on a single cluster node. Please see my log file in the attachment for the output. Thanks.

I have successfully converted two paired-end FASTQ files to one interleaved FASTQ file with this command:
singularity exec -B ~/bulk/ bionic.simg ./cannoli/bin/cannoli-submit interleaveFastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq
Now I want to run bwa with the following command, but it fails:
singularity exec -B ~/bulk/ bionic.simg /cannoli/bin/cannoli-submit --driver-memory 5G --executor-memory 30G -- bwa /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_0259.adam sample -index /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta -sequence_dictionary /home/tahmad/bulk/bio_data/gnome/hg19/hg19.dict -force_load_ifastq -fragments -add_files
singularity.txt
Cannoli and BWA-MEM are installed locally in my Singularity container (bionic.simg).