
Cannoli for BWA using Singularity #201

Closed

tahashmi opened this issue Sep 26, 2019 · 8 comments

@tahashmi

Hi,

I am facing a problem running Cannoli for BWA on a single cluster node. Please see the attached log file for the output. Thanks.

I have successfully converted two paired-end FASTQ files into one interleaved FASTQ file with this command:

singularity exec -B ~/bulk/ bionic.simg ./cannoli/bin/cannoli-submit interleaveFastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq

Now I want to run bwa, but I run into a problem with this command:

singularity exec -B ~/bulk/ bionic.simg /cannoli/bin/cannoli-submit --driver-memory 5G --executor-memory 30G -- bwa /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_0259.adam sample -index /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta -sequence_dictionary /home/tahmad/bulk/bio_data/gnome/hg19/hg19.dict -force_load_ifastq -fragments -add_files

singularity.txt

Cannoli and BWA-MEM are installed locally in my Singularity container (bionic.simg).

@tahashmi
Author

It was due to assigning too little memory to the --driver-memory and --executor-memory options. Resolved.

@heuermh
Member

heuermh commented Sep 30, 2019

Hello @tahashmi, thank you for submitting this issue. Sorry for not replying last week.

Curious, what is your target platform for running this analysis? Running bwa via Cannoli on a single node will not provide any performance benefits, unless perhaps you are starting from Fragments in Parquet format.

@heuermh heuermh added this to the 0.8.0 milestone Sep 30, 2019
@tahashmi
Author

Thank you @heuermh for your reply.

I have to run it on a cluster with more than one node.

I have a few more questions, if you could kindly answer them.

  1. Exactly how much memory should be allocated to the --driver-memory and --executor-memory Spark options if the FASTQ reads file is 60 GB?

  2. How can I run such a variant calling pipeline through Slurm 'sbatch' on a multi-node cluster, i.e. nodes=10 with 28 cores per node? Or can I only run this through cannoli-shell by putting all commands in a Scala file and passing that file to cannoli-shell (like: /cannoli/bin/cannoli-shell --driver-memory 20G --executor-memory 20G -i /home/tahmad/bulk_mnt/images/adam.scala)?

  3. If I have nodes=10 and cores/node=28, how many --ntasks and --cpus-per-task should be assigned (Ref)?

Thanks.

@tahashmi tahashmi reopened this Sep 30, 2019
@heuermh
Member

heuermh commented Sep 30, 2019

Exactly how much memory should be allocated to the --driver-memory and --executor-memory Spark options if the FASTQ reads file is 60 GB?

Sorry, that is somewhat of an art, not a science, especially if Cannoli and external processes are involved. You need to reserve enough RAM for Spark to run and enough RAM for bwa to run. Also of interest will be the number of partitions for the input data set.
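
Not part of the original comment, just a rough sketch of how one might check and adjust the partition count from cannoli-shell before alignment; the repartition target below (280 = 10 nodes x 28 cores) is only an illustrative assumption.

// sketch (assumption, not from the thread): inspect and adjust partitions in cannoli-shell,
// where `sc` is the SparkContext the shell provides
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Fragment

val fragments = sc.loadPairedFastqAsFragments(
  "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq",
  "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")

// how many partitions did Spark create for the input?
println(fragments.rdd.getNumPartitions)

// if that is too few to keep all executor cores busy, repartition;
// 280 (10 nodes x 28 cores) is an example only
val repartition: RDD[Fragment] => RDD[Fragment] = _.repartition(280)
val repartitioned = fragments.transform(repartition)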

On a cluster with HDFS shared disk, the alignment step will be much faster if the data are in Parquet format. Parquet format has the additional benefit of being much smaller on disk. You can do this by writing out Parquet Fragments instead of interleaved FASTQ format, e.g. in cannoli-shell:

$ cannoli-shell \
  $SPARK_ARGS
...
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")
fragments: org.bdgenomics.adam.rdd.fragment.FragmentDataset = RDDBoundFragmentDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> fragments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam")

Then you can read from the Parquet Fragments in the bwa step:

$ cannoli-submit \
  $SPARK_ARGS \
  -- \
  bwa \
  -index /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta \
  -sequence_dictionary /home/tahmad/bulk/bio_data/gnome/hg19/hg19.dict \
  /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam \
  /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.alignments.adam \
  sample

Or, if more convenient, skip writing interleaved FASTQ/fragments to disk entirely:

$ cannoli-shell \
  $SPARK_ARGS
...
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")
fragments: org.bdgenomics.adam.rdd.fragment.FragmentDataset = RDDBoundFragmentDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> import org.bdgenomics.cannoli.BwaArgs
import org.bdgenomics.cannoli.BwaArgs

scala> val args = new BwaArgs()
args: org.bdgenomics.cannoli.BwaArgs = org.bdgenomics.cannoli.BwaArgs@79ec9d81

scala> args.indexPath = "/home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta"
args.indexPath: String = /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta

scala> args.sample = "sample"
args.sample: String = sample

scala> import org.bdgenomics.cannoli.Cannoli._
import org.bdgenomics.cannoli.Cannoli._

scala> val alignments = fragments.alignWithBwa(args)
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordDataset = RDDBoundAlignmentRecordDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> alignments.dataset.show
+-------------...
|referenceName|start|originalStart...
|            1| 7811|         null...
|            1| 7847|         null...
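
Not from the original comment; as a possible continuation of the session above, the in-memory alignments could then be written out, e.g. as Parquet or as a single coordinate-sorted BAM. The output paths below are placeholders.

// sketch only, continuing the session above: persist the alignments as Parquet
alignments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.alignments.adam")

// or sort by reference position and write a single BAM file
alignments
  .sortReadsByReferencePosition()
  .saveAsSam("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.bam", asSingleFile = true, isSorted = true)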

How can I run such a variant calling pipeline through Slurm 'sbatch' on a multi-node cluster, i.e. nodes=10 and each node has 28 cores?

You may want to spend some time with your cluster administrator discussing the best way to run Apache Spark jobs in general first. Hopefully Spark has already been installed as a Slurm module and they might have further advice on how to configure Spark executor nodes and how to take advantage of any shared disk the cluster might have.

For either cannoli-submit or cannoli-shell, you can specify the Apache Spark master node via the --master spark://hostname:7077 command line argument.

In your example above, it appears you are running Spark in single-node mode within a Singularity container. Cannoli supports calling out to external applications (e.g. bwa) via Singularity but running Spark and Cannoli from within a container may be unnecessarily complicating things.

You are also using the -add_files argument, which uses Spark to distribute the index files to local disk on each of the executor nodes. If the index files are on shared disk, you won't need to do this.

Please keep asking questions, and I'll do my best to help!

@tahashmi
Author

tahashmi commented Oct 1, 2019

Thanks again.

First, I'll try running it as per your recommendations, and then I may ask more questions if any come up :-)

@tahashmi tahashmi closed this as completed Oct 1, 2019
@tahashmi tahashmi reopened this Oct 14, 2020
@tahashmi
Author

Hi @heuermh, I have a question regarding running the sorting and duplicate-marking (mdup) stages with BWA.

How can the whole pipeline below be run on a Spark cluster with the submit option? In which file format should I save it, and how do I pass that file to submit?

I am using Slurm on my cluster, and I want to use the $ADAM_HOME/bin/adam-submit --master spark://hostnamefromslurmdotout:7077 command to submit my job.

Thanks.

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.cannoli.cli._
import org.bdgenomics.cannoli.cli.Cannoli._

val sample = "sample"
val reference = "ref.fa"

// load the paired-end FASTQ files as fragments
val reads = sc.loadPairedFastqAsFragments(sample + "_1.fq", sample + "_2.fq")

// configure bwa: sample name and path to the bwa index
val bwaArgs = new BwaArgs()
bwaArgs.sample = sample
bwaArgs.indexPath = reference

// align, sort by reference position, and mark duplicates
val alignments = reads.alignWithBwa(bwaArgs)
val sorted = alignments.sortReadsByReferencePositionAndIndex()
val markdup = sorted.markDuplicates()

// call variants with freebayes against the same reference
val freebayesArgs = new FreebayesArgs()
freebayesArgs.referencePath = reference

val variantContexts = markdup.callVariantsWithFreebayes(freebayesArgs)

// write out the variant calls as bgzipped VCF
variantContexts.saveAsVcf(sample + ".freebayes.vcf.bgzf")

@heuermh
Member

heuermh commented Oct 22, 2020

You could save it to a file, e.g. workflow.scala, and run it with adam-shell ... -i workflow.scala. Alternatively, you could run adam-shell and use paste mode to run it interactively.

adam-shell is a wrapper around Spark shell, so the docs here might be helpful:
https://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
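
One practical note (my assumption, not something covered above): spark-shell tends to stay at the interactive prompt after executing a script passed with -i, so for unattended runs under Slurm the script can terminate the shell itself once the output is written.

// hypothetical tail of workflow.scala when run via adam-shell ... -i workflow.scala:
// write the final VCF, then exit the shell so the batch job completes
variantContexts.saveAsVcf(sample + ".freebayes.vcf.bgzf")
System.exit(0)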

Hope this helps!

@heuermh heuermh modified the milestones: 0.8.0, 0.11.0 Oct 22, 2020
@tahashmi
Author

Hi @heuermh, thanks!
