
Cannoli for BWA using Singularity #201

Closed

tahashmi opened this issue Sep 26, 2019 · 8 comments

@tahashmi

Hi,

I am facing a problem running Cannoli for BWA on a single cluster node. Please see the attached log file for the output. Thanks.

I have successfully converted two paired-end FASTQ files into one interleaved FASTQ file with this command:

singularity exec -B ~/bulk/ bionic.simg ./cannoli/bin/cannoli-submit interleaveFastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq

Now I want to run bwa, but I run into a problem with this command:

singularity exec -B ~/bulk/ bionic.simg /cannoli/bin/cannoli-submit --driver-memory 5G --executor-memory 30G -- bwa /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fastq /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_0259.adam sample -index /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta -sequence_dictionary /home/tahmad/bulk/bio_data/gnome/hg19/hg19.dict -force_load_ifastq -fragments -add_files

singularity.txt

Cannoli and BWA-MEM are installed locally in my Singularity container (bionic.simg).

@tahashmi
Author

It was due to assigning too little memory to the --driver-memory and --executor-memory options. Resolved.

@heuermh
Member

heuermh commented Sep 30, 2019

Hello @tahashmi, thank you for submitting this issue. Sorry for not replying last week.

Curious, what is your target platform for running this analysis? Running bwa via Cannoli on a single node will not provide any performance benefits, unless perhaps you are starting from Fragments in Parquet format.

@heuermh heuermh added this to the 0.8.0 milestone Sep 30, 2019
@tahashmi
Author

Thank you @heuermh for your reply.

I have to run it on a cluster with more than one node.

I have a few more questions, if you could kindly answer them.

  1. Exactly how much memory should be allocated to the --driver-memory and --executor-memory Spark options if the FASTQ reads file is 60 GB?

  2. How can I run such a variant calling pipeline through Slurm 'sbatch' on a multi-node cluster, i.e. nodes=10 with 28 cores per node? Or can I only run this through cannoli-shell by putting all commands in a Scala file and passing that file to cannoli-shell (like: /cannoli/bin/cannoli-shell --driver-memory 20G --executor-memory 20G -i /home/tahmad/bulk_mnt/images/adam.scala)?

  3. If I have nodes=10 and cores/node=28, how many --ntasks and --cpus-per-task should be assigned (Ref)?

Thanks.

@tahashmi tahashmi reopened this Sep 30, 2019
@heuermh
Member

heuermh commented Sep 30, 2019

Exactly how much memory should be allocated to the --driver-memory and --executor-memory Spark options if the FASTQ reads file is 60 GB?

Sorry, that is somewhat of an art, not a science, especially if Cannoli and external processes are involved. You need to reserve enough RAM for Spark to run and enough RAM for bwa to run. Also of interest will be the number of partitions for the input data set.
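
Not part of the original comment, just a rough sketch of how one might check and adjust the partition count from cannoli-shell before alignment; the repartition target below (280 = 10 nodes x 28 cores) is only an illustrative assumption.

// sketch (assumption, not from the thread): inspect and adjust partitions in cannoli-shell,
// where `sc` is the SparkContext the shell provides
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.formats.avro.Fragment

val fragments = sc.loadPairedFastqAsFragments(
  "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq",
  "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")

// how many partitions did Spark create for the input?
println(fragments.rdd.getNumPartitions)

// if that is too few to keep all executor cores busy, repartition;
// 280 (10 nodes x 28 cores) is an example only
val repartition: RDD[Fragment] => RDD[Fragment] = _.repartition(280)
val repartitioned = fragments.transform(repartition)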

On a cluster with HDFS shared disk, the alignment step will be much faster if the data are in Parquet format. Parquet format has the additional benefit of being much smaller on disk. You can do this by writing out Parquet Fragments instead of interleaved FASTQ format, e.g. in cannoli-shell:

$ cannoli-shell \
  $SPARK_ARGS
...
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")
fragments: org.bdgenomics.adam.rdd.fragment.FragmentDataset = RDDBoundFragmentDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> fragments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam")

Then you can read from the Parquet Fragments in the bwa step:

$ cannoli-submit \
  $SPARK_ARGS \
  -- \
  bwa \
  -index /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta \
  -sequence_dictionary /home/tahmad/bulk/bio_data/gnome/hg19/hg19.dict \
  /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.fragments.adam \
  /home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.alignments.adam \
  sample

Or, if more convenient, skip writing interleaved FASTQ/fragments to disk entirely:

$ cannoli-shell \
  $SPARK_ARGS
...
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val fragments = sc.loadPairedFastqAsFragments("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_1.fastq", "/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025_2.fastq")
fragments: org.bdgenomics.adam.rdd.fragment.FragmentDataset = RDDBoundFragmentDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> import org.bdgenomics.cannoli.BwaArgs
import org.bdgenomics.cannoli.BwaArgs

scala> val args = new BwaArgs()
args: org.bdgenomics.cannoli.BwaArgs = org.bdgenomics.cannoli.BwaArgs@79ec9d81

scala> args.indexPath = "/home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta"
args.indexPath: String = /home/tahmad/bulk/bio_data/gnome/hg19/hg19.fasta

scala> args.sample = "sample"
args.sample: String = sample

scala> import org.bdgenomics.cannoli.Cannoli._
import org.bdgenomics.cannoli.Cannoli._

scala> val alignments = fragments.alignWithBwa(args)
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordDataset = RDDBoundAlignmentRecordDataset with 0 reference sequences, 0 read groups, and 0 processing steps

scala> alignments.dataset.show
+-------------...
|referenceName|start|originalStart...
|            1| 7811|         null...
|            1| 7847|         null...
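
Not from the original comment; as a possible continuation of the session above, the in-memory alignments could then be written out, e.g. as Parquet or as a single coordinate-sorted BAM. The output paths below are placeholders.

// sketch only, continuing the session above: persist the alignments as Parquet
alignments.saveAsParquet("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.alignments.adam")

// or sort by reference position and write a single BAM file
alignments
  .sortReadsByReferencePosition()
  .saveAsSam("/home/tahmad/bulk/bio_data/reads/gcat_025/gcat_025.bam", asSingleFile = true, isSorted = true)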

How can I run such a variant calling pipeline through Slurm 'sbatch' on a multi-node cluster, i.e. nodes=10 and each node has 28 cores?

You may want to spend some time with your cluster administrator discussing the best way to run Apache Spark jobs in general first. Hopefully Spark has already been installed as a Slurm module and they might have further advice on how to configure Spark executor nodes and how to take advantage of any shared disk the cluster might have.

For either cannoli-submit or cannoli-shell, you can specify the Apache Spark master node via the --master spark://hostname:7077 command line argument.

In your example above, it appears you are running Spark in single-node mode within a Singularity container. Cannoli supports calling out to external applications (e.g. bwa) via Singularity but running Spark and Cannoli from within a container may be unnecessarily complicating things.

You are also using the -add_files argument, which uses Spark to distribute the index files to local disk on each of the executor nodes. If the index files are on shared disk, you won't need to do this.

Please keep asking questions, and I'll do my best to help!

@tahashmi
Author

tahashmi commented Oct 1, 2019

Thanks again.

First, I'll try running it as per your recommendations, and then I may ask more questions if any come up :-)

@tahashmi tahashmi closed this as completed Oct 1, 2019
@tahashmi tahashmi reopened this Oct 14, 2020
@tahashmi
Author

Hi @heuermh, I have a question regarding running the sorting and duplicate-marking (mdup) stages with BWA.

How can the whole pipeline below be run on a Spark cluster with the submit option? In which file format should I save it, and how do I pass that file to submit?

I am using Slurm on my cluster, and I want to use the $ADAM_HOME/bin/adam-submit --master spark://hostnamefromslurmdotout:7077 command to submit my job.

Thanks.

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.cannoli.cli._
import org.bdgenomics.cannoli.cli.Cannoli._

val sample = "sample"
val reference = "ref.fa"

// load the paired-end FASTQ files as fragments
val reads = sc.loadPairedFastqAsFragments(sample + "_1.fq", sample + "_2.fq")

// configure bwa: sample name and path to the bwa index
val bwaArgs = new BwaArgs()
bwaArgs.sample = sample
bwaArgs.indexPath = reference

// align, sort by reference position, and mark duplicates
val alignments = reads.alignWithBwa(bwaArgs)
val sorted = alignments.sortReadsByReferencePositionAndIndex()
val markdup = sorted.markDuplicates()

// call variants with freebayes against the same reference
val freebayesArgs = new FreebayesArgs()
freebayesArgs.referencePath = reference

val variantContexts = markdup.callVariantsWithFreebayes(freebayesArgs)

// write out the variant calls as bgzipped VCF
variantContexts.saveAsVcf(sample + ".freebayes.vcf.bgzf")

@heuermh
Member

heuermh commented Oct 22, 2020

You could save it to a file, e.g. workflow.scala, and run it with adam-shell ... -i workflow.scala. Alternatively, you could run adam-shell and use paste mode to run it interactively.

adam-shell is a wrapper around Spark shell, so the docs here might be helpful:
https://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
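
One practical note (my assumption, not something covered above): spark-shell tends to stay at the interactive prompt after executing a script passed with -i, so for unattended runs under Slurm the script can terminate the shell itself once the output is written.

// hypothetical tail of workflow.scala when run via adam-shell ... -i workflow.scala:
// write the final VCF, then exit the shell so the batch job completes
variantContexts.saveAsVcf(sample + ".freebayes.vcf.bgzf")
System.exit(0)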

Hope this helps!

@heuermh heuermh modified the milestones: 0.8.0, 0.11.0 Oct 22, 2020
@tahashmi
Author

Hi @heuermh, thanks!
