Bcbio workflow

Documentation for bcbio: bcbio-nextgen readthedocs


  1. Follow instructions for starting an analysis using

  2. Download fastq files from facility to data folder

    • Download fastq files from a non-password protected url

      • wget --mirror url (for each file of sample in each lane)

      • Rory's code to concatenate files for the same samples on multiple lanes:

          barcodes="BC1 BC2 BC3 BC4"
          for barcode in $barcodes
          do
              # Braces keep the shell from reading "$barcode_" as a different variable
              find folder -name "${barcode}_*R1.fastq.gz" -exec cat {} \; > data/${barcode}_R1.fastq.gz
              find folder -name "${barcode}_*R2.fastq.gz" -exec cat {} \; > data/${barcode}_R2.fastq.gz
          done
    • Download from a password-protected FTP server:

      • Dana Farber: wget -r <FTP address of folder> --user <username> --password <pwd> <destination>

      • MGH: wget -r --user=<username> --password=<pwd>*

    • Download fastq files from BioPolymers:

      • rsync -avr .


      • sftp
      • cd to correct folder
      • mget *.tab
      • mget *.bz2
    • Download from the Broad using Aspera:

      • To download data I use this script.
  3. Create the metadata in Excel. Build the symlink commands with concatenate("ln -s ", column $A2 holding the path to where the files are stored, " ", column $D2 holding the name of the symlink). Parts of a column can be extracted by delimiter with Text to Columns on the Data tab.
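The same symlinks can also be created directly in the shell instead of pasting commands out of the spreadsheet. The following is a minimal sketch, assuming a hypothetical tab-separated file `links.tsv` whose first column is the path to the stored fastq file and whose second column is the symlink name. The temporary directory, file names, and sample names are placeholders for illustration only.

```shell
# Demo setup (hypothetical names): two stored files and a two-column table.
workdir=$(mktemp -d)
mkdir -p "$workdir/stored" "$workdir/data"
touch "$workdir/stored/run1_BC1_R1.fastq.gz" "$workdir/stored/run1_BC2_R1.fastq.gz"
printf '%s\t%s\n' \
    "$workdir/stored/run1_BC1_R1.fastq.gz" "sampleA_R1.fastq.gz" \
    "$workdir/stored/run1_BC2_R1.fastq.gz" "sampleB_R1.fastq.gz" \
    > "$workdir/links.tsv"

# Column 1: path to where the file is stored; column 2: name of the symlink.
tab=$(printf '\t')
while IFS="$tab" read -r source_path link_name
do
    ln -s "$source_path" "$workdir/data/$link_name"
done < "$workdir/links.tsv"

ls -l "$workdir/data"
```

Reading the table in a loop avoids the Excel export step for this part, but the table itself still has to list one source path and one link name per line.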

  4. Save Excel as text and replace ^M with new lines in vim:
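As a shell alternative to editing in vim, the carriage returns can be translated with `tr`. This is a minimal sketch, assuming the exported file uses bare carriage returns as line endings (which is what vim displays as ^M). The file names and contents are hypothetical.

```shell
# Demo input with carriage-return line endings (hypothetical content).
workdir=$(mktemp -d)
printf 'samplename,description\rBC1,control1\rBC2,treated1\r' > "$workdir/meta_cr.csv"

# Translate every carriage return (shown as ^M in vim) into a newline.
tr '\r' '\n' < "$workdir/meta_cr.csv" > "$workdir/meta.csv"

cat "$workdir/meta.csv"
```

If the export instead uses Windows-style line endings (carriage return plus newline), deleting the carriage returns with `tr -d '\r'` would be the appropriate variant, since translating them would leave blank lines.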


  5. Settings for bcbio: make sure you have the following in your ~/.bashrc file:

   export PATH=/n/app/bcbio/tools/bin:$PATH
  1. Within the meta folder, add your comma-separated metadata file (projectname_rnaseq.csv)

    • The first column is samplename: the names of the fastq files as they appear in the directory, without the extension (no .fastq, and no R#.fastq for paired-end reads).
    • The second column is description: the unique names you want the samples to be called by.
    • FOR CHIP-SEQ, additional columns are needed:
      • phenotype: chip or input for each sample
      • batch: batch1, batch2, batch3, ... for grouping each input with its appropriate chip(s)
    • additional specifics regarding the metadata file:
  2. Within the config folder, add your custom Illumina template

    • Example template for mouse RNA-seq using Illumina prepared samples (genome_build: mouse = mm10, human = hg19 or hg38; change star to hisat2 if using hg38):
    # Template for mouse RNA-seq using Illumina prepared samples
    ---
    details:
      - analysis: RNA-seq
        genome_build: mm10
        algorithm:
          aligner: star
          quality_format: standard
          strandedness: firststrand
          tools_on: bcbiornaseq
          bcbiornaseq:
            organism: mus musculus
            interesting_groups: [genotype]
    upload:
      dir: /n/data1/cores/bcbio/PIs/vamsi_mootha/hbc_mootha_rnaseq_of_metabolite_transporter_KO_mouse_livers_hbc03618_1/bcbio_final
  3. Within the data folder, add all your fastq files to analyze.
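Before creating the full yaml file, it can help to confirm that every samplename in the metadata file has matching fastq files in the data folder. The following is a minimal sketch, assuming paired-end files named `<samplename>_R1.fastq.gz` and `<samplename>_R2.fastq.gz`. The temporary project directory and the sample names are hypothetical, and the demo deliberately leaves one file out.

```shell
# Demo setup (hypothetical project): one complete sample, one missing its R2 file.
workdir=$(mktemp -d)
mkdir -p "$workdir/meta" "$workdir/data"
cat > "$workdir/meta/projectname_rnaseq.csv" <<'EOF'
samplename,description
BC1,control_rep1
BC2,knockout_rep1
EOF
touch "$workdir/data/BC1_R1.fastq.gz" "$workdir/data/BC1_R2.fastq.gz" \
      "$workdir/data/BC2_R1.fastq.gz"

# Skip the header row, then look for both mates of every samplename.
missing=""
for sample in $(tail -n +2 "$workdir/meta/projectname_rnaseq.csv" | cut -d, -f1)
do
    for mate in R1 R2
    do
        fastq="$workdir/data/${sample}_${mate}.fastq.gz"
        if [ ! -e "$fastq" ]; then
            missing="$missing ${sample}_${mate}.fastq.gz"
        fi
    done
done

echo "missing:$missing"
```

An empty list after `missing:` means every sample in the metadata file has both of its files. The loop assumes sample names contain no spaces or commas.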


  1. Go to /n/scratch3/groups/hsph/hbc/your_ECommonsID/PI and create an analysis folder. Change directories to analysis folder and create the full Illumina instructions using the Illumina template created in Set-up: step #6.

    • srun --pty -p interactive -t 0-12:00 --mem 8G bash (start an interactive job)
    • cd path-to-folder/analysis (change directories to the analysis folder)
    • -w template /n/data1/cores/bcbio/PIs/path_to_templates/star-illumina-rnaseq.yaml /n/data1/cores/bcbio/PIs/path_to_meta/*-rnaseq.csv /n/data1/cores/bcbio/PIs/path_to_data/*fastq.gz (run command to create the full yaml file)
  2. Create script for running the job (in analysis folder)

    #!/bin/bash
    #SBATCH -p priority
    #SBATCH -J mootha
    #SBATCH -o run.o
    #SBATCH -e run.e
    #SBATCH -t 0-100:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=8G
    #SBATCH --mail-type=ALL
    export PATH=/n/app/bcbio/tools/bin:$PATH
    /n/app/bcbio/dev/anaconda/bin/ ../config/*_rnaseq.yaml -n 48 -t ipython -s slurm -q medium -r t=0-100:00 --timeout 300 --retries 3
  3. Go to the work folder and start the job. Make sure you are in an interactive session.

    cd /n/scratch2/path_to_folder/analysis/*_rnaseq/work
    sbatch ../../runJob-*_rnaseq.slurm

Exploration of region of interest

  1. The bam files will be located here: path-to-folder/*-rnaseq/analysis/*-rnaseq/work/align/SAMPLENAME/NAME_*-rnaseq_star/ # needs to be updated

  2. Extracting interesting region (example)

    • samtools view -h -b sample1.bam "chr2:176927474-177089906" > sample1_hox.bam

    • samtools index sample1_hox.bam
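The extraction above can be repeated for several samples with a loop. The following is a minimal sketch, assuming all BAM files sit in one directory. The temporary directory and the empty placeholder files are hypothetical, so the loop runs as a dry run that only prints the commands.

```shell
# Demo setup (hypothetical): empty placeholder files stand in for real BAMs.
workdir=$(mktemp -d)
touch "$workdir/sample1.bam" "$workdir/sample2.bam"

region="chr2:176927474-177089906"
run=echo   # dry run: print the commands; set run= to execute them on real BAMs

for bam in "$workdir"/*.bam
do
    # Skip outputs from an earlier run so they are not extracted again.
    case "$bam" in *_hox.bam) continue ;; esac
    out="${bam%.bam}_hox.bam"
    # -o writes the extracted region to a file instead of standard output.
    $run samtools view -h -b -o "$out" "$bam" "$region"
    $run samtools index "$out"
done > "$workdir/commands.txt"

cat "$workdir/commands.txt"
```

Region queries with `samtools view` require the input BAM to be coordinate-sorted and indexed, so the real files need an index next to them before the dry run is switched off.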

Mounting bcbio

sshfs ~/bcbio -o volname=bcbio -o follow_symlinks