Documentation for bcbio: bcbio-nextgen readthedocs
Follow instructions for starting an analysis using
Download fastq files from facility to data folder
Download fastq files from a non-password protected url
wget --mirror url
(run this for each file of each sample in each lane)
Rory's code to concatenate files for the same samples on multiple lanes:
barcodes="BC1 BC2 BC3 BC4"
for barcode in $barcodes
do
    # ${barcode} needs braces: $barcode_ would be read as a different (unset) variable
    find folder -name "${barcode}_*R1.fastq.gz" -exec cat {} \; > data/${barcode}_R1.fastq.gz
    find folder -name "${barcode}_*R2.fastq.gz" -exec cat {} \; > data/${barcode}_R2.fastq.gz
done
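The find calls above concatenate files in whatever order find returns them, so R1 and R2 could in principle be merged in different lane orders. Below is a minimal sketch of a variant that sorts the matches first. It assumes the lane files for both reads sort the same way by path; the function name merge_lanes and its arguments are illustrative and not part of Rory's code.

```shell
# Sketch: merge per-lane FASTQ files per barcode in a stable, sorted order.
# usage: merge_lanes <source folder> <destination folder> <barcode>...
merge_lanes() {
    local src="$1" dest="$2" barcode read
    shift 2
    mkdir -p "$dest"
    for barcode in "$@"; do
        for read in R1 R2; do
            # Sort matches by path so R1 and R2 are concatenated in the same lane order.
            # Concatenated gzip files are still a valid gzip stream.
            find "$src" -type f -name "${barcode}_*${read}.fastq.gz" -print0 \
                | sort -z \
                | xargs -0 cat > "$dest/${barcode}_${read}.fastq.gz"
        done
    done
}
```

For the example above this would be called as `merge_lanes folder data BC1 BC2 BC3 BC4`.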
Download from password protected FTP such as Dana Farber
wget -r --user <username> --password <pwd> -P <destination> <FTP address of folder>
Download fastq files from BioPolymers:
rsync -avr .
to correct folder
mget *.tab
mget *.bz2
Download from the Broad using Aspera:
- To download data I use this script.
Create the metadata in Excel. Build the symlink commands with CONCATENATE("ln -s ", $A2, " ", $D2), where column A holds the path to where the files are stored and column D holds the name of the symlink. To extract parts of a column by delimiter, use Text to Columns in the Data tab.
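If the sheet is exported as tab-separated text, the same symlinks can be created without building commands in Excel. This is only a sketch under assumptions: column 1 holds the path to the stored file and column 4 the symlink name (mirroring $A2 and $D2), the first row is a header, and the function name make_symlinks is hypothetical.

```shell
# Sketch: create symlinks from a tab-separated table.
# usage: make_symlinks <table.txt> <destination folder>
make_symlinks() {
    local table="$1" dest="$2" target name
    mkdir -p "$dest"
    # tr turns Excel carriage returns into newlines; awk skips the header and
    # any row missing column 1 or 4, then emits "target<TAB>name" pairs.
    tr '\r' '\n' < "$table" \
        | awk -F'\t' 'NR > 1 && $1 != "" && $4 != "" { print $1 "\t" $4 }' \
        | while IFS="$(printf '\t')" read -r target name; do
              ln -s "$target" "$dest/$name"
          done
}
```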
Save Excel as text and replace ^M with new lines in vim:
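The same clean-up can also be done on the command line instead of in vim. A minimal sketch, assuming the exported file uses either carriage returns alone or CRLF pairs as line endings; the function name fix_line_endings is illustrative.

```shell
# Sketch: convert carriage returns (shown as ^M in vim) to newlines.
# usage: fix_line_endings input.txt > output.txt
fix_line_endings() {
    # awk drops the CR of any CRLF pair; tr then turns remaining lone CRs into newlines.
    awk '{ sub(/\r$/, ""); print }' "$1" | tr '\r' '\n'
}
```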
Settings for bcbio: make sure you have the following settings in
export PATH=/n/app/bcbio/tools/bin:$PATH
Within the folder, add your comma-separated metadata file (projectname_rnaseq.csv):
- first column: the names of the fastq files as they appear in the directory, without the extension (no .fastq, and no R#.fastq for paired-end reads)
- second column: unique names to call the samples by
- FOR CHIP-SEQ, additional columns are needed for each sample, including batch: batch1, batch2, batch3, ... for grouping each input with its appropriate chip(s)
- additional specifics regarding the metadata file:
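Because the first column has to match the fastq file names, a quick check before running bcbio can catch typos. This is a sketch only: the function name check_metadata is hypothetical, it assumes the csv has a header row, and it treats a sample as present if any fastq file in the data folder starts with that name.

```shell
# Sketch: check that every name in the first csv column has at least one fastq file.
# usage: check_metadata <metadata.csv> <data folder>
check_metadata() {
    local csv="$1" data_dir="$2"
    tail -n +2 "$csv" | {
        missing=0
        while IFS=, read -r name _; do
            [ -n "$name" ] || continue
            # Prefix match: sampleA also matches sampleA_R1.fastq.gz and sampleA_R2.fastq.gz
            if ! ls "$data_dir/$name"*fastq* >/dev/null 2>&1; then
                echo "no fastq file found for sample: $name" >&2
                missing=$((missing + 1))
            fi
        done
        # The function succeeds only if no sample was missing
        [ "$missing" -eq 0 ]
    }
}
```

It returns a non-zero status and lists the unmatched names on stderr if any sample has no file.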
Within the folder, add your custom Illumina template. Example template for mouse RNA-seq using Illumina prepared samples (genome_build: mouse = mm10, human = hg19 or hg38; change the aligner from star to hisat2 if using hg38):
# Template for mouse RNA-seq using Illumina prepared samples
---
details:
  - analysis: RNA-seq
    genome_build: mm10
    algorithm:
      aligner: star
      quality_format: standard
      strandedness: firststrand
      tools_on: bcbiornaseq
      bcbiornaseq:
        organism: mus musculus
        interesting_groups: [genotype]
upload:
  dir: /n/data1/cores/bcbio/PIs/vamsi_mootha/hbc_mootha_rnaseq_of_metabolite_transporter_KO_mouse_livers_hbc03618_1/bcbio_final
- List of genomes available can be found by running
- strandedness options:
- Additional parameters can be found:
- Best practice templates can be found:
Within the folder, add all your fastq files to analyze.
Go to
and create an analysis folder. Change directories to the analysis folder and create the full Illumina instructions using the Illumina template created in Set-up: step #6.
# start interactive job
srun --pty -p interactive -t 0-12:00 --mem 8G bash
# change directories to analysis
cd path-to-folder/analysis
# run command to create the full yaml file
-w template /n/data1/cores/bcbio/PIs/path_to_templates/star-illumina-rnaseq.yaml /n/data1/cores/bcbio/PIs/path_to_meta/*-rnaseq.csv /n/data1/cores/bcbio/PIs/path_to_data/*fastq.gz
Create script for running the job (in analysis folder)
For a larger job:
#!/bin/bash
#SBATCH -p medium
#SBATCH -J mootha
#SBATCH -o run.o
#SBATCH -e run.e
#SBATCH -t 0-100:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=8G
#SBATCH --mail-type=ALL
export PATH=/n/app/bcbio/tools/bin:$PATH
/n/app/bcbio/dev/anaconda/bin/ ../config/*_rnaseq.yaml -n 48 -t ipython -s slurm -q medium -r t=0-100:00 --timeout 300 --retries 3
For a smaller job, it might be faster in overall time to just run the job on the priority queue. If you only have a few samples, and your fairshare score is low, running on the priority queue could end up being faster since you will quickly get a job there and not have to wait.
#!/bin/bash
#SBATCH -p priority
#SBATCH -J mootha
#SBATCH -o run.o
#SBATCH -e run.e
#SBATCH -t 0-100:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=64G
#SBATCH --mail-type=ALL
export PATH=/n/app/bcbio/tools/bin:$PATH
/n/app/bcbio/dev/anaconda/bin/ ../config/*_rnaseq.yaml -n 8
Go to the work folder and start the job (make sure you are in an interactive session):
cd /n/scratch2/path_to_folder/analysis/*_rnaseq/work
sbatch ../../runJob-*_rnaseq.slurm
The bam files will be located here:
# needs to be updated -
Extracting interesting region (example)
samtools view -h -b sample1.bam "chr2:176927474-177089906" > sample1_hox.bam
samtools index sample1_hox.bam
sshfs ~/bcbio -o volname=bcbio -o follow_symlinks