IsoWorm is a Snakemake pipeline developed to quantify isoform expression levels in large RNA-seq datasets (paired-end short reads). The pipeline consists of a series of interconnected modules that perform successive stages of the analysis. It starts from a txt file containing SRA IDs, while the RNA-seq library type (or a BAM file) and the reference files (FASTA and GTF) are specified in the Snakemake config file. The custom module of IsoWorm can be used to analyse specific isoforms (in our case study, BRAF), using custom GTF files to quantify isoform-specific genomic regions. Quantification is performed with StringTie and all plots are generated in R. The Salmon module of IsoWorm, by contrast, quantifies all the isoforms annotated in the Ensembl database (our reference); an R script then generates pie charts of gene isoform expression. IsoWorm also implements a module for single-end reads that identifies polyA sites using custom R scripts, starting from QuantSeq 3' REV sequencing data. An extra module quantifies gene expression in single-cell data.
The input files and parameters are specified in config_final.yml, and those for the R plots and scripts in the config file for R:
workflow_type: ""
- options: "polyA_module", "salmon_module", "custom_module", "custom_and_salmon_modules", "singlecell_module"
sourcedir:
- your output directory
refdir:
- directory with your GTF, FASTA and all other reference files
sampledir:
- directory with your txt sample files
envsdir:
- directory with your envs files
workflow:
- directory with your workflow (.smk) files
samples:
- your txt file containing the SRA samples
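Because `workflow_type` accepts only the five values listed above, it can be worth checking the configuration before launching a long run. The following is a minimal sketch and not part of IsoWorm: the function name `check_config` is hypothetical, a plain dict stands in for the parsed YAML file, and treating every key as mandatory is an assumption. Only the key names and option values come from the list above.

```python
# Hypothetical pre-flight check; not part of the IsoWorm code base.
ALLOWED_WORKFLOWS = {
    "polyA_module",
    "salmon_module",
    "custom_module",
    "custom_and_salmon_modules",
    "singlecell_module",
}

# Assumption: all of these keys must be filled in for every workflow type.
REQUIRED_KEYS = [
    "workflow_type",
    "sourcedir",
    "refdir",
    "sampledir",
    "envsdir",
    "workflow",
    "samples",
]


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing or empty key: {key}")
    workflow_type = config.get("workflow_type")
    if workflow_type and workflow_type not in ALLOWED_WORKFLOWS:
        problems.append(f"unknown workflow_type: {workflow_type!r}")
    return problems


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    example = {
        "workflow_type": "salmon_module",
        "sourcedir": "results/",
        "refdir": "refs/",
        "sampledir": "samples/",
        "envsdir": "envs/",
        "workflow": "workflow/",
        "samples": "samples.txt",
    }
    for problem in check_config(example) or ["config looks complete"]:
        print(problem)
```

Returning a list of problems rather than raising on the first one lets all mistakes be reported in a single pass.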
stargenomedir, GRCh38.primary_assembly.genome:
- directory for the STAR genome and single-cell index files
fasta: GRCh38.primary_assembly.genome:
- genome FASTA reference file for STAR and single cell
fasta_salmon: GRCh38.primary_assembly.genome:
- transcript FASTA reference for Salmon
gtf: GRCh38.primary_assembly.genome:
- GTF file for all transcripts
gtf_personal: GRCh38.primary_assembly.genome:
- GTF file customized for your transcript of interest
SAindex
- STAR index
{sample_name}_SE_small_Aligned.sortedByCoord.out.bam
- sliced BAM of your gene of interest (BRAF in our case study), single end
polyA_filtered_3UTR204.csv
- polyA peaks in the BRAF-204 UTRs
polyA_filtered_3UTR220.csv
- polyA peaks in the BRAF-220 UTRs
salmon_index
- Salmon index
quant.sf
- all transcripts quantified by Salmon
ratio_salmon.pdf
- box plots of the ratio between our two isoforms of interest
pie_charts.pdf
- pie charts of the expression values of all our isoforms of interest
total_salmon.pdf
- total expression levels of our gene of interest
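To make the relationship between quant.sf and the ratio and pie-chart plots more concrete, the sketch below computes each isoform's share of the summed TPM for a chosen set of transcripts. It is an illustration in Python, not the pipeline's own method (IsoWorm produces its plots with R scripts). The function name `isoform_fractions`, the transcript names `TX_A`, `TX_B`, `TX_C` and the demo values are all made up; the column names (`Name`, `TPM`) are those of Salmon's tab-separated quant.sf output, and exact matching on `Name` is an assumption that depends on how the transcript FASTA headers were written.

```python
import csv
import io


def isoform_fractions(quant_sf, transcript_ids):
    """Return each listed transcript's share of their summed TPM.

    quant_sf is an open text handle on a Salmon quant.sf file.
    """
    tpm = {}
    reader = csv.DictReader(quant_sf, delimiter="\t")
    for row in reader:
        if row["Name"] in transcript_ids:
            tpm[row["Name"]] = float(row["TPM"])
    total = sum(tpm.values())
    if total == 0:
        # Avoid dividing by zero when none of the isoforms is expressed.
        return {name: 0.0 for name in tpm}
    return {name: value / total for name, value in tpm.items()}


if __name__ == "__main__":
    # Made-up rows in quant.sf layout, for illustration only.
    demo = io.StringIO(
        "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
        "TX_A\t2000\t1800.0\t30.0\t300\n"
        "TX_B\t1500\t1300.0\t10.0\t80\n"
        "TX_C\t900\t700.0\t5.0\t20\n"
    )
    print(isoform_fractions(demo, {"TX_A", "TX_B"}))
```

With a real run, the handle would come from `open("quant.sf")` in the sample's Salmon output directory, and the transcript set would hold the identifiers of the isoforms of interest.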
SAindex
- STAR index
{file}_small_Aligned.sortedByCoord.out.bam
- sliced BAM of your gene of interest (BRAF in our case study)
ratio_BRAF.pdf
- box plots of the ratio between our two isoforms of interest
filtered_data.h5ad
- filtered counts from a single single-cell sample
umap_plot.png
- UMAP plot with the different cell clusters
qc_plots
- quality control plots
- miniconda - install it according to the instructions.
- snakemake - install it using conda.
- The rest of the dependencies are automatically installed using the conda feature of snakemake.
Clone the repository:
git clone https://github.com/ctglab/isoworm
Edit config.yml to set the input datasets and parameters, edit config.R to set the input datasets and parameters for R, and edit script.sh with the directory where you want to download your FASTQ files, then issue:
snakemake -s snakefile_final.smk --use-conda --rerun-incomplete --cores 2 -k
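When the pipeline is launched from a script rather than typed by hand, the same invocation can be assembled as an argument list. This is only a sketch under stated assumptions: the helper `build_snakemake_command` is hypothetical and not shipped with IsoWorm, and it spells the core-count option in its full form `--cores`. The remaining flags are the ones used in the command above (`-s` for the Snakefile, `--use-conda`, `--rerun-incomplete`, and `-k` to keep going after a failed job).

```python
import shlex


def build_snakemake_command(snakefile="snakefile_final.smk", cores=2, keep_going=True):
    """Assemble the Snakemake invocation as a list of arguments."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = [
        "snakemake",
        "-s", snakefile,
        "--use-conda",
        "--rerun-incomplete",
        "--cores", str(cores),
    ]
    if keep_going:
        cmd.append("-k")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_snakemake_command(cores=4)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the run; keeping the arguments as a list avoids shell-quoting problems if the Snakefile path contains spaces.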