A pipeline for running single-cell demultiplexing simulations with demuxlet.
demux is a Snakemake pipeline that simulates a multiplexed droplet scRNA-seq (dscRNA-seq) experiment from individual scRNA-seq samples and quantifies how well demuxlet deconvolutes the sample identity of each cell in the simulated dataset. Such an analysis is helpful for reducing the cost of library preparation for dscRNA-seq experiments.
Here is an example flowchart depicting the demux pipeline with five input samples.
Each step is briefly described below:
- unique_barcodes: aggregate cell barcodes across all samples provided as input and remove any cell barcodes that appear more than once
- simulate: simulate a multiplexed dscRNA-seq experiment with a specified doublet rate (default: 0.3). The doublet rate specifies the proportion of cells from the aggregate dataset expected to be found in doublets. We define two types of doublets: (1) doublets containing cells from different samples, and (2) doublets containing cells from the same sample
- table: create a reference table mapping the original (ground truth) barcodes to the new barcodes (for analyzing demuxlet performance)
- new bam: edit the BAM files corresponding to each sample provided as input to reflect simulated doublets. For every pair of cells randomly selected to be in a doublet, we change the cell barcode of one cell in the pair to match that of the other cell.
- merge: merge the edited BAM files into one BAM file to reflect a multiplexed experiment.
- sort: sort the merged BAM file
- demux: run demuxlet with the BAM file as input
- results: analyze demuxlet performance
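The unique_barcodes, simulate, and table steps can be sketched roughly as follows. This is a minimal illustration and not the pipeline's actual scripts: the function names, the short barcodes, and the interpretation of the doublet rate as "fraction of cells that end up in doublets" (so half that many pairs are drawn) are assumptions made for the example.

```python
import random
from collections import Counter


def unique_barcodes(barcodes_by_sample):
    """Keep only (sample, barcode) cells whose barcode occurs exactly once overall."""
    counts = Counter(bc for bcs in barcodes_by_sample.values() for bc in bcs)
    return [
        (sample, bc)
        for sample, bcs in barcodes_by_sample.items()
        for bc in bcs
        if counts[bc] == 1
    ]


def simulate_doublets(cells, rate=0.3, seed=0):
    """Return a table mapping each (sample, original barcode) to its new barcode.

    Assumption: `rate` is the fraction of cells placed in doublets, so
    int(rate * len(cells)) // 2 pairs are drawn at random.
    """
    rng = random.Random(seed)
    n_pairs = int(rate * len(cells)) // 2
    chosen = rng.sample(cells, 2 * n_pairs)
    table = {cell: cell[1] for cell in cells}  # default: barcode unchanged
    for first, second in zip(chosen[::2], chosen[1::2]):
        # The second cell in the pair takes on the barcode of the first.
        table[second] = first[1]
    return table


# Hypothetical toy input: "AAAC" appears in both samples and is dropped.
barcodes = {
    "sampleA": ["AAAC", "AAAG", "AAAT", "CCCA"],
    "sampleB": ["AAAC", "GGGA", "GGGC", "GGGT"],
}
cells = unique_barcodes(barcodes)
table = simulate_doublets(cells, rate=0.5)
```

A pair drawn from two different samples corresponds to the first doublet type described above, and a pair drawn from one sample corresponds to the second. The resulting table plays the role of the ground-truth reference used later to score demuxlet.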
Clone the repository by executing the following command.
git clone https://github.com/zrcjessica/demux.git
The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:
conda create -n snakemake -c bioconda -c conda-forge 'snakemake==5.18.0' --no-channel-priority
We highly recommend you install Snakemake via conda like this so that you can use the --use-conda flag when calling snakemake to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.
demux minimally requires the following inputs, which must be specified in the config.yml file:
- a list of individually processed samples
- for each sample above, the following Cell Ranger outputs from the cellranger count pipeline:
  - Barcoded BAM
  - Cell barcodes from Filtered Feature-Barcode Matrix
- a vcf file containing the genotypes of all samples from above
See below for additional input parameters.
It is recommended to symlink your data into the gitignored data/ folder:
ln -s /path/to/your/data data
If you ever need to switch the input to a different dataset, you can just change the symlink path.
demux returns a table summarizing the performance of demuxlet on the simulated data and a plot showing the precision-recall curves.
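As a rough illustration of what a single point on a precision-recall curve measures, the sketch below computes precision and recall for one class by comparing ground-truth labels against predicted labels. It is not the pipeline's analysis script; the function name and the "doublet"/"singlet" labels are assumptions made for the example.

```python
def precision_recall(truth, predicted, positive="doublet"):
    """Compute precision and recall for one class from parallel label lists."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five simulated barcodes.
truth = ["doublet", "singlet", "singlet", "doublet", "singlet"]
predicted = ["doublet", "singlet", "doublet", "singlet", "singlet"]
precision, recall = precision_recall(truth, predicted)
```

Repeating this calculation while varying the threshold applied to a classifier score traces out a curve; which score is thresholded for the plot is not stated here.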
You can also symlink your output, if you think you might want to change it in the future:
ln -s /iblm/netapp/data1/jezhou/Telese_Rat_Amygdala/demultiplex_simulation/out out
Locally:
./run &
or on an SGE cluster:
qsub run
You must modify the config.yml file to specify paths to your data. The config file is currently configured to run the pipeline on our data (in the git-ignored data/ folder). The config file contains the following variables:
The data variable contains nested variables for each of your samples, with the paths to their corresponding BAM (reads) and filtered barcodes (barcodes) files (Cell Ranger output) as well as the sample's vcf_id.
Give the path to the vcf file containing genotypes for all samples nested in the data variable.
List the samples from those nested in the data variable that you want to be included as input to the demultiplexing simulation. If this line is not provided or is commented out, all samples from the data variable will be used.
Doublet rate to be used for demultiplexing simulations. Defaults to 0.3.
Path to directory in which to write output files. If not provided, defaults to out. The directory will be created if it does not already exist.
* Inputs required
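The defaults described above could be resolved along these lines once the config has been loaded into a dictionary. This is only a sketch: data, reads, barcodes, and vcf_id are named in the text above, but the top-level key names vcf, samples, rate, and out, as well as the file paths, are hypothetical and should be checked against the actual config.yml.

```python
def resolve_config(config):
    """Validate per-sample entries and fill in the documented defaults."""
    data = config["data"]  # required: one nested entry per sample
    for sample, entry in data.items():
        missing = {"reads", "barcodes", "vcf_id"} - set(entry)
        if missing:
            raise KeyError(f"{sample} is missing {sorted(missing)}")
    return {
        "data": data,
        "vcf": config["vcf"],  # hypothetical key name for the genotype file
        # Hypothetical key names below; defaults follow the descriptions above.
        "samples": config.get("samples") or list(data),
        "rate": config.get("rate", 0.3),
        "out": config.get("out", "out"),
    }


# Hypothetical config with only the required inputs set.
example = {
    "data": {
        "sampleA": {"reads": "data/A.bam", "barcodes": "data/A.tsv.gz", "vcf_id": "A"},
        "sampleB": {"reads": "data/B.bam", "barcodes": "data/B.tsv.gz", "vcf_id": "B"},
    },
    "vcf": "data/genotypes.vcf.gz",
}
resolved = resolve_config(example)
```

With only the required inputs set, every sample is used, the doublet rate falls back to 0.3, and output goes to out.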
A Snakemake pipeline for running the demultiplexing simulation.
Config file that defines options and input for the pipeline.
Various scripts used by the pipeline. See the script README for more information.
The dependencies of our pipeline, specified as conda environment files. These are used by Snakemake to automatically install our dependencies at runtime.
An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.