Skip to content

DRUG-seq (Digital RNA with pertUrbation of Genes) [1] is a new technique that provides a cost-effective way for high-throughput, whole transcriptome RNA-seq profiling.

License

Notifications You must be signed in to change notification settings

MSDLLCpapers/DRAGoN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

John Gaspar

  • 19 Jul, 2019

Scott Norton

  • 12 Jul, 2022
  • 01 Jul, 2024
  • 13 Sep, 2024
  • 22 Jan, 2025

Introduction

DRUG-seq (Digital RNA with pertUrbation of Genes) [1] is a new technique that provides a cost-effective way for high-throughput, whole transcriptome RNA-seq profiling. Cells grown in multiwell plates are lysed, and reverse transcription is performed using oligos that contain barcodes that are unique to each well (Fig. 1a, from Ye et al., 2018). After paired-end sequencing, the R1 reads contain the barcodes, as well as unique molecular identifiers (UMIs) that allow for the removal of PCR duplicates (Fig. 1b). The R2 reads match the original mRNA sequence.

The computational pipeline consists of several steps: aligning the R2 reads to a genome (or transcriptome), counting gene overlaps, and then demultiplexing and removing PCR duplicates using the R1 reads. The analyses in the DRUG-seq paper were performed with custom code, to which we do not have access. Instead, we modified the SpatialTranscriptomics Pipeline [2] to produce output matrices for DRUG-seq data (Fig. 2, from Fernández Navarro et al., 2017). This pipeline, based on ST Pipeline v1.7.6, would time out on production-scale datasets, necessitating a new approach. Attempts to co-opt zUMIs [4] allowed a modest improvement in the maximum dataset size we could handle, however this would fail on memory constraints.

This repository presents two alternative pipelines implemented in nextflow. The first, STARsolo.nf, wraps STAR running in Solo mode [5] to provide a rough estimate of UMI counts. The second is DRAGoN.nf, a novel proprietary pipeline called DRUG-seq/RNA-seq Analysis of Genetically-perturbed cells on Nextflow (DRAGoN), employs a more robust duplicate removal algorithm. Both of these pipelines leverage Nextflow's efficient resource distribution to accelerate processing while keeping memory usage in check.

Quick start

If your data are on S3...

You will need to explicitly tell nextflow which credentials to use. The recommendation is to use the following procedure:

  1. Create ~/.aws/credentials with mode 600 and define a profile there with your key pair. Example:
[my_aws_profile]
aws_access_key_id = YourAwsAccessKey
aws_secret_access_key = YourSecretAccessKey
  1. Create ~/.nextflow/config and add the following lines:
aws.profile = my_aws_profile
env.AWS_PROFILE = aws.profile

Replace my_aws_profile above with the actual name of your profile as defined in step 1.

  1. Update ~/.aws/credentials every time your access keys rotate.

Generic Linux environment

First, install git and whichever engine you wish to use to run processes (anaconda, singularity, docker). Then install nextflow 23.10.1 (or later) from https://get.nextflow.io. Finally submit your job using an appropriate nextflow run command for your system. If you have your own local cluster, you may need to configure nextflow to use it yourself. You will also need to specify full paths to reference datasets at runtime.

If you want to run this workflow using the conda profile, you will need to compile the binary tools yourself. See CONTRIBUTING.md for instructions.

Configuration profiles

This pipeline supports singularity, docker, and conda executors via the -profile argument. Additional executors can be enabled by supplying a custom config.

If no profile is provided:

  • You must have a minimum of 80 GB RAM available.
  • The conda executable must exist in the runtime environment.
  • Nextflow will submit jobs to the default (local) executor. The number of jobs that can be executed in parallel will therefore be limited by the resources available on the current host.

New in 1.1: You can now select between the DRAGoN and STARsolo pipelines via the main.nf script. By default, DRAGoN will be selected. Use the --pipeline switch to select the STARsolo pipeline if so desired.

With that configured, the pipeline can be launched as such:

nextflow run -w /SFS/scratch/$USER/.nextflow -profile ctc_hpc_sge MSDLLCPapers/DRAGoN [--PARAM VALUE ...|-params-file PARAMS.yml]



Usage

Typical pipeline command:

  nextflow run DRAGoN -profile <docker/singularity/.../institute> --IO.metadata metadata.xlsx --IO.fastqdir s3://bucket/path/to/fastq/folder/ --IO.outdir <OUTDIR>

--help                                [boolean, string] Show the help message for all top level parameters. When a parameter is given to `--help`, the full help message of that parameter will be printed.
--helpFull                            [boolean]         Show the help message for all non-hidden parameters.
--showHidden                          [boolean]         Show all hidden parameters in the help message. This needs to be used in combination with `--help` or `--helpFull`.

Input/Output options
  --IO.fastqdir                       [string]  Path containing fastq files. Can be an empty or nonexistent directory, in which case these will be generated.
  --IO.bcldir                         [string]  Path to raw base calls off the sequencer. Required if --IO.fastqdir is an empty or nonexistent directory.
  --IO.sampleSheet                    [string]  Path to sample sheet defining sample names and I5/I7 indices for use with bcl2fastq. Required if --IO.fastqdir is an empty or nonexistent directory.
  --IO.reference                      [string]  Name of a reference genome for building STAR indices. Soft required if --IO.star is an empty or nonexistent directory. Hard required if --IO.star is missing.
  --IO.star                           [string]  Path to STAR genome indices for the target species. Can be an empty or nonexistent directory, in which case these will be generated. If omitted, --IO.reference is required.
  --IO.gtf                            [string]  Path to gene annotation GTF file for building STAR indices. Required if --IO.star is an empty or nonexistent directory and --IO.reference is not provided.
  --IO.fasta                          [string]  Path to genome sequence fasta file for building STAR indices. Required if --IO.star is an empty or nonexistent directory and --IO.reference is not provided.
  --IO.contaminant                    [string]  A STAR (versionGenome == 2.7.4a) index of a contaminant genome reference (rRNA, ncRNA, adapter sequences, etc.). If supplied, removes reads that align to the contaminant genome from further consideration. Such reads will be marked with flag 0x800 (SUPPLEMENTARY).
  --IO.metadata                       [string]  Path to Excel spreadsheet defining the experiment metadata
  --IO.barcodes                       [string]  Path to tab-separated text file defining barcode sequences and well names. This file should have no header row. The first column contains the barcode sequence, and the second column contains the name of the well. All other columns are ignored. Required if --IO.metadata is omitted.
  --IO.sampleName                     [string]  Prefix of fastq filenames within --IO.fastqdir to process. For example, `--IO.sampleName Sample1` will select all files matching `path/to/fastq/Sample1_*.fastq.gz`. Specify multiple prefixes using a quoted list. Required if --IO.metadata is omitted.
  --IO.outdir                         [string]  Path to store the analysis outputs. Will be created if it does not already exist. Previous analyses in this directory will be overwritten.
  --IO.keep_bam                       [string]  If set, a BAM file is included in the output.
  --IO.phred                          [string]  Phred quality score offset  (accepted: 33, 64) [default: 33]
  --IO.no_prestage_index              [boolean] When running locally with locally-mounted STAR index, disable copying it to the workdir

Options for QC and well demultiplexing
  --Demux.dmuxn                       [integer] Split input fastq into chunks of this size (0: no chunking) [default: 0]
  --Demux.split                       [integer] Subdivide the subsequent tasks into at most this many parallel jobs. Reads will be grouped by barcode, and barcodes will be assorted into downstream tasks in order to distribute the workload as evenly as possible. Depending on the plate layout, this can still result in unbalanced workloads. If 0, each well gets its own downstream task. [default: 0]
  --Demux.umilen                      [integer] Length of the UMI portion of the R1 read. For an experiment with well barcodes of length B, the UMI begins at position B+1. [default: 10]
  --Demux.mismatch                    [integer] Maximum number of mismatches (substitutions) allowed in the barcode sequence. If 0, only reads whose barcodes exactly match the predefined barcode sequence for a given well will be assorted to that well. Increasing this figure lets you capture more reads per well at the cost of some added noise. Setting this too high will cause an error. [default: 1]
  --Demux.bcNcountMax                 [integer] Maximum number of base calls in the barcode sequence that are allowed to be N. Defaults to --Demux.mismatch.
  --Demux.minQuality                  [integer] Base call quality threshold for up-front UMI filtering and cDNA base trimming. [default: 20]
  --Demux.minUmiQualBases             [integer] Minimum number of bases in the UMI portion of R1 whose quality scores must meet or exceed --Demux.minQuality. [default: 6]
  --Demux.maxATcontent                [number]  Maximum fraction of bases in the cDNA (R2) read allowed to be A or T. Reads that exceed this are discarded. [default: 0.9]
  --Demux.maxGCcontent                [number]  Maximum fraction of bases in the cDNA (R2) read allowed to be G or C. Reads that exceed this are discarded. [default: 0.9]
  --Demux.minLength                   [integer] Minimum allowed length of the cDNA (R2) read before or after trimming. Reads that do not meet this are discarded. [default: 20]
  --Demux.homoPolymer                 [integer] Define a homopolymer to be a run of at least this many of the same base in the cDNA (R2) read. If one is found, the read is trimmed from the start of the homopolymer to the 3' end of the read. [default: 10]
  --Demux.maxHomoMmatch               [integer] Maximum number of bases in a suspected homopolymer which are allowed to vary from that base. If set to 1 or higher, there can be a significant performance impact. [default: 0]
  --Demux.adapter                     [string]  PCR/sequencing adapter to trim from the 3' end of R2 reads [default: CTGTCTCTTATA]
  --Demux.data_quality_errors         [string]  Governs how the pipeline should handle wells with low quality.
raise (default): Aborts the pipeline on error
warn: Emits a warning to stderr
ignore: Don't run heuristics  (accepted: ignore, warn, raise) [default: raise]
  --Demux.downsample_to               [string]  Determine the threshold number of QC-pass reads above which a well will be downsampled. If a decimal between 0 and 1, the threshold is that fraction of the total QC-pass reads in the experiment. If no value is provided, the threshold will be determined adaptively as the lower of 15% of the total number of reads in the experiment, or the expression `floor(pct75+1.75*IQR)`, where pct75 is the 75th percentile of reads assigned to wells and IQR is the difference between the 25th and 75th percentiles. Default: no downsampling.
  --Demux.downsample_threshold        [string]  Downsample wells to a maximum of N reads. If a decimal between 0 and 1, will downsample to that fraction regardless of original size. If not provided, will choose the threshold given by --Demux.downsample_threshold. Default: no downsampling.
  --Demux.downsample_seed             [integer] Set a seed for the downsampling logic, for the sake of reproducibility. Default: random seed
  --Demux.plate_layout_xoff           [integer]
  --Demux.plate_layout_yoff           [integer]

Options for genome alignment
  --Align.outFilterMultimapNmax       [integer] Discard reads with more than this many valid alignments to the genome [default: 20]
  --Align.outFilterMatchNmin          [integer] Discard alignments with fewer than this many bases matching the reference [default: 20]
  --Align.outFilterMultimapScoreRange [integer] Discard secondary alignments whose scores are more than this far away from the best alignment for that read [default: 1]
  --Align.alignIntronMin              [integer] Minimum number of intron bases to map the read [default: 1]
  --Align.alignIntronMax              [integer] Maximum number of intron bases to map the read [default: 1]
  --Align.clip5pNbases                [integer] Clip this many bases from the 5' end of reads [default: 0]
  --Align.clip3pNbases                [integer] Clip this many bases from the 3' end of reads [default: 0]
  --Align.outFilterMismatchNoverLmax  [number]  Discard alignments with more than this fraction of mismatched bases [default: 0.1]
  --Align.twopassMode                 [string]  Whether to run STAR in single-pass (default) or Basic two-pass mode  (accepted: None, Basic) [default: None]
  --Align.extraOptions                [string]  A quoted string of extra flags to pass to STAR. Use this for options not enumerated above.
  --Align.MEM                         [string]  Amount of memory to reserve per CPU when running alignment processes. Interpreted as a number of bytes, or you can add a memory suffix e.g. 8GB. (def. 20GB; recommend increasing it for larger genomes) [default: 64GB]

Options for feature annotation
  --Count.fracOverlap                 [number]  Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. Number of overlapping bases is counted from both reads if paired end. Both this option and '--minOverlap' option need to be satisfied for read assignment. [default: 0]
  --Count.strandness                  [integer] Denote whether reads are unstranded (0, def.), forward-strand (1) or reverse-strand (2) specific. [default: 0]
  --Count.extraOptions                [string]  A quoted string of extra flags to pass to featureCounts.
  --Count.adjust_effective_lengths    [boolean] If set, when normalizing counts, the gene or transcript length will be adjusted by the average read (fragment) length for each sample as part of the calculation.
  --Count.MEM                         [string]  Amount of memory to reserve per CPU when running alignment processes. Interpreted as a number of bytes, or you can add a memory suffix e.g. 8GB. (def. 8GB) [default: 8GB]

Options for duplicate detection
  --Dedup.umi_mismatch                [integer] Maximum edit distance between two UMIs to consider them duplicates. [default: 2]
  --Dedup.umiDist                     [integer] Search for duplicates within a sliding genomic window of this many bases. [default: 10]
  --Dedup.ambiguous                   [string]  Comma-separated list of strategies for handling multimapping reads. Accepted values are:
                                                  Unique: Only count reads assigned to a single gene.
                                                  Uniform: Distribute each multimapping read's count uniformly across all genes they map to.
                                                  PropUnique: Attempt to distribute multimapping reads proportional to the number of reads mapping uniquely to each gene. Falls back on Uniform.
                                                  EM: Computes the maximum likelihood distribution of multimapping reads using an expectation-maximization algorithm.
                                                  Rescue: Equivalent to running a single iteration of EM. [default: Unique]
  --Dedup.em_iter_max                 [integer] DRAGoN only: If --Dedup.ambiguous contains EM, the maximum number of iterations of EM [default: 100]
  --Dedup.em_max_diff                 [number]  DRAGoN only: If --Dedup.ambiguous contains EM, the tolerance for convergence, measured by the largest absolute change in counts for any gene in the current well [default: 0.01]
  --Dedup.em_small_thresh             [number]  DRAGoN only: If --Dedup.ambiguous contains EM, clamp counts smaller than this to 0 [default: 0.01]
  --Dedup.debug                       [boolean]

uncategorized_options
  --private.versionGenome             [string]  [default: 2.7.4a]
  --private.overhang                  [integer] [default: 89]
  --awsbatch_queue                    [string]  awsbatch queue to submit jobs to
  --awsbatch_jobrole                  [string]  IAM role to use for awsbatch worker instances
  --errorStrategy                     [string]  If a worker hits an error after two attempts, the workflow should either
    continue to "retry",
    allow all running tasks to "finish" before exiting (default), or
    "terminate" the workflow immediately.  (accepted: finish, terminate, retry) [default: finish]
  --maxRetries                        [integer] If --errorStrategy retry, the maximum number of times the task will be retried before the workflow terminates. [default: 3]
  --MINMEM                            [string]  Minimum amount of memory to reserve for any given job. Applies globally. [default: 2GB]
  --MAXMEM                            [string]  Maximum amount of memory to reserve for any given job. Applies globally. [default: 4GB]
  --MAXCPU                            [integer] Maximum number of CPUs to reserve for any given job. Applies globally. [default: 2]
  --PLATE_LAYOUT                      [string]  For MergeDemuxJSON, a CSV file representing the plate well barcodes in a grid layout. [default: assets/layout_384.csv]
  --notifyEmails                      [string]  Comma-separated list of emails to notify on workflow completion.



Parameters description

Parameters

Users have the option of using a YAML file to pass parameters to this workflow. A template is provided for your convenience and for a brief description of available parameters. You may copy this file to a new path, add the missing reuqired fields, edit optional parameters as appropriate, and submit it to nextflow using the option -params-file path/to/params.yml. Options are arranged in a categorical hierarchy, with categories IO for file paths, Demux for demultiplexing and QC options, Align for STAR options, Count for featureCounts options, and Dedup for deduplication and final counting options. These same parameters can be passed from the commandline i.e. --Demux.umilen 12 --IO.star path/to/genome-idxs would be equivalent to setting in your YAML file

Demux:
  umilen: 12
IO:
  star: path/to/genome-idxs



Accessing resources on AWS

Depending on your runtime environment and S3/batch policies, you may need to specify credentials. The pipeline can discover AWS credentials using any combination of environment variables, (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_PROFILE, AWS_DEFAULT_PROFILE, AWS_REGION, AWS_DEFAULT_REGION), configuration profiles (~/.aws/config, ~/.aws/credentials), or your EC2 headnode configuration. For more info, click here.



Downsampling and wells masking

If the pipeline detects wells with excessive read counts, it will by default raise an error. You can reduce these to warnings and, optionally, invoke downsampling logic. By default, downsampling is disabled, however you can control the behavior using two flags and an optional metadata column.

--Demux.downsample_threshold

This flag controls the number of QC-passing reads a well must have in order for downsampling to apply. The behavior is as such:

  • If not provided, downsampling is disabled.
  • If provided without a value, the threshold is determined by taking the lower of 15% of the total QC-pass reads in the experiment or an outlier threshold determined by the formula Q3 + 1.75 * (Q3 - Q1) where Q1 and Q3 are the first and third quartiles of the per-well QC-pass read counts, respectively.
  • If provided with a fraction between 0 and 1, the threshold is set at that fraction of the total QC-pass reads in the experiment.
  • If provided with a whole number greater than 1, the threshold is set at that number of reads.

--Demux.downsample_to

This flag controls the target number of reads to downsample a well to if it is flagged for downsampling per the above logic. The behavior is as such:

  • If not provided, downsampling is disabled.
  • If provided without a value, the target number of reads is set to equal the threshold.
  • If provided with a fraction between 0 and 1, the target number of reads is set per well to that percent of its QC-pass reads.
  • If provided with a whole number greater than 1, the target number of reads is set to that number.

Note

If either of these two options is set but not the other, the other option will be behave as though it was provided without a value.

Metadata column "MaskWells"

This is an optional column in the metadata sheet. The intention is for this to be a boolean value. Set this to true to discard that well from downstream consideration. The well will still be processed through FastQC and initial QC, but will be discarded at the point of downsampling.



Output

All pipeline outputs will be placed in a new directory or S3 prefix defined by --IO.outdir. For each plate in the metadata excel, a separate subdirectory is created for each sample, and the name of this directory is taken from the bcl2fastq samplesheet if provided or from the names of the Excel tabs. If passing sample info via --IO.barcodes and --IO.sampleName, the latter argument is used for the name of the sole subdirectory of --IO.outdir.

STARsolo Output

The output of STARsolo.nf follows.

  • counts.tsv - Tab-separated file containing counts of genes in wells. Rows are features (genes) and columns are barcodes (wells).
  • tpm.tsv, fpkm.tsv - Tab-separated files containing normalized counts of genes in wells, with the same shape as counts.tsv.
  • demux.json - JSON summary of demultiplexing stats with the following keys:
    • read_count: The total number of reads ingested.
    • well_counts: A list of per-barcode read counts. The order of this list is the same as the user-supplied barcode whitelist. An extra element at the end counts the number of reads discarded due to no matching barcode.
    • written_counts: A list of read counts per intermediate task. The length of this list is determined by --Demux.split plus an extra element for the noBC output.
    • mismatch_counts: A list of mappings from well name to the number of reads whose barcode was matched to that well with a given number of substitutions. The first element of this list counts the reads that were matched perfectly, the second those with one mismatch, and so on up to nmismatch.
    • fail counts: A mapping from QC fail reason (or PASS) to mappings from well name to the number of reads in that well passing or failing QC for the stated reason.
    • unmatched_counts: Observed frequencies of barcodes that failed to match one of the given well barcodes.
    • plate_map: 2d nested array representing the number of reads that are mapped to each well whether or not it has been assigned a sample ID. If the workflow complains about too many reads being discarded, you can use this as a diagnostic resource.
  • Aligned.sortedByCoord.out.bam - The sorted BAM file, merged from all the STAR runs, sorted by barcode index (read group) and genomic coordinate.
  • Aligned.sortedByCoord.out.flagstat.json - The result of running samtools flagstat -O json Aligned.sortedByCoord.out.bam. See the samtools documentation for details.
  • STARsoloReport.txt - Click here for details on this file.
  • Solo.out - A consolidation of the STARsolo reports from across the mapping tasks. Output is similar to that of a single run of STARsolo, except the three stats/summary tables (Solo.out/Barcodes.stats, Solo.out/Gene/Features.stats, and Solo.out/Gene/Summary.csv) are tables with each column representing a single STARsolo task. Column order is arbitrary. Additionally, contains subdirectory Solo.out/Gene/raw which includes three or more files:
    • features.tsv: A tab-separated file with 3 columns describing each annotated gene. Column 1 is the ENSEMBL gene ID (ENSG##########). Column 2 is the HGNC gene symbol (PARP2L, ACTB, etc.). Column 3 is the feature type (by default, this is "Gene Expression").
    • barcodes.tsv: A text file with one barcode sequence per line, copied from the first column of --IO.barcodes.
    • matrix.mtx: A MatrixMarket sparse matrix file where each entry is the number of uniquely-mapped, uniquely-assigned UMIs per gene per well.
    • UniqueAndMult-*.mtx: Depending on the argument to --Dedup.ambiguous, this file is a MatrixMarket sparse matrix where each entry is the number of UMIs per gene per well after correcting for multimapping sites using the user-selected strategy. The previously-mentioned counts.tsv will then be a dense version of this.
  • DRUGseq_Merged.bam: This BAM file contains all the reads in the experiment, sorted by barcode. The tags NH HI nM AS CR UR CY UY CB UB GX GN sS sQ sM NM MD jM jI areq as described in the STAR manual, section 4.2.2. The following tags are custom to this workflow:
    • BM: Number of mismatches (substitutions) between CR and CB
    • QC: The reason why the read was discarded in the initial filtering step
    • BI: Barcode index or "noBC", used for properly splitting the job after the STAR run (DRAGoN workflow only)

Note: This workflow generates a large number of intermediate files. For a 1.7B read pair experiment with 384 barcodes and 1M read chunks for demultiplexing, a total of over 650k SAM files will be generated. You should periodically clean up temporary files from old nextflow runs using nextflow clean -f.



STARsoloReport.txt

This is a TSV file with per-wellread counts summarizing the demultiplexing output. Column definition is as such:

  • TOTAL_IN - Total number of reads assigned to this well.
  • PCT_LOSS_QC - Percent of reads lost to initial QC failure.
  • PCT_LOSS_DOWNSAMPLING - Percent of reads lost to downsampling. If no downsampling occurs, this will be all 0s.
  • PCT_LOSS_UNMAPPED - Percent of reads surviving initial QC that fail map to the supplied reference genome.
  • PCT_LOSS_UMI - Percent of reads in genes that are removed as duplicates.
  • FINAL_UMI - Final number of UMIs, including multimappers.
Initial QC
  • QC_PASS - Number of reads passing all QC filter criteria.
  • QC_FAIL_UMI_QUALITY - Number of reads for which too few bases in the UMI sequence pass the configured quality threshold (default: <6 bases with quality score >= 20).
  • QC_FAIL_AT_CONTENT - Number of reads for which the cDNA sequence had too many bases that were A or T (default: >90% of base calls).
  • QC_FAIL_GC_CONTENT - Number of reads for which the cDNA sequence had too many bases that were G or C (default: >90% of base calls).
  • QC_FAIL_ADAPTER_TRIMMING - Number of reads for which the cDNA sequence became too short (default:< 20 bases) after trimming sequencing adapters (single sequence specified by --Demux.adapter).
  • QC_FAIL_QUALITY_TRIMMING - Number of reads for which the cDNA sequence became too short (default: <20 bases) after trimming homopolymer artifacts and low-quality bases.
  • QC_FAIL_LENGTH - Number of reads for which the cDNA sequence was too short before trimming (default: <20 bases).
Barcode matching
  • MATCH_EXACT - Number of reads (both QC-PASS and QC-FAIL) whose barcode sequence exactly matches the given well barcode.
  • MATCH_1MM, MATCH_2MM, etc. - Number of reads (both QC-PASS and QC-FAIL) whose barcode sequence matches the given well barcode with 1, 2, etc. base substitutions.
    • There will be as many such columns as the value of --Demux.mismatch.
Downsampling
  • DOWNSAMPLE_QC_PASS_READS - Number of reads going into downsampling, should be the same as QC_PASS.
  • DOWNSAMPLE_LIMIT - Threshold for downsampling, should be the same for the whole plate.
  • DOWNSAMPLE_FRACTION - Downsample-to fraction. If no downsampling happens, this will be 1.0. If the well is discarded, it will be 0.
Mapping and annotation
  • UNMAP_COUNT - Number of reads that were discarded because they failed to map to the genome reference.
  • SOLO_UNIQUE - Number of uniquely-assigned UMIs.
  • SOLO_UNIFORM, SOLO_PROPUNIQUE, SOLO_EM, SOLO_RESCUE - (optional) Number of unique and multimapped UMIs assigned according to the titular strategy.

DRAGoN

The DRAGoN.nf output includes the same counts.tsv matrix, flagstat.json and demux.json as from STARsolo, plus additional tables reporting stats at each intermediate step.

  • DRAGoNreport.txt: A tab-separated file reporting QC statistics from across the execution. Concatenates portions of the demultiplexing and deduplication reports.
  • Directory DRAGoN.out mimics the sparse matrix structure of Solo.out/Gene/raw.
  • featureCounts.summary is a tab-separated file with summary statistics for the assignment of reads to genes. The first column names the statistic, and the second column gives its value. This is aggregated across the entire experiment. See the featureCounts documentation for a description of this file.
  • dedup_stats.tsv is a tab-separated file with summary statistics for post-hoc filtering and UMI counts. This is entirely redundant to DRAGoNreport.txt and will be removed in a future update.
  • DRUGseq_Merged.bam: Same as from STARsolo, but the following additional custom tags are set:
    • MX: Optional: The coordinates of all alternate alignments in chr:start-end format, separated by semicolons
    • XS: featureCounts assignment result ("Assigned" or "Unassigned_{reason}") for all alignments (primary first, then secondaries), separated by semicolons.
    • XT: Optional: featureCounts gene assignment for all alignments (primary first, then secondaries), separated by semicolons.
    • XN: Optional: Number of genes to which this read was assigned, taking into account all alignments. This may not equal the length of tag XT i.e. in the case where multiple alignments of the same read overlap the same gene. If missing, the read did not overlap any genes.
    • DN: Optional: If the read is marked as a duplicate, the name of the original UMI.



DRAGoNreport.txt

This is a tab-separated file reporting QC and quantification summary statistics over the DRAGoN execution. Each row is a sample well, with an extra row at the end describing reads not assigning to any well. Columns are the same as in STARsoloReport.txt plus the following:

Index
  • barcode - Name of the well. "noBC" refers to reads not assigning to any well.
Filtering summary
  • TOTAL_IN - Total number of reads assigned to this well.
  • PCT_LOSS_QC - Percent of reads lost to initial QC failure.
  • PCT_LOSS_DOWNSAMPLING - Percent of reads lost to downsampling. If no downsampling occurs, this will be all 0s.
  • PCT_LOSS_UNMAPPED - Percent of reads surviving initial QC that fail map to the supplied reference genome.
  • PCT_LOSS_UNASSIGNED - Percent of reads that map to the genome that do not overlap any annotated gene.
  • PCT_LOSS_UMI - Percent of reads in genes that are removed as duplicates.
  • FINAL_UMI - Final number of UMIs, including multimappers.
Initial QC
  • QC_PASS - Number of reads passing all QC filter criteria.
  • QC_FAIL_UMI_QUALITY - Number of reads for which too few bases in the UMI sequence pass the configured quality threshold (default: <6 bases with quality score >= 20).
  • QC_FAIL_AT_CONTENT - Number of reads for which the cDNA sequence had too many bases that were A or T (default: >90% of base calls).
  • QC_FAIL_GC_CONTENT - Number of reads for which the cDNA sequence had too many bases that were G or C (default: >90% of base calls).
  • QC_FAIL_ADAPTER_TRIMMING - Number of reads for which the cDNA sequence became too short (default:< 20 bases) after trimming sequencing adapters (single sequence specified by --Demux.adapter).
  • QC_FAIL_QUALITY_TRIMMING - Number of reads for which the cDNA sequence became too short (default: <20 bases) after trimming homopolymer artifacts and low-quality bases.
  • QC_FAIL_LENGTH - Number of reads for which the cDNA sequence was too short before trimming (default: <20 bases).
Barcode matching
  • MATCH_EXACT - Number of reads (both QC-PASS and QC-FAIL) whose barcode sequence exactly matches the given well barcode.
  • MATCH_1MM, MATCH_2MM, etc. - Number of reads (both QC-PASS and QC-FAIL) whose barcode sequence matches the given well barcode with 1, 2, etc. base substitutions.
    • There will be as many such columns as the value of --Demux.mismatch.
Downsampling
  • DOWNSAMPLE_QC_PASS_READS - Number of reads going into downsampling, should be the same as QC_PASS.
  • DOWNSAMPLE_LIMIT - Threshold for downsampling, should be the same for the whole plate.
  • DOWNSAMPLE_FRACTION - Downsample-to fraction. If no downsampling happens, this will be 1.0. If the well is discarded, it will be 0.
Mapping and annotation
  • UNMAP_COUNT - Number of reads that were discarded because they failed to map to the genome reference.
  • DEDUP_TOTAL - Total number of reads assigning to this well.
  • DEDUP_DISCARD_QCFAIL - Number of reads that were discarded due to a QC fail status.
  • DEDUP_DISCARD_UNMAPPED - Number of reads that were discarded because they do not map to any genomic site.
  • DEDUP_DISCARD_NOFEATURE - Number of reads that were discarded because they do not overlap any gene feature.
  • DEDUP_PASSED_QC - Final number of reads passing all QC, alignment, and annotation filters, including multimappers.
  • DEDUP_N_MULTIMAPPED - Number of reads mapping to multiple genomic sites.
Deduplication and disambiguation
  • DEDUP_UMI_COUNTS - Final number of UMIs, including multimappers.
  • DEDUP_UNIQUE_COUNTS - Final number of QC-pass reads mapping unambiguously to a single site/gene.
  • DEDUP_UNIQUE_UMIS = Final number of uniquely-mapping UMIs.



References

[1] Ye C, Ho DJ, Neri M, Yang C, Kulkarni T, Randhawa R, Henault M, Mostacci N, Farmer P, Renner S, Ihry R, Mansur L, Keller CG, McAllister G, Hild M, Jenkins J, Kaykas A. DRUG-seq for miniaturized high-throughput transcriptome profiling in drug discovery. Nat Commun. 2018 Oct 17;9(1):4307. doi: 10.1038/s41467-018-06500-x.

[2] Navarro JF, Sjöstrand J, Salmén F, Lundeberg J, Ståhl PL. ST Pipeline: an automated pipeline for spatial mapping of unique transcripts. Bioinformatics. 2017 Aug 15;33(16):2591-2593. doi: 10.1093/bioinformatics/btx211.

[3] Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25.

[4] Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. zUMIs - A fast and flexible pipeline to process RNA sequencing data with UMIs. Gigascience. 2018 Jun 1;7(6):giy059. doi: 10.1093/gigascience/giy059. PMID: 29846586; PMCID: PMC6007394.

[5] Benjamin Kaminow, Dinar Yunusov, Alexander Dobin. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. bioRxiv 2021.05.05.442755; doi: 10.1101/2021.05.05.442755

About

DRUG-seq (Digital RNA with pertUrbation of Genes) [1] is a new technique that provides a cost-effective way for high-throughput, whole transcriptome RNA-seq profiling.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published