paper: incorporate reviewer comments
daylinmorgan committed Sep 30, 2024
1 parent 26fe851 commit bcff1f8
Showing 6 changed files with 76 additions and 23 deletions.
4 changes: 0 additions & 4 deletions .gitignore
@@ -177,7 +177,3 @@ tmp/
!/tests/data/reference
# pixi environments
.pixi


paper/*.jats
paper/*.pdf
4 changes: 4 additions & 0 deletions paper/.gitignore
@@ -0,0 +1,4 @@
paper.pdf
paper.jats
jats/

4 changes: 3 additions & 1 deletion paper/Makefile
@@ -1,7 +1,9 @@
paper.pdf: paper.md paper.bib
paper.pdf: paper.md paper.bib workflow.png
docker run --rm \
--volume $$PWD/:/data \
--user $$(id -u):$$(id -g) \
--env JOURNAL=joss \
openjournals/inara

workflow.png: workflow.dot
dot -Tpng -s300 -o$@ $<
53 changes: 35 additions & 18 deletions paper/paper.md
@@ -22,64 +22,79 @@ bibliography: paper.bib

# Summary

`Pycashier` is a tool designed to extract DNA barcode tags from next generation sequencing data.
It was originally developed for use with the
ClonMapper Barcoding System [@alkhafaji2018;@gardner2024;@gardner2022].
`Pycashier` is a tool designed to extract cellular DNA barcode tags from next generation sequencing data.
These DNA barcode tags are heritable and stably integrated genetic markers useful for clonal tracking (CITE) (clontracr?)
and lineage tracing (development) (gestalt) within *in vitro* and *in vivo* cell-based disease models.
These exogenous cell-based DNA barcodes, when amplified from genomic DNA and sequenced,
can be used as a proxy for assessing changes in clonal abundance and better understanding population dynamics.
`Pycashier` was originally developed for use with the
ClonMapper Barcoding System [@alkhafaji2018;@gardner2024;@gardner2022],
which comprises random 20-nucleotide barcodes integrated as both a functional gRNA and an expressed transcript.
`Pycashier` has been previously utilized to interrogate
tumor heterogeneity in barcoded cancer cell-line models [@gutierrez2021;@johnson2020],
however, it is generalizable to similar DNA barcoding systems with known flanking regions.

however, it is generalizable to similar DNA barcoding systems with known flanking regions and expected length.

# Statement of need

DNA sequencing and DNA barcoding specifically, have become
DNA sequencing, and cellular DNA barcoding specifically, have become
more common as a modality for the characterization of clonal and lineage-specific subpopulations of cells.
As researchers leverage these technologies, they'll require tools that are easy to set up
and use to facilitate downstream biological analysis.
DNA barcode sequencing suffers from several sources of noise that must be accounted for prior to statistical analysis.
This noise can arise in typical Polymerase Chain Reaction (PCR) preparation [@kebschull2015;@potapov2017] or during read-out [@manley2016].
Historically, the analysis of DNA barcoding has relied on tailored computational workflows,
Historically, the analysis of cellular DNA barcoding has relied on tailored computational workflows,
such as TimeMachine [@emert2021], that are difficult to parameterize
or extend to similarly designed DNA barcoding systems.
or extend to similarly designed cellular DNA barcoding systems.
Recently, there has been the development of several NextFlow-based techniques,
such as BARtab [@holze2024] and NextClone [@putri2023],
which offer improved support for generalized approaches to processing barcode sequencing data.
However, they require familiarity with NextFlow, which may be uncommon for experimentalists.
`Pycashier` aims to be simple to install and generalizable enough to be useful to the broader community
while also providing a user friendly interface.
In particular, BARtab and its associated post-processing library, bartools, offer an end-to-end toolkit for barcoding analysis.
As a more feature-complete end-to-end toolkit, BARtab differs from `pycashier` by including support for spatial transcriptomics data and reference-based processing of barcodes.
Additionally, being based on NextFlow offers some advantages to these tools, including sample-level parameterization and tighter control of system resources.
However, experience using NextFlow may be uncommon for experimentalists.
`Pycashier` aims to be simple to install and generalizable enough to be useful to the broader community while also providing a user-friendly interface.

# Implementation and Usage

`Pycashier` was intentionally designed to be simple-to-use for both computational and experimental biologists.
It accomplishes this by leveraging purpose-built software for an opinionated DNA barcode processing pipeline.
`Pycashier` has a command-line interface (CLI) implemented in `python`.
Users have the option of installing `pycashier` with `pip`, `conda/mamba` (from conda-forge),
Users have the option of installing `pycashier` with `pip`, `conda`/`mamba`/`pixi` (from conda-forge),
or as a standalone `Docker` image which includes all necessary runtime dependencies for maximum reproducibility.
`Pycashier` maintains outputs and logs of all steps for simple debugging and reuse across a project.
The `pycashier` CLI has four subcommands to facilitate processing of DNA barcode sequencing data.
Users can specify parameters either through command-line flags or through a `toml` file.
The `pycashier` CLI has four subcommands to facilitate processing of DNA barcode sequencing data,
`extract`, `merge`, `scrna` and `receipt` (Figure 1).
Users can specify parameters either through command-line flags or through a configuration `toml` file.

![pycashier workflow](./workflow.png)

`Pycashier` is primarily used for generating counts of individual barcode sequences
from targeted PCR amplifications of DNA-barcoded cells.
`Pycashier` extracts these DNA barcode sequences without the use of any pre-defined list.
This makes it amenable to systems in which sequences are not known ahead of time
or are randomly generated, as in the case of ClonMapper.
This is done with `pycashier extract`, which accepts a directory of
`fastq` files directly from Illumina sequencing and generates
a `tsv` of individual barcodes and counts for each input `fastq`.
These sequences should be of an expected length (specified with `--length`, by default 20),
and flanked by *known regions* which are detectable in sequencing reads.
These flanking regions can be specified either using CLI flags (as `--upstream-adapter`/`--downstream-adapter`)
or within a user provided configuration file.
First, filtering is performed with `fastp` [@chen2018] to remove low quality sequencing reads.
Next, flanking sequences are used to extract a region of interest with `cutadapt` [@martin2011].
Then, the list of identified sequences is corrected for errors introduced in either preparation
or sequencing using message passing clustering powered by `starcode` [@zorita2015].
Finally, minimal count filtering is applied to remove any remaining noise from sequencing.
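
The extraction, error-correction, and count-filtering steps above can be illustrated with a minimal, pure-Python sketch. This is not how `pycashier` is implemented (it delegates to `fastp`, `cutadapt`, and `starcode`); the function names, the flanking sequences used in any call, and the simple Hamming-distance merge standing in for `starcode`'s clustering are all assumptions made for illustration. Quality filtering is omitted.

```python
import re
from collections import Counter


def extract_barcodes(reads, upstream, downstream, length=20):
    """Return the fixed-length sequence found between the two flanks of each read."""
    pattern = re.compile(
        f"{re.escape(upstream)}([ACGT]{{{length}}}){re.escape(downstream)}"
    )
    found = []
    for read in reads:
        match = pattern.search(read)
        if match:  # reads lacking either flank are discarded
            found.append(match.group(1))
    return found


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_errors(counts, max_distance=1):
    """Fold rare sequences into a more abundant sequence within max_distance.

    A simplified stand-in for starcode's clustering, for illustration only.
    """
    merged = Counter()
    # Visit sequences from most to least abundant so likely-true barcodes
    # are accepted first and near-identical rare sequences merge into them.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((c for c in merged if hamming(c, seq) <= max_distance), None)
        merged[parent if parent is not None else seq] += n
    return merged


def filter_counts(counts, min_count=2):
    """Drop sequences observed fewer than min_count times."""
    return {seq: n for seq, n in counts.items() if n >= min_count}
```

Chaining the three functions, `filter_counts(collapse_errors(Counter(extract_barcodes(...))))`, yields a barcode-to-count mapping comparable in spirit to the per-sample `tsv` described above.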

In addition to barcode extraction from targeted sequencing, `pycashier` facilitates
barcode extraction from single-cell RNA-sequencing (scRNA-seq).
barcode extraction from single-cell RNA-sequencing (scRNA-seq)
in which cellular DNA barcodes are expressed as poly-adenylated transcripts.
Specifically, it's compatible with data generated with the 10X Genomics 3' based single cell gene expression kit.
In this case, the command `pycashier scrna` accepts `sam` files derived from processed
In this case, the command `pycashier scrna` accepts `sam` files[^1] derived from processed
`10X` data and generates a `tsv` with cell/UMI resolved barcode sequences,
which can then be mapped directly back to the transcriptome of individual cells.
To accomplish this, `pycashier` first extracts cell/UMI and
read sequences from `sam` files using `pysam` [@pysam2024].
Next, sequences are individually extracted from `fastq` files again,
using `cutadapt` with known flanking sequences.
using `cutadapt` with known flanking sequences (see Figure 1).
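
As a rough illustration of the first step, the sketch below pulls cell barcode, UMI, and read sequence out of unmapped records in plain-text `sam` lines. `pycashier` itself uses `pysam` for this; the hand-rolled parser and the function name here are hypothetical, and the sketch assumes the input carries the `CB` (cell barcode) and `UB` (UMI) tags that Cell Ranger writes.

```python
def parse_unmapped_sam(lines):
    """Yield (cell_barcode, umi, read_sequence) for unmapped records with CB and UB tags."""
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        flag = int(fields[1])
        if not flag & 0x4:  # bit 0x4 marks an unmapped read
            continue
        tags = {}
        for field in fields[11:]:  # optional fields look like TAG:TYPE:VALUE
            tag, _type, value = field.split(":", 2)
            tags[tag] = value
        if "CB" in tags and "UB" in tags:
            yield tags["CB"], tags["UB"], fields[9]  # column 10 is the read sequence
```

The resulting read sequences could then be searched for the barcode between its known flanking sequences, keeping the cell/UMI pair alongside each extracted barcode.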

`Pycashier` provides two additional convenience commands:
`merge`, to generate single read consensus sequences
@@ -90,6 +105,8 @@ along with calculating some basic metrics across samples.
Documentation and further usage instructions for `pycashier`
can be found at [docs.brocklab.com/pycashier](https://docs.brocklab.com/pycashier).

[^1]: CellRanger `bam` files can be converted to a `sam` file containing only unmapped reads using `samtools view -f 4`.

# Acknowledgments

We acknowledge Aziz Al'Khafaji for the development of the
34 changes: 34 additions & 0 deletions paper/workflow.dot
@@ -0,0 +1,34 @@
digraph G {
rankdir="LR"
// files
{
node [shape= rect];
fastqs [label = "fastqs" ];
bam [label = "10X bam" ];
sam [label = "unmapped 10X sam"];
{
node [ color = "#005f86"];
counts [label = "barcode counts" ];
combined_counts [label = "combined counts" ];
sc_counts [label = "single cell counts"];
}
}

// commands
samtools [label = "samtools view -f 4";]
{
node [ color = "#bf5700"];
extract [label = "pycashier extract"]
merge [label = "pycashier merge" ]
scrna [label = "pycashier scrna" ]
receipt [label = "pycashier receipt"]
}

// dag
fastqs -> extract [label = "single-end"];
fastqs -> merge [label = "paired-end"];
merge -> extract;
extract -> counts;
counts -> receipt -> combined_counts;
bam -> samtools -> sam -> scrna -> sc_counts;
}
Binary file added paper/workflow.png