We have an S3 bucket called `10x-pipeline`. At the top level, there are three keys: `reference_transcriptome`, `software`, and `oligo_sequences`. For example:
```
10x-pipeline/
    oligo_sequences/
        adt_hto_bc_sequences.csv
        citeseq_sample_indices.csv
    reference_transcriptome/
        GRCh38/
            refdata-cellranger-GRCh38-1.2.0.tar.gz
            refdata-cellranger-GRCh38-3.0.0.tar.gz
        hg19/
            refdata-cellranger-hg19-1.2.0.tar.gz
            refdata-cellranger-hg19-3.0.0.tar.gz
        mm10/
            refdata-cellranger-mm10-1.2.0.tar.gz
            refdata-cellranger-mm10-3.0.0.tar.gz
        vdj/
            refdata-cellranger-vdj-2.0.0.tar.gz
    software/
        bcl2fastq/
            bcl2fastq2-v2.20.0-linux-x86-64.zip
        cellranger/
            cellranger-2.2.0.tar.gz
            cellranger-3.0.2.tar.gz
```
The `oligo_sequences` files are the source of truth for up-to-date oligo sequence CSVs: new oligo sequences must be added to these files before they can be used with the cellranger pipeline. These files include sample index and ADT/HTO oligo sequences.
Under `reference_transcriptome`, the keys are the names of reference transcriptomes, each of which contains `tar.gz` files for different versions of that reference transcriptome. The `tar.gz` files are named consistently, so obtaining a particular version of a reference transcriptome is just a matter of constructing the corresponding key.
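For example, a minimal sketch of downloading a reference by genome and version with the AWS CLI, assuming the bucket layout shown above (the variable names here are illustrative):

```bash
# Build the key from the genome name and version, then download the archive.
GENOME=GRCh38
VERSION=3.0.0
aws s3 cp \
  "s3://10x-pipeline/reference_transcriptome/${GENOME}/refdata-cellranger-${GENOME}-${VERSION}.tar.gz" .
```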
Many of the ref-data files can be downloaded from the 10x Genomics site here. Ultimately, when decompressed, you'll be left with a directory containing the following top-level items:
```
fasta/
genes/
pickle/
star/
README.BEFORE.MODIFYING
reference.json
version
```
Our executables expect unarchived reference transcriptomes to have this directory structure.
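As a rough sketch, unpacking one of the mirrored archives and checking the layout might look like this (the extracted directory name is assumed to match the archive's basename):

```bash
tar -xzf refdata-cellranger-GRCh38-3.0.0.tar.gz
ls refdata-cellranger-GRCh38-3.0.0/
# expect: fasta/  genes/  pickle/  star/  README.BEFORE.MODIFYING  reference.json  version
```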
The software can be downloaded from 10x Genomics and Illumina, respectively. We mirror these files in our S3 bucket to provide a consistent and reliable way of downloading them for use in our automation.
We have an S3 bucket called `10x-data-backup`. The cellranger pipeline reads raw data from this bucket and exports processed data and analysis results back to it. The pipeline consumes raw data on a per-experiment basis by downloading the files under `10x-data-backup/<experiment_name>/raw_data/`.
We sometimes start with fastqs instead of bcls, in which case:
- we don't need to run a processing step;
- the fastq files must be uploaded manually according to our naming convention.
Our experiment directories use the following naming convention: `{run id}_{himc pool #}_{sequencing date}`. If the experiment has pooled runs, we use the same naming convention, with a `-` separating runs. For example: `run406_himc40_071018-run407_himc40_071118`.
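As a sketch of how that convention is used in practice, pulling an experiment's raw data might look like this (the experiment name is the example above; the destination path is illustrative):

```bash
EXPERIMENT=run406_himc40_071018-run407_himc40_071118
aws s3 cp "s3://10x-data-backup/${EXPERIMENT}/raw_data/" ./raw_data/ --recursive
```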
If we're starting with fastq files, they must be stored under the `experiment_name/sample_id/fastqs/run_id/` folder (see the upload sketch below).
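A minimal sketch of uploading fastqs to that prefix with the AWS CLI (the experiment, sample, and run names here are hypothetical):

```bash
EXPERIMENT=run406_himc40_071018
SAMPLE=sample_id_A
RUN=run406
aws s3 cp ./fastqs/ "s3://10x-data-backup/${EXPERIMENT}/${SAMPLE}/fastqs/${RUN}/" \
  --recursive --exclude "*" --include "*.fastq.gz"
```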
Note that we do not have any naming conventions around bcl archives.
For example:
```
10x-data-backup/
    experiment_name/
        raw_data/
            bcl_filename_run1.tar.gz
            bcl_filename_run2.tar.gz
        sample_id_A/
            fastqs/
                run1/
                    ...fastq.gz
                run2/
                    ...fastq.gz
            cellranger_count_output/
                reference_transcriptome/
                    outs/
                        ...
                ...
        sample_id_B/
            fastqs/
                run1/
                    ...fastq.gz
                run2/
                    ...fastq.gz
            cellranger_count_output/
                reference_transcriptome/
                    outs/
                        ...
                ...
        sample_id_A-A/
            fastqs/
                run1/
                    ...fastq.gz
                run2/
                    ...fastq.gz
        sample_id_A-H/
            fastqs/
                run1/
                    ...fastq.gz
                run2/
                    ...fastq.gz
        fastqs_metadata/
            citeseq/
                run1/
                    Stats/
                    Reports/
                run2/
                    Stats/
                    Reports/
            gex/
                run1/
                    Stats/
                    Reports/
                run2/
                    Stats/
                    Reports/
```
The fastqs metadata is stored on a per-run basis. It includes information that may be helpful for troubleshooting. For example, the `DemuxSummary` files under `Stats` can be used to troubleshoot the demultiplexing step when we appear to have provided an inaccurate sample index. The `Most Popular Unknown Index Sequences` section of those files contains barcode sequences that correspond to the 10x sample index sequences found here.
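As a rough sketch, a run's unknown-index report can be pulled out directly (the `DemuxSummary` file name pattern may differ depending on the bcl2fastq version and run):

```bash
grep -A 20 "Most Popular Unknown Index Sequences" \
  fastqs_metadata/gex/run1/Stats/DemuxSummary*.txt
```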
The pipeline runs as code in containers on AWS. Our `cellranger-$CELLRANGER_VERSION-bcl2fastq-$BCL2FASTQ_VERSION` images contain those tools (at the versions named) and the executables in this repo's top-level `bin` directory. These images are run by AWS Batch.
The `Makefile` makes Docker builds easier; its commands are short, easy to remember, and consistent.
In production, we pin the version of the code being used; namely, we enforce that the pipeline runs the code in this repo as of a particular commit (for example, `abc123`). This is helpful because we can tell, just by glancing at the AWS Batch console, what code is running in our production pipeline; if we always used `latest`, it would be harder to determine which version of the code is being run. For development, `latest` is useful for keeping a tighter feedback loop.
The `Dockerfile` is parameterized with `CELLRANGER_VERSION` and `BCL2FASTQ_VERSION`. This makes sense for us because we need different images that contain different versions of those tools, and everything else about the images should be the same. This is DRYer than having two nearly identical `Dockerfile`s.
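A sketch of how such a build might be invoked (the image tag follows the naming scheme above; the build args are assumed to match the `ARG`s declared in the `Dockerfile`):

```bash
docker build \
  --build-arg CELLRANGER_VERSION=3.0.2 \
  --build-arg BCL2FASTQ_VERSION=2.20.0 \
  -t cellranger-3.0.2-bcl2fastq-2.20.0:latest .
```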
The AWS ECR repository enables us to distribute our images; namely, when we push our images to ECR, AWS Batch can then pull them and use them. Images can be pushed to ECR by running `make push` from the root of this repo.
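For reference, a push like the one `make push` wraps typically amounts to something like the following (the account ID, region, and repository name here are placeholders):

```bash
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag cellranger-3.0.2-bcl2fastq-2.20.0:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/cellranger-3.0.2-bcl2fastq-2.20.0:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/cellranger-3.0.2-bcl2fastq-2.20.0:latest
```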
We use an ECS-optimized AMI with 1TB of storage at `/docker_scratch`. See the AMI readme file for details on how we built the AMI.
AWS Batch runs our containers with an IAM role, `cellranger-pipeline`, that allows the container to interact with S3. That role provides:
- read and write access to the `10x-data-backup` S3 bucket (via the `s3-10x-data-backup-read-write` IAM policy)
- read access to the `10x-pipeline` S3 bucket (via the `s3-10x-pipeline-read` IAM policy)
There are also some default IAM roles that we assign to AWS Batch resources, for example the role that allows AWS Batch to pull our images from ECR.
These jobs run our Docker images with 124GB of memory and 16 CPUs, and make use of the 1TB of disk provided by the AMI.
The compute environment is a resource (CPU, memory) pool; it defines the minimum and maximum resources available. If we are using all of the compute environment's resources, then jobs will be queued and must wait for running jobs to complete before they can start.
The job queue associates a job with a compute environment. The job queue is transparent unless the compute environment is at full capacity and jobs must wait to be run.
The job definition is a template that we use when submitting jobs.
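As a sketch, submitting a job against that template with the AWS CLI might look like this (the job name, queue, job definition, and parameters below are placeholders, not the real resource names):

```bash
aws batch submit-job \
  --job-name cellranger-count-example \
  --job-queue cellranger-pipeline-queue \
  --job-definition cellranger-pipeline-job-def \
  --parameters experiment=run406_himc40_071018
```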
Our pipeline relies on various Python and Bash scripts, which can be found in the top-level, `scripts`, and `bin` directories. These scripts are responsible for tasks including the following:
- submitting jobs to AWS Batch
- pushing Dockerfile changes to AWS ECR
- validating our yaml config file
- running `mkfastq`, `bcl2fastq`, `count`, and `vdj`