diff --git a/docs/Bench-Protocol.md b/docs/Bench-Protocol.md new file mode 100644 index 0000000..bfb4d81 --- /dev/null +++ b/docs/Bench-Protocol.md @@ -0,0 +1,271 @@ +# Table of Contents + +- [Reagents](#reagents) +- [Consumables](#consumables) +- [Equipment](#equipment) +- [Day 1 - Picking colonies](#day-1---picking-colonies-10-15-mins-per-96-colonies) +- [Day 2 - Lysate and glycerol stock generation](#day-2---lysate-and-glycerol-stock-generation-1-2-hours) +- [Day 2 - "Miniaturized" Riptide protocol](#day-2---miniaturized-riptide-protocol-4-6-hours) + - [Random primer extension and biotinylated termination](#a-reaction-random-primer-extension-and-biotinylated-termination-1-hour) + - [DNA capture and library conversion](#b-reaction-dna-capture-and-library-conversion-1-2-hours) + - [Library Amplification](#pcr-amplification-45-mins) + - [Size Selection](#spri-bead-cleanup-and-gel-isolation-1-hour) +- [Day 2 - Library quantification and sequencing](#day-2---library-quantification-and-sequencing-setup-45-mins) + +# Reagents: + +- ddH2O +- 30% [Glycerol](https://www.fishersci.com/shop/products/glycerol-molecular-biology-fisher-bioreagents-2/BP2291) +- 0.1 M [Sodium Hydroxide](https://www.fishersci.com/shop/products/sodium-hydroxide-pellets-certified-acs-fisher-chemical-7/S318100) +- [2xYT](https://us.vwr.com/store/product/7437420/vwr-life-science-2xyt-medium-broth) +- 75 mM EDTA +- Ethanol +- Agarose +- TAE + +# Consumables: + +- [Riptide Kit](https://igenomx.com/product/riptide/) from iGenomX +- [Flat Bottom 96-well plates](https://www.sigmaaldrich.com/catalog/product/aldrich/br781602?lang=en&region=US) +- [96-well PCR plates](https://www.thermofisher.com/order/catalog/product/AB0600) +- [384-well PCR plates](https://www.thermofisher.com/order/catalog/product/4483317?SID=srch-hj-4483317) (optional) +- Foil Plate seal (Axygen PCR-AS-200) +- Plastic Plate seal (Axygen PCR-SP) +- [Gel Extraction 
Kit](https://www.zymoresearch.com/collections/zymoclean-gel-dna-recovery-kits/products/zymoclean-gel-dna-recovery-kit) +- [MiSeq sequencing kits](https://www.illumina.com/products/by-type/sequencing-kits/cluster-gen-sequencing-reagents/miseq-reagent-kit-v2.html) + +# Equipment: + +- [96-well thermal cycler](https://www.bio-rad.com/en-us/product/c1000-touch-thermal-cycler?ID=LGTW9415) +- [384-well thermal cycler](https://www.thermofisher.com/order/catalog/product/4388444) (optional) +- Multichannel 10 +- Multichannel 200 +- [Liquidator 20](https://www.shoprainin.com/Products/Pipettes-and-Tips/Pipettes/High-throughput-Pipetting/Liquidator%E2%84%A2-96/Liquidator-96-2-20-%C2%B5L-LIQ-96-20/p/17014207) (optional) +- [Liquidator 200](https://www.shoprainin.com/Products/Pipettes-and-Tips/Pipettes/High-throughput-Pipetting/Liquidator%E2%84%A2-96/Liquidator96%2C-5-200-%C2%B5L-LIQ-96-200/p/17010335) (optional) +- 37˚C shaking incubator +- Nanodrop, Qubit, BioAnalyzer or Tapestation +- [Illumina MiSeq](https://www.illumina.com/systems/sequencing-platforms/miseq.html) +- [Dynamag](https://www.thermofisher.com/order/catalog/product/12321D) (or magnetic tube stand) +- [Multi-tube vortexer](https://ru.vwr.com/store/product/596528/multi-tube-vortexers) (optional) + +# Day 1 - Picking colonies (~10-15 min per 96 colonies) + +1. Pick single colonies into 150 µl of 2xYT with appropriate antibiotic in a flat-bottom 96-well plate + - Different antibiotics give different plasmid yields in the lysate + - When cloning in a pooled library format, we usually pick a minimum of 4 colonies per desired clone. + - *Tip: We pick colonies using 200 µL pipette tips, put them back into the box, and then use a multichannel pipette to inoculate the 2xYT plate. This saves a lot of time and should allow you to pick a 96-well plate in about 10 minutes with some practice (potential upgrade is a colony picker - we have yet to identify a cost-effective and robust model)* +2. 
Tape the lid tightly to the plate to reduce evaporation and grow overnight at 37˚C in a shaking incubator + - a 96-well plate specific incubator with a short throw and high rpm is great, but we just use a normal incubator/shaker meant for flasks at 250 rpm + +# Day 2 - Lysate and glycerol stock generation (~1-2 hours) + +1. Transfer 50 µL of resuspended culture from each well using a liquidator/multichannel into a 96-well PCR plate +2. Pellet the bacteria by spinning at 4000 rcf for 10 min at RT in a swinging bucket rotor +3. Store the remaining 100 µL as glycerol stock by adding 100 µL of 30% glycerol to the plates, shake in incubator for 15 min, foil seal, and store in -80˚C freezer. +4. Flick and blot the media from the pelleted bacteria (media in this step will inhibit proper lysis) +5. Add 40 µL of MilliQ water to each pellet, plastic seal, and resuspend pellet by pipetting or with a multi-tube vortexer + - *tip: pulse by going from medium to max speed, then leave at near max speed for 5-10 secs* +6. Lyse cells in 96-well thermocycler by heating at 95˚C for 3 min and then cooling to 4˚C +7. Clarify lysate by spinning at 4000+ rcf for 10 min at RT +8. Remove the top 20 µL of clarified lysate and store in a separate 96 or 384-well plate +9. *At this point the protocol can be paused by freezing the lysate at -80˚C* + +# Day 2 - "Miniaturized" Riptide protocol (~4-6 hours) + +This protocol is adapted from the iGenomX Riptide library prep protocol [here](https://igenomx.com/resources/) + +## A Reaction: Random primer extension and biotinylated termination (~1 hour) + +Consumables needed for "A Reaction": + +``` +dNTP Mix 1 +10X Enzyme 1 Buffer +Primer A +Enzyme 1 +SPRI Beads 1 +75 mM EDTA +80% ethanol (freshly prepared) +10 mM Tris-HCl pH 8 +``` + +1. For each 96-well plate, prepare 200 µL of Reaction A master-mix by mixing + - 100 µL **dNTP Mix 1** + - 50 µL **10X Enzyme 1 Buffer** + - 50 µL **Enzyme 1** +2. 
Fill a 96-well plate with 2 µL each of the master-mix using a multichannel or liquidator + - Consider 384-well plates if working with 4 96-well plates +3. Pipette into each well 1 µL of **Primer A**, using liquidator or multichannel + - Make sure to properly mix the Primer A plate if thawing +4. Pipette into each well 2 µL of clarified lysate using liquidator or multichannel + - Make sure to properly mix lysate if thawing + - If sequencing mini-preps, use 2 µL of 5 ng/µL per well +5. Seal the plate with foil and run the following protocol on a thermocycler + - 92˚C for 3 min + - 16˚C for 5 min + - Slow ramp (0.1˚C/sec to 68˚C) + - 68˚C for 15 min + - Hold at 4˚C +6. **At this point the protocol can be paused by freezing the plate(s) at -20˚C** +7. Warm the **SPRI Beads I** to RT. +8. Stop the reaction with EDTA and pool the contents of each 96-well plate + - Option A: + 1. Use a liquidator to add 1 µL of **75 mM EDTA** to each well + 2. Use those same tips to aspirate and dispense the contents of each 96-well plate onto a clean liquidator tip-box lid and pool everything by tapping the lid at an angle. Make sure all drops are collected. + 3. Dispense pooled liquid into a low-retention Eppendorf tube (should recover around 400-500 µL of sample). + - Option B: + 1. Pool samples into a PCR strip tube already containing **75 mM EDTA**, using a multichannel +9. Ensure each 96-well plate has its own pooled tube +10. Add 1.8 volumes of thoroughly-mixed room temperature **SPRI Beads I** to each pooled tube (use a 1 mL pipette to measure the volume). Mix by pipetting, incubate 10 min at room temperature. +11. Place tube(s) on dynamag, allow the solution to clear (2-5 min), and discard supernatant. +12. Keeping tube(s) on the dynamag, add 1300 µL of freshly prepared **80% ethanol** to each tube, wait 30 secs, then aspirate ethanol. + - Repeat this step, and ensure that all ethanol is removed. +13. 
Open the caps and allow the beads to air dry on the dynamag for 10 min; don't overdry. +14. Add 50 µL of RT **10 mM Tris-HCl pH 8** to the beads +15. Remove tube(s) from dynamag, and resuspend the beads until homogeneous. Incubate at RT for 10 min to elute. +16. Allow the beads to clear on dynamag for 2 min, then transfer clarified elution to new low retention PCR tube(s) +17. **At this point the protocol can be paused by freezing the tubes at -20˚C** + +## B Reaction: DNA Capture and Library Conversion (~1-2 hours) + +Consumables needed for "B Reaction": + +``` +HS-Buffer +Capture Beads +Bead Wash Buffer +0.1M NaOH +Enzyme II +Enzyme II Buffer +dNTP Mix II +Primer B +Nuclease Free Water +``` + +1. Heat-denature the elution at 95˚C for 3 min and hold at 4˚C in a thermocycler +2. While it's heating, prep the **Capture Beads** + 1. Warm the **Capture Beads** and **HS-Buffer** to RT, resuspend thoroughly, and transfer 20 µL of slurry for each sample into a new PCR tube. + 2. Dynamag the **Capture Beads** and discard the supernatant. + 3. Remove tubes from dynamag and add 100 µL of **HS-Buffer** + 4. Dynamag and remove the wash. + 5. Resuspend the beads in 20 µL of **HS-Buffer**. +3. Add all 50 µL of the heat-denatured elution to the washed **Capture Beads**, mix, and incubate at RT for 10 min. +4. Pipette mix beads again and incubate at RT for 10 min. +5. Dynamag the beads and discard the supernatant. +6. Resuspend beads with 50 µL of **0.1M NaOH**, incubate 4 min at RT, dynamag and remove supernatant +7. Wash the beads 3 times (resuspend beads in 100 µL of RT **Bead Wash Buffer**, dynamag, remove supernatant). Make sure to remove any remaining liquid after final wash. +8. Prepare a mastermix for **Reaction B**: for every tube of beads, mix together on ice + - 4 µL 5x **Enzyme II Buffer** + - 1.5 µL **dNTP Mix II** + - 2 µL **Primer B** + - 12 µL **Nuclease-Free Water** + - 0.5 µL **Enzyme II** +9. 
Quickly add 19.5 µL of **Reaction B** mastermix to each tube (try not to let mastermix sit around at RT) +10. Incubate the tubes in a thermocycler for 20 min at 24˚C and hold at 4˚C for at least 3 min +11. Pipette mix the beads, dynamag and discard supernatant +12. Wash the beads 3 times (resuspend beads in 100 µL of RT **Bead Wash Buffer**, dynamag, remove supernatant). Make sure to remove any remaining liquid after final wash. + +## PCR Amplification (45 mins) + +Consumables needed for PCR Amplification: + +``` +Universal PCR primer +Index PCR primer(s) +2X PCR amplification mix from iGenomX +ddH2O +``` + +1. Resuspend the beads in 21 µL of nuclease-free water, then set up the PCR reaction by adding: + - 2 µL **Universal PCR primer** + - 2 µL **Index PCR primer** (1-12) (choose one barcoded primer per pool) + - 25 µL **2X PCR Amplification Mix** + - 50 µL total +2. Input the following program into a thermocycler + +``` +1 cycle: 98˚C, 2 min +11 cycles: 98˚C, 20 sec + 60˚C, 30 sec + 72˚C, 30 sec +1 cycle: 72˚C, 5 min + 4˚C, hold +``` + +3. Record which samples received which Index PCR primer +4. Sample can be left in the thermocycler at 4˚C overnight. +5. Briefly spin the PCR tube in a picofuge, dynamag and transfer the supernatant to new low retention eppendorf 1.5 mL tubes. Discard the PCR tubes containing the Capture Beads. + +## SPRI bead cleanup and gel-isolation (~1 hour) + +Consumables needed for SPRI bead cleanup and gel-isolation: + +``` +SPRI Beads II +1% Agarose +TAE +80% Ethanol +10 mM Tris-HCl pH 8 +Gel-extraction kit +SYBR Safe +``` + +1. Add 70 µL of well resuspended **RT SPRI Beads II** to the samples. Mix well and incubate at RT for 10 min. +2. During this incubation, pour a 1.0% agarose gel with **SYBR Safe** for gel-extraction +3. Dynamag the beads for at least 2 min, and discard supernatant. +4. 
Add 200 µL of **80% ethanol** to each tube, wait 30 sec, then remove and discard the ethanol + - it is unnecessary to remove from magnet for the ethanol wash +5. Repeat with another 200 µL of **80% ethanol**, carefully remove all ethanol from tube without disturbing beads. +6. Open cap and allow to air dry for 10 min on dynamag (careful not to overdry). +7. Add 25 µL of RT **10 mM Tris-HCl pH 8** to beads. Remove from dynamag and fully resuspend the beads. +8. Incubate at RT for 10 min to elute, place back on dynamag and transfer supernatant to new low-retention tubes. +9. Run sample(s) on a **1.0% agarose** gel and gel extract. It should come out as a visible smear; isolate the 400-1200 bp region, taking care to avoid the potential primer dimer band. +10. **At this point the protocol can be paused by putting the tube(s) at -20˚C** + +# Day 2 - Library Quantification and sequencing setup (45 mins) + +Consumables needed for library quantification and sequencing: +- Appropriate MiSeq Kit (see below) + +Every quantification method is slightly different and each comes with its own pros and cons. These issues are exacerbated by the Riptide library product, which is a heterogeneous mixture of DNA fragment sizes (*see above*). The way we reduce variability in quantification is to quantify a previously run OCTOPUS library and use that as a baseline to estimate the concentration of the current libraries. In our hands, a fluorescence-based assay gives sufficiently accurate quantification alongside a previously run library. If no previously run libraries are available, quantification by qPCR has been the most accurate method. +1. 
Select the appropriate sequencing kit: as a rule of thumb, 10,000 paired-end 150 reads/well, or 1,000,000 reads per 96-well plate (nano-kit V2 for one 96-well plate, micro-kit V2 for a 384-well plate, and a standard V2 kit for anything more than 384-well, the most we do is 3 x 384-wells due to the limited number of index primers in the riptide kit) +2. The only information necessary to generate the sample sheet is to provide the plate index sequences (see below) + +**Example sample-sheet.csv:** + +``` +[Header],,,,,, +IEMFileVersion,5,,,,, +Investigator Name,Octonaut,,,,, +Experiment Name,20190826_OCTOPUS_plate001-012,,,,, +Date,8/26/19,,,,, +Workflow,GenerateFASTQ,,,,, +Application,FASTQ Only,,,,, +Instrument Type,MiSeq,,,,, +Assay,Nextera DNA,,,,, +Index Adapters,"Nextera Index Kit (24 Indexes, 96 Samples)",,,,, +Description,Test on known samples,,,,, +Chemistry,Amplicon,,,,, +,,,,,, +[Reads],,,,,, +151,,,,,, +151,,,,,, +[Settings],,,,,, +,,,,,, +[Data],,,,,, +Sample_ID,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2 +OCTOPUS-plate001,plate001,,A001,ATCACG,, +OCTOPUS-plate002,plate002,,A002,CGATGT,, +OCTOPUS-plate003,plate003,,A003,TTAGGC,, +OCTOPUS-plate004,plate004,,A004,TGACCA,, +OCTOPUS-plate005,plate005,,A005,ACAGTG,, +OCTOPUS-plate006,plate006,,A006,GCCAAT,, +OCTOPUS-plate007,plate007,,A007,CAGATC,, +OCTOPUS-plate008,plate008,,A008,ACTTGA,, +OCTOPUS-plate009,plate009,,A009,GATCAG,, +OCTOPUS-plate010,plate010,,A010,TAGCTT,, +OCTOPUS-plate011,plate011,,A011,GCCTAC,, +OCTOPUS-plate012,plate012,,A012,CTTGTA,, +``` + +The sequencing run will take ~16-30 hours depending on the size of the kit (nano, micro, standard) \ No newline at end of file diff --git a/docs/Experimental-Validation.md b/docs/Experimental-Validation.md new file mode 100644 index 0000000..b545e85 --- /dev/null +++ b/docs/Experimental-Validation.md @@ -0,0 +1,11 @@ +## OCTOPUS is Reproducible and accurate + +To better understand the reproducibility and accuracy of OCTOPUS, we 
transformed a previously validated 8.6 kilobase plasmid into E. coli and used OCTOPUS to sequence one colony in 192 different wells. With ~1,000,000 paired-end 150 reads per plate (or ~10,000 paired-end reads per well), the coverage throughout the plasmid was relatively uniform (gray ribbon is the inter-quartile range of coverage), with an average coefficient of variation of 0.485 across the wells. In this run, OCTOPUS correctly verified 157/192 (81.8%; open) wells, with at least 3x coverage across the entire plasmid. The remaining 35/192 (18.2%; blue) wells had less than 3x coverage in some parts of the plasmid, and 5 of those reported a variant (red). We thought this was unlikely given the low somatic mutation rate of E. coli, and by manually inspecting the read pileups for those wells, we found that these likely incorrectly reported variants were in regions of 0 coverage. For this reason we don’t recommend accepting plasmids with less than 3x coverage without manual inspection. + +![](https://github.com/octantbio/octopus/blob/master/img/coverage-rank.png) + +## OCTOPUS Works on Diverse Sequences + +We analyzed a typical 384-well OCTOPUS run (~10,000 paired-end 150 reads per well) of a pooled GPCR cloning reaction for coverage. We excluded 45/384 wells that lacked sufficient coverage to properly identify the construct and 8/384 wells that lacked an insert. Of the remaining wells, the majority (173/331; 52.3%) have at least 10x coverage and almost all (300/331; 90.6%) have at least 3x coverage across the entire plasmid. These data suggest that OCTOPUS is able to verify a diverse set of sequences with sufficient coverage. 
+ +![](https://github.com/octantbio/octopus/blob/master/img/gpcr-frac-coverage.png) diff --git a/docs/Installation.md b/docs/Installation.md new file mode 100644 index 0000000..4066b40 --- /dev/null +++ b/docs/Installation.md @@ -0,0 +1,45 @@ +# Prerequisites + +## Docker + +The computational pipeline is contained in a [docker image](https://hub.docker.com/repository/docker/octant/octopus). We strongly recommend using it for your analyses. Follow the official docs for: [MacOS](https://docs.docker.com/docker-for-mac/install/), [Linux](https://docs.docker.com/install/linux/docker-ce/ubuntu/), or [Windows](https://docs.docker.com/docker-for-windows/install/). + +Next, choose one of the following: + +**A.** Pull from DockerHub + +``` +docker pull octant/octopus +``` + +**B.** Build it yourself + +``` +git clone https://github.com/octantbio/octopus.git +cd octopus +docker build . +``` + +It is possible to run OCTOPUS without `docker`, but not recommended. If you insist, see the [dockerfile](docker/Dockerfile) for help. + +## OCTOPUS + +Clone this repository + +``` +git clone https://github.com/octantbio/octopus.git +``` + +## Hardware + +For optimal performance, we recommend deploying on a machine with at least 32 GB of RAM and 16 cores. We use [GNU Parallel](https://www.gnu.org/software/parallel/) to distribute tasks where possible, and empirically, the RAM requirements seem to scale with the number of cores (e.g. 64 GB of RAM for 64 cores). The compute times (for 384 wells) scale asymptotically, suggesting a potential disk bottleneck. + +``` +64 cores -> 64 GB RAM -> 13 Mins +32 cores -> 32 GB RAM -> 15 Mins +24 cores -> 24 GB RAM -> 20 Mins +16 cores -> 16 GB RAM -> 25 Mins +8 cores -> 8 GB RAM -> 45 Mins +``` + +We performed all trials on two Intel(R) Xeon(R) Gold 6130 CPUs @ 2.10GHz, limiting the cores through the docker `--cpuset-cpus=N` flag, and estimated peak RAM usage with `/bin/free`. 
\ No newline at end of file diff --git a/docs/Pipeline-Details.md b/docs/Pipeline-Details.md new file mode 100644 index 0000000..7e8e629 --- /dev/null +++ b/docs/Pipeline-Details.md @@ -0,0 +1,111 @@ +The OCTOPUS pipeline performs the following steps: + +# 1. Demultiplexing + +Due to the nature of the iGenomX protocol, there are effectively two demultiplexing steps. The first, automatically performed by the sequencer, reads standard Illumina indices to split your experiments up into plates. The second, handled here, uses [Fulcrum Genomics' fgbio](https://github.com/fulcrumgenomics/fgbio) to demultiplex each plate into individual wells. Note that `fgbio` will parse [src/igenomx-meta.txt](src/igenomx-meta.txt) for iGenomX's pre-specified primer indices. If you are using custom primers, please modify `src/igenomx-meta.txt` and/or the `--metadata` flag in the `Makefile`. + +# 2. Read Pre-processing + +To ensure a high quality _de novo_ assembly, we perform a number of processing steps. This protocol is adapted from one included with the Joint Genome Institute's [BBTools](https://jgi.doe.gov/data-and-tools/bbtools/), and is handled by [jgi-preproc.sh](src/jgi-preproc.sh). Broadly, it removes optical duplicates, trims Illumina adapters, filters contaminants, and error-corrects the remaining reads. For this application, we filter out PhiX, a list of known sequencing artifacts (included with BBTools), and the NEB 5a genome. + +## Alternative Contaminants + +Users with other applications can filter out a different set of contaminants (in the form of a fasta file) by replacing `src/background.fasta` with their own `src/background.fasta`. 
Alternatively, you can modify the `Makefile` as follows: + +``` +old: pipeline/%/preproc: pipeline/%/demux src/background.fasta +new: pipeline/%/preproc: pipeline/%/demux path/to/your/fasta +``` + +If you would like to ignore PhiX reads or the list of known artifacts, update our preprocessing script [jgi-preproc.sh](src/jgi-preproc.sh) as follows: + +``` +old: ref=artifacts,phix,${CONTAM_REF} \ +new: ref=${CONTAM_REF} \ +``` + +# 3. _De Novo_-based Identification + +With the reads processed, we then attempt to assemble each well using [SPAdes](http://cab.spbu.ru/software/spades/). Following the JGI protocol, we attempt to merge these reads, quality trim any overlaps, and feed those to SPAdes for assembly with [jgi-denovo.sh](src/jgi-denovo.sh). Depending on your application, you may need to modify the SPAdes settings (or try a different assembler). + +We then align the _de novo_ assembly products to the user-specified library (the `input.fasta` that you dropped into your sequencing run folder) to identify what's in each well using [minimap2](https://lh3.github.io/minimap2/). Since we will not know the orientation of the resulting assembly, we concatenate (or flatten) our reference library before the alignment with [flatten-fasta.py](src/flatten-fasta.py). This ensures we can align the entire assembly. For example, if our plasmid (`ref` below) starts at 0, and the _de novo_ assembly (`asm` below) happens to start at 6, the resulting alignment would look like + +``` +ref: 01234567890123456789 +asm: 6789012345 +``` + +as opposed to + +``` +ref: 0123456789 +asm: 6789012345 +``` + +It should be noted that the current version of `SPAdes` (3.13.0) produces an assembly with the same starting and ending k-mer. This will not affect the alignment (see below) but users relying on these assemblies (found in `pipeline/your-run-id/spades-contigs.fasta`) should take this into account. + +``` +ref: 12345678901234567890 +asm: 67890123456 <- 6 is repeated! +``` + +# 4. 
Variant Calling + +While aligning the _de novo_ assembly can reveal variants, we wanted finer-grained control over the process. Thus, we use [BBMap](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/) to align the processed reads in each well to their cognate plasmid identified in the previous step. We then use [freebayes](https://github.com/ekg/freebayes) to call variants at positions where an alternate base call makes up at least 50% of the reads (e.g. 4 reads total, 2 call A, 2 call T -> variant call). To reduce the possibility of a sequencing error introducing a false positive, we exclude basecalls with Q<20. If you would like to adjust these parameters, please edit [denovo-guided-assembly.sh](src/denovo-guided-assembly.sh). + +# 5. Quality Control + +OCTOPUS provides a number of different quality control metrics to ensure the plasmids you select are correct. + +## Percent Contaminants + +The percentage of reads in each well that are from "contaminants" as specified in the [preprocessing](1.-read-preprocessing) step (PhiX, Illumina artifacts, and DH5a). + +## Coverage + +We report the percent of bases with <10x and <3x coverage. We recommend inspecting plasmids with a high percentage of bases at <3x to ensure critical regions are adequately covered. We provide `.bam` files for each well to assist in this process. For example, in `/path/to/octopus/pipeline/your-run-id/` you could run + +``` +samtools index plate-id/well-id.map.bam +samtools tview plate-id/well-id.map.bam lib/.fasta +``` + +to view a pileup of all the reads. Note that you will have to specify what reference file the well aligns to. + +## Barcode Filter + +Adding a barcode to each of our plasmids enables a number of useful downstream applications. For cloning, the barcode enables us to detect colonies that have multiple plasmids. To specify a barcode, simply place a string of N's as long as the barcode in the reference fasta. 
OCTOPUS will automatically detect barcodes declared in this fashion and use them to check for plasmid contamination. First, we generate a pileup of all the reads at the barcode and collapse them at a [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) of 1 to eliminate potential sequencing errors + +``` +ATGC ---> ATGC 4 +ATGC / TTAA 1 +ATGC / +ATGA / +TTAA +``` + +We can use the relative frequencies of the barcodes to determine if the well is contaminated. Importantly, the iGenomX protocol will have a low level of template switching. This will result in a large number of unique barcodes with few reads + +``` +AACCTTGGCCTTAA 50 +ATGCATTACAGACA 5 +TTACCATTCATGAT 2 +AGGGACCGATTAGC 1 +GGTATTAGGCCATA 1 +CTATAGCATTGCAT 1 +... +``` + +True contamination typically results in a secondary barcode with many reads + +``` +AACCTTGGCCTTAA 50 +ATGCATTACAGACA 10 +TTACCATTCATGAT 2 +AGGGACCGATTAGC 1 +GGTATTAGGCCATA 1 +CTATAGCATTGCAT 1 +... +``` + +With this in mind, we've developed a filtering heuristic that empirically eliminates wells that are actually contaminated without being too conservative. Specifically, if the second most common barcode is >10% of the total number of reads (10/65 ~15% in the second example above), we call that well contaminated. If the second most common barcode is <4% of the total reads, it's most likely template switching and we call the well clean. Lastly, if the second most common barcode is >4% and <10% of the reads (5/60 ~8% in the first example), we check whether the ratio of the second to the third most common barcode is < 2 (in the first example, 5/2 > 2). This ratio test is designed to capture our observation that true plasmid contaminations are a distinct population, while barcodes from template switching are uniformly represented with low counts. In the case where there are only two barcodes, we will call the well contaminated if the second barcode makes up >4% of the reads. 
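
The decision rules above can be sketched in a few lines of Python. This is an illustration only, not the code the pipeline runs: the function name `is_contaminated` and its input (a mapping from collapsed barcode sequence to read count) are hypothetical, and we assume that a ratio of 2 or more in the ratio test means the well is called contaminated, which the text implies but does not state outright. Fractions falling exactly on 4% or 10% are treated as part of the middle band.

```python
def is_contaminated(barcode_counts):
    """Apply the barcode-frequency heuristic to one well.

    barcode_counts: dict mapping barcode sequence -> read count,
    after collapsing barcodes within Levenshtein distance 1.
    Returns True if the well would be called contaminated.
    """
    # Rank barcodes by read count, most common first.
    counts = sorted((c for c in barcode_counts.values() if c > 0), reverse=True)
    if len(counts) < 2:
        return False  # a single barcode cannot indicate a second plasmid

    frac_second = counts[1] / sum(counts)
    if frac_second > 0.10:
        return True   # strong secondary barcode
    if frac_second < 0.04:
        return False  # consistent with low-level template switching

    # Middle band (4% to 10%).
    if len(counts) == 2:
        return frac_second > 0.04
    # ASSUMPTION: ratio >= 2 means the second barcode is a distinct population.
    return counts[1] / counts[2] >= 2


# Counts taken from the two examples in the text (the elided "..." rows are ignored).
template_switching_example = {
    "AACCTTGGCCTTAA": 50, "ATGCATTACAGACA": 5, "TTACCATTCATGAT": 2,
    "AGGGACCGATTAGC": 1, "GGTATTAGGCCATA": 1, "CTATAGCATTGCAT": 1,
}
contamination_example = dict(template_switching_example, ATGCATTACAGACA=10)

print(is_contaminated(contamination_example))       # 10/65 > 10% -> True
print(is_contaminated(template_switching_example))  # 5/60 in middle band, 5/2 >= 2 -> True
```

Under this reading, the first example falls in the middle band and fails the ratio test, so the thresholds matter: if your libraries show more template switching, you may want to revisit the 4%, 10%, and ratio cutoffs.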
\ No newline at end of file diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..418381e --- /dev/null +++ b/docs/README.md @@ -0,0 +1,10 @@ +Welcome to the OCTOPUS wiki! + +Read more about each topic here: + +- [Bench Protocol](Bench-Protocol.md) +- [Experimental Validation](Experimental-Validation.md) +- [Installation](Installation.md) +- [Pipeline Details](Pipeline-Details.md) +- [Running the Analysis Pipeline](Running-the-Analysis-Pipeline.md) + diff --git a/docs/Running-the-Analysis-Pipeline.md b/docs/Running-the-Analysis-Pipeline.md new file mode 100644 index 0000000..03bbd42 --- /dev/null +++ b/docs/Running-the-Analysis-Pipeline.md @@ -0,0 +1,79 @@ +# 1. Linking Data + +As OCTOPUS uses `make` to orchestrate everything, there are some conventions your data must adhere to. First, deposit the output folder from your sequencing run in the `data` directory + +``` +cp -r /path/to/run-id /path/to/octopus/data +``` + +under a unique folder. We typically use the default folder name produced by the sequencer as an identifier. Second, many steps in the OCTOPUS pipeline will process the file name of the fastq's. To avoid issues, make sure all unique information is contained before the first underscore in your SampleSheet (most Illumina sequencers will automatically convert any \_'s in the `Sample_Name` column of the SampleSheet to -'s anyways). Importantly, the pipeline will trim out anything between the first underscore and the read specifier (e.g. `my-reads_foo_bar_baz_R1.fastq.gz -> my-reads_R1.fastq.gz`) to ensure everything behaves properly. + +Alternatively, you can manually add fastq's under `./octopus/pipeline/*run-id*/fastqs` provided they are not symlinks to outside of the `octopus` folder (if you are following our docker instructions). + +## Reference Library + +Next, place a fasta file containing the sequences of the plasmids you are trying to sequence under `./octopus/data/*run-id*/input.fasta`. 
The OCTOPUS pipeline will also automatically parse any barcodes in the form of N's for downstream analyses. + +Similar to the fastq's, you can manually place `input.fasta` at `./octopus/pipeline/*run-id*/input.fasta`. + +### De Novo Assembly + +If you do not know your input, run `make de-novo` instead to take the pipeline through the _de novo_ assembly step. If you forget and run `make all`, the pipeline will throw an error. + +# 2. Running the Pipeline + +After getting the data in place, make sure you `cd` into your octopus folder. From there we can drop into our docker image with + +``` +docker run --rm -it -v "$(pwd)":/root/octopus octant/octopus /bin/bash +``` + +This links your octopus folder (`/path/to/your/octopus`) to the docker image (`/root/octopus`). Note that Docker requires you to specify the *absolute* path to the folder (`$(pwd)` is a handy shortcut to do that for you). Also, be aware that `--rm` makes the image ephemeral, so anything written outside of the octopus directory will be lost if you log out of the shell. From the Docker image, we can + +``` +cd octopus +make all +``` + +to run the pipeline on every sequencing run in the `./data` directory and produce `octopus/pipeline/*run-id*/aggregated-stats.tsv`. You will get an error if you did not place the `input.fasta` file under `data/*run-id*/input.fasta`. If you don't have one try `make denovo`. + +## aggregated-stats.tsv + +As the name suggests, results pertinent to an OCTOPUS run are aggregated into a `tsv` file for your analysis. 
The columns are: + +- `Run`: Illumina run ID +- `Plate`: plate ID +- `Well`: well address +- `Plate_Well`: unique plate\_well identifier +- `DeNovo_Ref`: well identity based on aligning _de novo_ assembly to reference library +- `CIGAR`: CIGAR string from aligning the _de novo_ assembly to `DeNovo_Ref` +- `LT_10`: percentage of input reference with < 10x coverage (ideally close to 0) +- `LT_3`: percentage of input reference sequence with < 3x coverage (if not 0 inspect read pileup) +- `BC_Contam`: are there multiple plasmids in this well (TRUE/FALSE)? ([more details](https://github.com/octantbio/octopus/wiki/Pipeline-Details#barcode-filter)) +- `n_vars`: number of variants detected by FreeBayes (note barcodes count as variants) +- `n_barcodes`: number of barcodes detected +- `expected_bcs`: expected number of barcodes based on the reference (in a perfect plasmid `n_vars = n_barcodes = expected_bcs`) +- `bc_1`: sequence of barcode 1 pulled from the variant caller (may be reverse complement) +- `pos_1`: position of barcode 1 in _de novo_ assembly +- `bc_N`: sequence of barcode N pulled from the variant caller (may be reverse complement; NA if missing) +- `pos_N`: position of barcode N in _de novo_ assembly (NA if missing) +- `Contaminants`: number of reads from "contaminants" ([more details](https://github.com/octantbio/octopus/wiki/Pipeline-Details#alternative-contaminants)) +- `Leftover`: number of reads in well leftover after filtering out "contaminants" +- `Percent_aligned`: percentage of Leftover reads that align with the reference sequence. +- `Contig`: the _de novo_ assembly. Note the first and last N bases (often 55 or 125) are repeated +- `Ref_Seq`: sequence that _de novo_ assembly aligns to + +# 3. Picking perfects + +One way you can analyze the results is by pasting the `aggregated-stats.tsv` into a spreadsheet + +1. If applicable, filter out any "TRUE" values under `BC_Contam` +2. If applicable, flag or filter out any duplicate barcodes +3. 
Filter out any unexpected variants. The pipeline will automatically detect any strings of N's in the `input.fasta` and report the number of `expected_bcs` for that reference. Perfect clones should have `expected_bcs = n_barcodes = n_vars` +4. Ensure that there is adequate coverage by checking `LT_10` and `LT_3`. We recommend only picking wells with `LT_3 = 0`. You can be more conservative by using `LT_10` to specify your cutoff. (For a 10 kb plasmid an `LT_3` of 0.001 means that 10 bp of the plasmid did not have a coverage of at least three). +5. If there happens to be a clone that does not have sufficient coverage (by `LT_10` or `LT_3`) but is absolutely required, use `samtools tview` to manually inspect the read pileup in critical areas of your plasmid: + 1. In a new terminal, `cd` into your octopus directory + 2. Open up a new docker instance - `docker run --rm -it -v "$(pwd)":/root/octopus octant/octopus /bin/bash` + 3. Navigate into the folder that contains the analyzed data from that run - `cd octopus/pipeline/your_run_id` + 4. View the pileup - `samtools tview your_plate/your_well.map.bam lib/your_ref.fasta` + - Note `your_ref` will be `DeNovo_Ref` in `aggregated-stats.tsv` \ No newline at end of file
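
If you would rather script the spreadsheet filters than apply them by hand, steps 1, 3, and 4 above can be sketched with Python's standard `csv` module. This is a rough sketch, not part of the pipeline: the function name `pick_perfects` is hypothetical, the example rows are made up, and we assume `BC_Contam` is written as the text `TRUE`/`FALSE` and that the count columns parse as integers (rows with `NA` or other unparseable values are skipped). Duplicate-barcode flagging (step 2) is left out.

```python
import csv
import io


def pick_perfects(tsv_handle):
    """Return the Plate_Well IDs that pass the contamination,
    variant-count, and coverage filters described above."""
    perfects = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        # Step 1: drop wells flagged as containing multiple plasmids.
        if row["BC_Contam"].strip().upper() == "TRUE":
            continue
        try:
            expected = int(row["expected_bcs"])
            n_barcodes = int(row["n_barcodes"])
            n_vars = int(row["n_vars"])
            lt_3 = float(row["LT_3"])
        except ValueError:
            continue  # ASSUMPTION: NA or malformed values mean "do not pick"
        # Step 3: the only variants should be the expected barcodes.
        if not (expected == n_barcodes == n_vars):
            continue
        # Step 4: require at least 3x coverage everywhere.
        if lt_3 != 0:
            continue
        perfects.append(row["Plate_Well"])
    return perfects


# Made-up rows with only the columns this sketch needs.
EXAMPLE_TSV = (
    "Plate_Well\tBC_Contam\tn_vars\tn_barcodes\texpected_bcs\tLT_3\n"
    "plate001_A01\tFALSE\t1\t1\t1\t0\n"
    "plate001_A02\tTRUE\t1\t1\t1\t0\n"
    "plate001_A03\tFALSE\t2\t1\t1\t0\n"
    "plate001_A04\tFALSE\t1\t1\t1\t0.001\n"
)

print(pick_perfects(io.StringIO(EXAMPLE_TSV)))  # ['plate001_A01']
```

On a real run you would pass an open file handle for `pipeline/your_run_id/aggregated-stats.tsv` instead of the in-memory example.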