This is a collection of the computational methods used for the publication Increased efficiency in identifying mixed pollen samples by meta-barcoding with a dual-indexing approach . Here we provide all custom scripts and commands to replicate our results. Furthermore you can download the pre-computed training sets for the RDPclassifier here and UTAX here. If you find this information useful please consider citing our article.
This is a list of software tools and versions used in the original analysis. We provide precomputed results for most steps so don’t worry if you can’t get a certain tool in a certain version. If you get substantially different results (e.g. using other versions/tools) feel free to contact me. You can also do so by opening an Issue here on GitHub.
Most of the analyses were performed on a Laptop with
- Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz Processor
- 16GB RAM
- Ubuntu 14.04 operating system
Only the RDP training was performed on a machine with more RAM available.
- ITS2 database (Version 4.0.0 and inofficial release (included))
- UTAX (Version usearch v8.0.1477_i86linux32) ATTENTION: does not work with versions >=8.1 due to changes in the utax algorithm
- RDPclassifier (Version master from GitHub (63c637b2c8b3b941cf021d1e549caec0041b4c3c, 2015-02-20))
- NCBI::Taxonomy - perl module (Version v0.70.5, doi:10.5281/zenodo.17375)
- R (Version 3.1.2 (2014-10-31) – “Pumpkin Helmet”, Platform: x86_64-pc-linux-gnu (64-bit))
- perl (Version v5.20.2 built for x86_64-linux-gnu-thread-multi)
- QIIME (Version 1.8.0+dfsg-4)
- fastq-join (Version 1.01.759)
If you want to start to classify your own data right now, this is the easiest way to do it:
- Make sure you have installed perl, usearch, fastq-join (and RDPclassifier if needed).
- Put all your fastq read files in an empty directory
- Execute the following commands:
# git clone the repository
git clone https://github.com/iimog/meta-barcoding-dual-indexing
# alternatively you can download and extract the release
cd meta-barcoding-dual-indexing
mkdir myanalyses
cd myanalyses
# extract the plant reference databases for classification
tar xzvf ../training/rdp/rdp_trained.tar.gz
tar xzvf ../training/utax/utax_trained.tar.gz
# this step is not required but recommended
# if you choose to skip it use .fa instead of .udb when calling classify_reads.pl
usearch8.0 -makeudb_usearch utax_trained/viridiplantae_all_2014.utax.fa\
-output utax_trained/viridiplantae_all_2014.utax.udb
# call the wrapper script that performs:
# joining, filtering, classification and aggregation for all samples
perl ../code/classify_reads.pl --out results <path_to_reads>/*.fastq\
--utax-db utax_trained/viridiplantae_all_2014.utax.udb\
--utax-taxtree utax_trained/viridiplantae_all_2014.utax.tax\
--rdp --rdp-jar <path_to_RDPTools>/classifier.jar\
--rdp-train-propfile rdp_trained/its2.properties
The ITS2 sequences can be retrieved from the ITS2 database. The previous reference db consisted of all “Direct folds & Homology modeled” sequences from “Viridiplantae” (db version v4.0.0, 2011). This file viridiplantae_folds_2011.fasta is included for comparison. The new reference database consists of all sequences from “Viridiplantae” of an inofficial ITS2 database release (2014). This file viridiplantae_all_2014.fasta is also included in this repository.
Assign NCBI taxonomy on kingdom, phylum, class, order, family, genus and species level by mapping gi to taxid.
ATTENTION: Results of the taxonomic assignment may vary depending on the date of calculation as NCBI Taxonomy is continuously revised. The viridiplantae_all_2014 dataset was processed on <2015-05-28 Do> and the viridiplantae_folds_2011 dataset on <2015-06-29 Mo>.
Create an analysis folder and execute the following commands there (you need the NCBI::Taxonomy module for this):
# First create a list of gi numbers from the fasta file
# This only works if the header is in the form
# '>123456 any description' with 123456 being the taxid without any prefix like 'gi|'
# If your fasta headers have a different format adjust the substitude expression accordingly.
grep "^>" ../data/viridiplantae_all_2014.fasta |
perl -pe 's/^>(\d+).*/$1/' >viridiplantae_all_2014.gis
# Now find taxonomic lineages for the gis
perl ../code/gi2taxonomy.pl\
--gis viridiplantae_all_2014.gis\
--out viridiplantae_all_2014.tax\
--species viridiplantae_all_2014.species.taxids\
--genus viridiplantae_all_2014.genus.taxids
# This is only needed for comparison of the old reference db to the new one
grep "^>" ../data/viridiplantae_folds_2011.fasta |
perl -pe 's/^>(\d+).*/$1/' >viridiplantae_folds_2011.gis
perl ../code/gi2taxonomy.pl\
--gis viridiplantae_folds_2011.gis\
--out viridiplantae_folds_2011.tax\
--species viridiplantae_folds_2011.species.taxids\
--genus viridiplantae_folds_2011.genus.taxids
This generates the following files:
- viridiplantae_all_2014.gis
- viridiplantae_all_2014.tax
- viridiplantae_all_2014.species.taxids
- viridiplantae_all_2014.genus.taxids
and
- viridiplantae_folds_2011.gis
- viridiplantae_folds_2011.tax
- viridiplantae_folds_2011.species.taxids
- viridiplantae_folds_2011.genus.taxids
All of those are also included in the precomputed folder.
ATTENTION If the gi2taxonomy.pl command throws the following error message:
20xx/xx/xx xx:xx:xx Unable to open taxonomic database at './t/data//gi_taxid.bin'
Unable to open taxonomic database at './t/data//gi_taxid.bin' at /xxx/xxx/NCBI-Taxonomy/lib//NCBI/Taxonomy.pm line 162
You have to download an NCBI Taxonomy dump by running:
<in NCBI::Taxonomy dir>: ./make_gi_taxid.pl --overwrite
And then adjust the $TAXDIR variable in NCBI-Taxonomy/lib/NCBI/Taxonomy.pm line 28.
The following commands executed in the analysis folder generate the required fasta and tax files for RDP and UTAX:
perl ../code/tax2rdp_utax.pl viridiplantae_all_2014.tax\
../data/viridiplantae_all_2014.fasta viridiplantae_all_2014
This generates the following files:
- viridiplantae_all_2014.gi_tax.map
- viridiplantae_all_2014.rdp.fa
- viridiplantae_all_2014.rdp.tax
- viridiplantae_all_2014.utax.fa
- viridiplantae_all_2014.utax.tax
The first three are also included in the precomputed folder. And the last two are included in the training/utax folder. The required file format changed in the new versions of usearch. The compatible file is included in
The utax files are ready to be used for classification. However to speed up the initial step a udb file can be created as follows (not needed for SINTAX):
usearch8.0 -makeudb_usearch viridiplantae_all_2014.utax.fa\
-output viridiplantae_all_2014.utax.udb
This creates the file viridiplantae_all_2014.utax.udb which is not included as it is not required and its size is 225MB. To train the RDPclassifier execute the following commands (warning for the train command 16GB RAM did not suffice, but 32 did):
mkdir rdp_trained
java -jar classifier.jar rm-dupseq --infile viridiplantae_all_2014.rdp.fa\
--outfile viridiplantae_all_2014.rdp.rm-dupseq.fa\
--duplicates --min_seq_length 150
java -jar classifier.jar rm-partialseq viridiplantae_all_2014.rdp.fa\
viridiplantae_all_2014.rdp.rm-dupseq.fa\
viridiplantae_all_2014.rdp.rm-dupseq.rm-partialseq.fa\
--alignment-mode overlap --min_gaps 50 --knn 20
java -Xmx32g -jar classifier.jar train --out_dir rdp_trained\
--seq viridiplantae_all.rdp.rm-dupseq.rm-partialseq.fa\
--tax_file viridiplantae_all.rdp.tax
cp data/its2.properties rdp_trained/its2.properties
This generates the following files:
All of those are also included in the precomputed folder. And the folder rdp_trained including five files:
- rdp_trained/bergeyTrainingTree.xml
- rdp_trained/genus_wordConditionalProbList.txt
- rdp_trained/its2.properties
- rdp_trained/wordConditionalProbIndexArr.txt
- rdp_trained/logWordPrior.txt
Those are the files required for RDP classification and are included as rdp_trained.tar.gz in training/rdp
Now you have everything you need to classify sequences with either RDP classifier or UTAX/SINTAX.
Sintax is not yet included in the classify_reads.pl script. To classify your reads run this on preprocessed (paired-end merged, primer-trimmed and filtered) fasta files:
for f in <path_to_reads>/*.fasta
do
out=$(basename $f .fasta)
usearch10 -sintax $f \
-db precomputed/viridiplantae_all_2014.sintax.fa \
-tabbedout $out.sintax.txt \
-strand both \
-sintax_cutoff 0.90
done
The number of sequences 2011 and 2014 can be calculated by using grep on header lines in the fasta files:
old=$(grep -c "^>" data/viridiplantae_folds_2011.fasta)
new=$(grep -c "^>" data/viridiplantae_all_2014.fasta)
increase=$(printf %.0f $(echo "100*$new/$old - 100" | bc -l))
echo "Sequences_2011: $old"
echo "Sequences_2014: $new"
echo "Increase: $increase%"
Sequences_2011: | 73879 |
Sequences_2014: | 182505 |
Increase: | 147% |
ATTENTION: You may notice the discrepancy between 73,879 and the 73,853 reported in the publication. The difference of 26 sequences is due to the fact that no taxonomy could be assigned to those 26 sequences at the time of training (of the first reference database). Those sequences have therefore been excluded.
Just to be sure:
printf %.0f%% $(echo "100*182505/73853 - 100" | bc -l)
147%
The number of species can be calculating by counting the lines in *.specis.taxids which is a uniq list.
old=$(cat precomputed/viridiplantae_folds_2011.species.taxids | wc -l)
new=$(cat precomputed/viridiplantae_all_2014.species.taxids | wc -l)
increase=$(printf %d $(echo "100*$new/$old - 100" | bc -l))
echo "Species_2011: $old"
echo "Species_2014: $new"
echo "Increase: $increase%"
Species_2011: | 37403 |
Species_2014: | 72325 |
Increase: | 93% |
To assess the completeness of species and genera in the reference database in respect to known plant species in Bavaria and the USA lists of taxa were obtained from bayernflora.de (<2015-01-30 Fr>) and BISON (<2015-02-13 Fr>). In the analysis folder execute the following commands:
mkdir flora_bavaria flora_usa
cd flora_bavaria
../../code/get_taxa_bayern.sh
cd ../flora_usa
../../code/get_taxa_bison.sh
This generates the following files in analysis/flora_bavaria
- bayern.genus.taxids
- bayern.genus.taxids.tsv
- bayern.genus.txt
- bayern.species.cleaned.taxids
- bayern.species.cleaned.taxids.tsv
- bayern.species.cleaned.txt
- bayern.species.taxids.tsv
- bayern.species.txt
And for each state of the USA the following files in analysis/flora_usa
- <fips>.checklist
- <fips>.genus
- <fips>.genus.taxids
- <fips>.genus.tsv
- <fips>.species
- <fips>.species.taxids
- <fips>.species.tsv
The results may vary depending on the date of data retrieval, therefore those files are included in the precomputed folder.
Attention: There was an issue with the precomputed checklists for bavaria, see: #5. After fixing those and using the most recent version of the bayern-flora checklist (July 2018) you end up with the following numbers:
SPECIES_BAVARIA=$(cat flora_bavaria/bayern.species.cleaned.taxids | wc -l)
COMMON_OLD=$(cat viridiplantae_folds_2011.species.taxids flora_bavaria/bayern.species.cleaned.taxids | sort | uniq -d | wc -l)
COMMON_NEW=$(cat viridiplantae_all_2014.species.taxids flora_bavaria/bayern.species.cleaned.taxids | sort | uniq -d | wc -l)
echo Bavaria Species 2014 $(printf %.1f $(echo "100 * $COMMON_NEW/$SPECIES_BAVARIA" | bc -l))%
echo Bavaria Species 2011 $(printf %.1f $(echo "100 * $COMMON_OLD/$SPECIES_BAVARIA" | bc -l))%
GENERA_BAVARIA=$(cat flora_bavaria/bayern.genus.taxids | wc -l)
COMMON_OLD=$(cat viridiplantae_folds_2011.genus.taxids flora_bavaria/bayern.genus.taxids | sort | uniq -d | wc -l)
COMMON_NEW=$(cat viridiplantae_all_2014.genus.taxids flora_bavaria/bayern.genus.taxids | sort | uniq -d | wc -l)
echo Bavaria Genus 2014 $(printf %.1f $(echo "100 * $COMMON_NEW/$GENERA_BAVARIA" | bc -l))%
echo Bavaria Genus 2011 $(printf %.1f $(echo "100 * $COMMON_OLD/$GENERA_BAVARIA" | bc -l))%
Bavaria | Species | 2014 | 73.6% |
Bavaria | Species | 2011 | 48.7% |
Bavaria | Genus | 2014 | 86.6% |
Bavaria | Genus | 2011 | 71.3% |
To get a list of species and genus coverage for each state execute the following in the analysis folder:
(echo -e "Fips\tSpecState\tSpec2011\tSpec2014\tGenusState\tGenus2011\tGenus2014"
for i in $(seq 1 56)
do
# Excludes 3, 7, 14, 43 and 52.
if [ "$i" -eq 3 ] || [ "$i" -eq 7 ] || [ "$i" -eq 14 ] || [ "$i" -eq 43 ] || [ "$i" -eq 52 ]
then
continue # Those fips are not used
fi
i=$(printf "%02d" $i)
STATE_SPEC=$(cat flora_usa/$i.species.taxids | wc -l)
STATE_GENUS=$(cat flora_usa/$i.genus.taxids | wc -l)
COMMON_SPEC_2011=$(cat viridiplantae_folds_2011.species.taxids flora_usa/$i.species.taxids | sort | uniq -d | wc -l)
COMMON_GENUS_2011=$(cat viridiplantae_folds_2011.genus.taxids flora_usa/$i.genus.taxids | sort | uniq -d | wc -l)
COMMON_SPEC_2014=$(cat viridiplantae_all_2014.species.taxids flora_usa/$i.species.taxids | sort | uniq -d | wc -l)
COMMON_GENUS_2014=$(cat viridiplantae_all_2014.genus.taxids flora_usa/$i.genus.taxids | sort | uniq -d | wc -l)
echo -e "$i\t$STATE_SPEC\t$COMMON_SPEC_2011\t$COMMON_SPEC_2014\t$STATE_GENUS\t$COMMON_GENUS_2011\t$COMMON_GENUS_2014"
done) >flora_usa/states.common.tsv
This creates the file
- flora_usa/states.common.tsv
which is also included in the precomputed/flora_usa folder.
This file is further analysed with R:
data=read.table("states.common.tsv", header=T, sep="\t")
print(summary(data$Spec2014/data$SpecState))
print(summary(data$Genus2014/data$GenusState))
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
---|---|---|---|---|---|---|
Species coverage | 0.665 | 0.750 | 0.761 | 0.756 | 0.766 | 0.791 |
Genera coverage | 0.738 | 0.832 | 0.849 | 0.840 | 0.858 | 0.873 |
The number of genera per order in the old reference database and the new one were calculated with the following commands:
cat viridiplantae_folds_2011.tax | grep "Viridiplantae" | perl -pe 's/.*(o__[^;]+);.*(g__[^;]+);.*/$1\t$2/' | sort -u | grep -v undef | datamash -g 1 count 2 >2011_genera_per_order
cat viridiplantae_all_2014.tax | grep "Viridiplantae" | perl -pe 's/.*(o__[^;]+);.*(g__[^;]+);.*/$1\t$2/' | sort -u | grep -v undef | datamash -g 1 count 2 >2014_genera_per_order
echo -e "Order\ttaxid\told\tnew" >increase_genera_per_order.tsv
join -t$'\t' -a1 2014_genera_per_order 2011_genera_per_order | perl -pe 's/^([^\s]+\t\d+)$/$1\t0/' | perl -F"\t" -ane 'chomp $F[2];print "$F[0]\t$F[2]\t$F[1]\n"' | sed 's/o__//;s/_/\t/' >>increase_genera_per_order.tsv
join -t$'\t' -v2 2014_genera_per_order 2011_genera_per_order | perl -pe 's/\n/\t0\n/;s/o__//;s/_/\t/' >>increase_genera_per_order.tsv
The created files:
are included in the precomputed folder.
Creation of the latex table
cat <<EOF >additional_file2.tex
\documentclass{article}
\usepackage{tabu}
\usepackage{longtable}
\newcolumntype{R}{>{\raggedleft\arraybackslash}X}
\usepackage{booktabs}
\renewcommand{\thetable}{S\arabic{table}}%
\begin{document}
\begin{longtabu}{lXRR}
\caption{Comparison of the number of genera per order for all orders.}\\\\
\toprule
Order & TaxID & Genera old & Genera new \\\\
\midrule
\endhead
EOF
join -t$'\t' -a1 2014_genera_per_order 2011_genera_per_order | perl -pe 's/^([^\s]+\t\d+)$/$1\t0/' | perl -F"\t" -ane 'chomp $F[2];print "$F[0]\t$F[2]\t$F[1]\n"' | sed 's/o__//;s/_/\t/;' | perl -pe 's/\t/ & /g;s/\n/\\\\\n/' >>additional_file2.tex
join -t$'\t' -v2 2014_genera_per_order 2011_genera_per_order | perl -pe 's/\n/\t0\n/;s/o__//;s/_/\t/' | perl -pe 's/\t/ & /g;s/\n/\\\\\n/' >>additional_file2.tex
cat <<EOF >>additional_file2.tex
\bottomrule
\end{longtabu}
\end{document}
EOF
pdflatex additional_file2.tex
pdflatex additional_file2.tex
The created tex file is included in the precomputed folder
The increase of sequences for a number of selected groups can simply be determined by:
cat <<EOF >additional_file3.tex
\documentclass{article}
\usepackage{tabu}
\usepackage{longtable}
\newcolumntype{R}{>{\raggedleft\arraybackslash}X}
\usepackage{booktabs}
\setcounter{table}{1}
\renewcommand{\thetable}{S\arabic{table}}%
\begin{document}
\begin{longtabu}{XRR}
\caption{Comparison of the number of sequences per group for selected taxonomic groups.}\\\\
\toprule
Group & old & new \\\\
\midrule
\endhead
EOF
echo "Vitaceae & "$(grep -c Vitaceae viridiplantae_folds_2011.tax)" & "$(grep -c Vitaceae viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Heracleum} & "$(grep -c Heracleum viridiplantae_folds_2011.tax)" & "$(grep -c Heracleum viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Carduus} & "$(grep -c Carduus viridiplantae_folds_2011.tax)" & "$(grep -c Carduus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Phacelia} & "$(grep -c Phacelia viridiplantae_folds_2011.tax)" & "$(grep -c Phacelia viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Convolvulus} & "$(grep -c Convolvulus viridiplantae_folds_2011.tax)" & "$(grep -c Convolvulus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
echo '\\'"textit{Helianthus} & "$(grep -c Helianthus viridiplantae_folds_2011.tax)" & "$(grep -c Helianthus viridiplantae_all_2014.tax) '\\\\' >>additional_file3.tex
cat <<EOF >>additional_file3.tex
\bottomrule
\end{longtabu}
\end{document}
EOF
pdflatex additional_file3.tex
pdflatex additional_file3.tex
The created tex file is included in the precomputed folder
Create a folder called raw and download data from EBI SRA repository project accession number PRJEB8640. Extract into separate .fastq files (two for each sample). I assume your directory contains all the samples in the following form: <SampleName>_S<SampleNr>_L001_R<1|2>_001.fastq e.g. PoJ1_S1_L001_R1_001.fastq Where R1 is the file containing forward reads and R2 the file containing reverse reads for each sample. This can be accomplished by loading the list of archives from EBI:
wget http://www.ebi.ac.uk/ena/data/warehouse/filereport\?accession\=PRJEB8640\&result\=read_run\&fields\=study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy\&download\=txt -O reads.tsv
This downloads the reads.tsv file which is also included in the precomputed folder. It contains a list of the 384 sequencing libraries of this project. This file can now be used to download all the reads with the following command executed in the raw folder:
for i in $(cut -f13 reads.tsv | grep fastq.gz | perl -pe 's/;/\n/')
do
wget $i
done
gunzip *.gz
# Fix typo - lowercase j in some samples:
rename 's/Poj/PoJ/' *.fastq
Now your folder should contain 768 .fastq files in the format described above.
In the raw folder create a subfolder joined and execute the following commands
qiime
for i in ../*_R1_001.fastq
do
BASE=$(basename $i _R1_001.fastq)
join_paired_ends.py -f $i -r ../${BASE}_R2_001.fastq -o $BASE
done
This creates a folder for each sample in the form <SampleName>_S<SampleNr>_L001 containing three files:
- fastqjoin.join.fastq
- fastqjoin.un1.fastq
- fastqjoin.un2.fastq
In the raw folder create a subfolder filtered and execute the following commands
for i in ../joined/*
do
BASE=$(basename $i)
usearch8.0 -fastq_filter $i/fastqjoin.join.fastq\
-fastq_truncqual 19 -fastq_minlen 150 -fastqout $BASE.q20.fq
done
Now you have one .fq file for each sample in the following form <SampleName>_S<SampleNr>_L001.q20.fq with joined and quality filtered reads.
In the raw folder create a subfolder utax and execute the following commands: You can use viridiplantae_all_2014.utax.udb instead of viridiplantae_all_2014.utax.fa if you generated the udb file in the previous steps.
for i in $(find ../filtered -name "*.fq")
do
BASE=$(basename $i .fq)
usearch8.0 -utax $i -db ../../training/utax/viridiplantae_all_2014.utax.udb\
-utax_rawscore -tt ../../training/utax/viridiplantae_all.utax.tax\
-utaxout $BASE.utax
done
This way you end up with a .utax file for each sample containing the utax classification. Create a subfolder called counts and there execute this:
for i in ../*.utax
do
BASE=$(basename $i .utax)
perl ../../../code/count_taxa_utax.pl --in $i --cutoff 20 >$BASE.count
done
Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:
perl ../../../code/aggregate_counts.pl *.count >utax_aggregated_counts.tsv
perl -i -pe 's/(PoJ\d+)_S\d+_L001\.q20\.count/$1/g' utax_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' utax_aggregated_counts.tsv >utax_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' utax_aggregated_counts.tsv >utax_tax_table
The files
are included in the precomputed folder
In the raw folder create a subfolder rdp and execute the following commands:
for i in $(find ../filtered -name "*.fq")
do
BASE=$(basename $i .fq)
java -jar classifier.jar classify\
--train_propfile ../../training/rdp/rdp_trained/its2.properties\
--outputFile $BASE.rdp $i
done
This way you end up with a .rdp file for each sample containing the RDP classification. Create a subfolder called counts and there execute this:
for i in ../*.rdp
do
BASE=$(basename $i .rdp)
perl ../../../code/count_taxa_rdp.pl --in $i --cutoff 0.85 >$BASE.count
done
Now you have a list of counts per taxon for each sample. To aggregate the counts of all samples into a common matrix and to create files for phyloseq use the following commands:
perl ../../../code/aggregate_counts.pl *.count >rdp_aggregated_counts.tsv
perl -i -pe 's/(PoJ\d+)_S\d+_L001\.q20\.count/$1/g' rdp_aggregated_counts.tsv
perl -pe 's/^([^\t]+)_(\d+)\t/TID_$2\t/' rdp_aggregated_counts.tsv >rdp_otu_table
perl -ne 'if(/^([^\t]+)_(\d+)\t/){print "TID_$2\t"; $tax=$1; $tax=~s/_\d+,/\t/g; $tax=~s/__sub__/__/g; $tax=~s/__super__/__/g; print "$tax\n"; }' rdp_aggregated_counts.tsv >rdp_tax_table
The files
are included in the precomputed folder
The reads are directly counted on the fastq files with the following commands in the analysis folder:
grep -c "^+$" ../raw/*_R1_001.fastq | sed 's/..\/raw\///;s/_R1_001.fastq:/\t/' >read_count_raw.tsv
echo -e "Sum\tMean\tSD\tMedian"
cat read_count_raw.tsv | datamash sum 2 mean 2 sstdev 2 median 2
Sum | Mean | SD | Median |
11624087 | 30271 | 11373 | 30900 |
The created file:
is also included in the precomputed folder.
To get the counts for filtered reads (rare taxa removed) use this R code (in the analysis folder):
library(phyloseq)
data = read.table("utax_otu_table", sep="\t", header=T, row.names=1)
otu = otu_table(data, taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
summary(colSums(otu))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 7 11000 15740 15580 19650 36940
sum(colSums(otu))
# [1] 5984543
sd(colSums(otu))
# [1] 6597.562
Sum | Mean | SD | Median |
5984543 | 15580 | 6598 | 15740 |
The following R code can be used to create the species accumulation curves:
library(vegan)
library(phyloseq)
data = read.table("utax_otu_table", sep="\t", header=T, row.names=1)
map = import_qiime_sample_data("../data/mapFile.txt")
otu = otu_table(data, taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
phy = merge_phyloseq(otu, map)
trunc = subset_samples(phy, BeeSpecies == "H.truncorum")
rufa = subset_samples(phy, BeeSpecies == "O.rufa")
veganotu <- function(physeq) {
require("vegan")
OTU <- otu_table(physeq)
if (taxa_are_rows(OTU)) {
OTU <- t(OTU)
}
return(as(OTU, "matrix"))
}
trunc.v = veganotu(trunc)
rufa.v = veganotu(rufa)
pdf("Figure2.pdf")
par(mfrow =c(1,2))
par(mar=c(3,3,1,1)+0.1, pin= c(2.73, 2.73))
rarecurve(rufa.v, step = 1, xlab = "", ylab = "",label = FALSE, xlim =c(-0.2, 5000), ylim = c(-0.2, 90), lwd = 0.5)
title(ylab = "No. Taxa", line= 2)
title(xlab = "Sequencing Depth [reads]", line = 2)
text(x = 100, y = 87, "a", cex = 2)
rarecurve(trunc.v, step = 1, xlab = "", ylab = "", label = FALSE, xlim = c(-0.2, 5000), ylim = c(-0.2, 90), lwd = 0.5)
title(ylab = "No. Taxa", line= 2)
title(xlab = "Sequencing Depth [reads]", line = 2)
text(x=100, y = 87, "b", cex = 2)
dev.off()
This code executed in the analysis folder compares the assignment of RDP and UTAX on the genus level (ignoring confidence values):
pv rdp/*.rdp | cut -f1,21 | grep -v undef | perl -pe 's/g__//;s/_/\t/' | sort -k 1b,1 >PoJ.genus.rdp
pv utax/*.utax | perl -pe 's/,/\t/g' | cut -f1,7 | grep -v undef | perl -pe 's/g__//;s/\(.*\)//;s/_/\t/' | sort -k 1b,1 >PoJ.genus.utax
printf %.1f%% $(echo "100 * " $(join PoJ.genus.rdp PoJ.genus.utax | cut -f2,4 -d" " | perl -F"\s" -ane '$g++;chomp $F[1];$c++ if($F[0] eq $F[1]);END{$e=$c/$g;print "$c / ( $g + "}') $(join PoJ.genus.rdp PoJ.genus.utax -v1 | wc -l) " + " $(join PoJ.genus.rdp PoJ.genus.utax -v2 | wc -l) ")" | bc -l)
90.3%
The two files:
are also included in the precomputed folder (as gzipped archives).
The file data/genera_flowering contains a list of genera found near the sampling plots. The following R code calculates the fraction of reads in all samples (with rare taxa removed) that belong to genera listed in the genera_flowering file:
library(phyloseq)
otu = otu_table(read.table("utax_otu_table", sep="\t", header=T, row.names=1), taxa_are_rows=T)
otu_rel = transform_sample_counts(otu, function(x) x/sum(x))
# remove rare taxa from each sample in otu
otu_table(otu)[otu_table(otu_rel)<0.001]<-0
tax = tax_table(as.matrix(read.table("utax_tax_table", sep="\t", fill=T, row.names=1)))
otu = merge_phyloseq(otu, tax)
# remove taxa that have only 0 counts after rare filtering and restriction to PoJ
otu_pruned = prune_taxa(rowSums(otu_table(otu))>0, otu)
# accumulate at genus level (ignoring species names)
otu_pruned_glom = tax_glom(otu_pruned, taxrank="V7")
write.table(as.factor(tax_table(otu_pruned_glom)[,6]),file="utax_genera_pruned_glom",quote=F,row.names=F,col.names=F)
flowering = c(read.table("../data/genera_flowering", stringsAsFactors=F))
# Remove undefined genera from the total set
otu_pruned_glom_noundef = subset_taxa(otu_pruned_glom, V7 != "g__undef_")
flowering_genera = tax_table(otu_pruned_glom_noundef)[,6] %in% paste("g__",flowering$V1, sep="")
100 * sum(otu_table(otu_pruned_glom_noundef)[flowering_genera,]) / sum(otu_table(otu_pruned_glom_noundef))
73.7%
As a side product the file:
was created (included in the precomputed folder).
To determine the fraction of documented flowering genera also found in at least one of the samples the following code can be executed (analysis folder):
TOTAL=$(cat ../data/genera_flowering | wc -l)
COMMON=$(cat ../data/genera_flowering <(perl -pe 's/^g__//' utax_genera_pruned_glom | grep -v undef_ | sort -u) | sort | uniq -d | wc -l)
echo $COMMON" / "$TOTAL" = "$(printf %.1f $(echo "100 * "$COMMON" / "$TOTAL | bc -l))"%"
98 / 201 = 48.8%