Tutorial
- 1. Installing CAARS
- 2. Running CAARS: example on mouse test data
  - 2.1. Data acquisition
  - 2.2. Running CAARS
- 3. Going further with CAARS: playing around with CAARS possibilities
  - 3.1. I just have a transcriptome assembly and no RNA-Seq data: can I still use CAARS?
  - 3.2. I want to use CAARS on my RNA-Seq data but I already have a transcriptome: can I use it?
  - 3.3. I want to add new families to a previous caars run: is it possible?
  - 3.4. I have families with less than 3 species: what can I do with them?
  - 3.5. I want to add new samples in a previous CAARS run: is it possible?
  - 3.6. I do not want to build a whole-transcriptome but just use the assisted assembler for specific families: is it possible?
  - 3.7. In CAARS documentation, it is said that we can use multiple guide species and RNA-Seq data coming from different species: how to do this?
  - 3.8. I already have multiple-sequence alignments; I want to use CAARS tools to refine my alignments but I have neither RNA-Seq data nor transcriptome
  - 3.9. I have an error with the normalization step.
  - 3.10. I have an error.
With CAARS you can assemble and annotate transcripts at the same time. The assembly is facilitated by using guide taxa (i.e. sister species, which can be highly divergent). The transcripts are then inserted into user-provided multi-species alignments. Gene trees are subsequently inferred, and the annotation is performed using the phylogenetic information in the trees.
In this tutorial we will assemble and annotate a mouse test RNA-Seq dataset.
1. Installing CAARS
CAARS relies on many dependencies (such as BLAST, Trinity...). To avoid installing all of them on your machine, we suggest using Docker. Docker will create a local environment on your computer that contains all CAARS dependencies; they are all packaged within the CAARS Docker image. Of course, you can also use CAARS without Docker.
If you don't have Docker on your machine, get it here first. (Be aware that the installation differs depending on whether you are a Linux, Mac or Windows user.)
We will use the Docker image named `carinerey/caars`. This image will be run in a Docker container on your local machine (this container is a closed environment where CAARS and all its dependencies are already installed).
To interact with the Docker container environment, you first need to create a shared directory on your machine (it will be used to exchange files between the Docker container and your machine).
# ON YOUR MACHINE
mkdir /home/crey/shared/ # whatever directory
cd /home/crey/shared/
Download the latest versions of the images (or update them) by running:
# ON YOUR MACHINE
# the first image download can take several minutes (around 2 GB)
# loading the image the next times should take a few seconds
docker pull carinerey/caars
docker pull carinerey/caars_tuto
We will now prepare the command lines to use CAARS through its Docker images.
The image `carinerey/caars` contains CAARS and its dependencies, and the image `carinerey/caars_tuto` contains 2 dependencies needed for this tutorial (`sratoolkit` and the Ensembl API).
# ON YOUR MACHINE
export SHARED_DIR=$PWD # We will use the variable $SHARED_DIR as the path shared by your machine and the docker containers
function docker_prep_data_cmd { echo "docker run --rm -u $UID:$UID -v $SHARED_DIR:$SHARED_DIR -w `pwd` carinerey/caars_tuto_prep_data "
}
function docker_caars_cmd { echo "docker run --rm -e LOCAL_USER_ID=`id -u $USER` -v $SHARED_DIR:$SHARED_DIR -w `pwd` carinerey/caars "
}
# "-e LOCAL_USER_ID=`id -u $USER`" will ensure that you have all permissions on files created in the `Docker` container
# "-e SHARED_DIR=$SHARED_DIR" exports the variable $SHARED_DIR in Docker container (that we will use later)
Now, to use CAARS, we only need to type the following (once `$SHARED_DIR` has been defined):
# ON YOUR MACHINE
export SHARED_DIR=$PWD # We will use the variable $SHARED_DIR as the path shared by your machine and the docker containers
`docker_caars_cmd` caars [options]
For example to get the help message:
# ON YOUR MACHINE
`docker_caars_cmd` caars -h
If you get this error message: `docker: Got permission denied while trying to connect to the Docker daemon socket [...]`, you can work around it by running `sudo docker run [...]` as a temporary solution. To fix this issue permanently, see here.
Please note that `SHARED_DIR` must contain an absolute path, as it does here. CAARS builds links with absolute paths, and these links will be broken if you don't use the same directory tree.
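Because a relative path would silently break the links built by CAARS, a quick guard can be run before going further. This is only a sketch: `is_absolute_path` is a hypothetical helper name introduced for illustration, not something shipped with CAARS.

```shell
# Hypothetical helper (not part of CAARS): succeeds only for absolute paths.
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;   # starts with "/": absolute
        *)  return 1 ;;   # anything else (including an empty value): relative
    esac
}

# Example: warn early if SHARED_DIR is unset or relative.
if ! is_absolute_path "${SHARED_DIR:-}"; then
    echo "SHARED_DIR must be an absolute path" >&2
fi
```

Since `SHARED_DIR` is set from `$PWD` in this tutorial, the check should pass; it is mostly useful when the variable is set by hand.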
If you don't want to use Docker and prefer to install CAARS from source, follow these instructions.
To follow the tutorial after installing CAARS and its dependencies, you just have to export the environment variable `$SHARED_DIR` and define the functions `docker_prep_data_cmd` and `docker_caars_cmd` so that they return an empty string:
# ON YOUR MACHINE
mkdir /home/crey/working_dir/ # whatever directory
cd /home/crey/working_dir/
export SHARED_DIR=$PWD # We will use the variable $SHARED_DIR as the path to your working directory
function docker_prep_data_cmd { echo ""
}
function docker_caars_cmd { echo ""
}
2. Running CAARS: example on mouse test data
Now that installation is complete, we can run CAARS.
2.1. Data acquisition
We first need to collect the input data. CAARS needs 5 input items (a full description of each item is given below):
- RNA-Seq data from one or several species
- a sample sheet file describing RNA-Seq data
- a directory with gene family alignments
- a directory with species-sequence map files
- a species tree file
For each item, you can either download the prebuilt data directly or build the data yourself (when possible).
RNA-Seq data in CAARS can come from one or several model or non-model species and can be single- or paired-end (stranded or not). Input files must be in `fastq` or `fasta` format. In this tutorial, we will use a subset of a paired-end unstranded RNA-Seq library from Mus musculus kidney (available in the SRA database from Fushan AA et al. (2015)):
Get the prebuilt data
To get the test data, simply run:
# ON YOUR MACHINE
export DATA_DIR=$SHARED_DIR/data/ # create a variable DATA_DIR which contains the path to all input data
mkdir -p $DATA_DIR
cd $DATA_DIR
# download and unzip data in DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/rna_seq.tar.gz && tar xvzf rna_seq.tar.gz
This dataset is just a subset of the SRA `SRR636918` archive. It is more than enough for this tutorial.
Get the data yourself
To get the SRA archives, we will use `fastq-dump` from `sra-toolkit`.
Note that `sratoolkit` is already installed in the `carinerey/caars_tuto` image but not in the `carinerey/caars` image.
If you want to download just a subset of an archive, you can use the `-N` (first spot to download) and `-X` (last spot to download) options of `fastq-dump`.
For example:
export DATA_DIR=$SHARED_DIR/data/
mkdir -p $DATA_DIR
# download data
cd $DATA_DIR
mkdir -p rna_seq
for SRR in SRR636918 SRR636917 SRR636916
do
## Download a subset of the archive $SRR (spots 100000 to 200000: -N is the first spot, -X the last one)
`docker_prep_data_cmd` fastq-dump --split-files -O rna_seq --defline-seq "@\$ac_\$si/\$ri" --defline-qual "+" -M 51 -N 100000 -X 200000 $SRR
done
Note that 2 files will be created for each archive.
If you want to retrieve the full archives to reproduce the analyses of the paper, you must download SRR636917 and SRR636916 in addition to SRR636918.
Run:
export DATA_DIR=$SHARED_DIR/data/
mkdir -p $DATA_DIR
# download data
cd $DATA_DIR
mkdir -p rna_seq
for SRR in SRR636918 SRR636917 SRR636916
do
## Download the whole archive $SRR
`docker_prep_data_cmd` fastq-dump --split-files -O rna_seq --defline-seq "@\$ac_\$si/\$ri" --defline-qual "+" -M 51 $SRR
done
Now, we concatenate the reads of all archives into one pair of files named after the species:
cd $DATA_DIR
cat rna_seq/SRR*_1.fastq > rna_seq/Mus_musculus_1.fq
cat rna_seq/SRR*_2.fastq > rna_seq/Mus_musculus_2.fq
This file contains the information describing the RNA-Seq data and associates each sample file with some characteristics. It is composed of 11 tab-delimited columns:
- Sample ID: a unique identifier describing the sample (such as `CMM` for CAARS Mus Musculus). This will be the prefix of all sequences.
- Sample species name (here, `Mus_musculus`)
- Sample group: the group the sample belongs to (here, `g1`). All samples of a group are processed at the same time in the assisted assembly, but the samples are not merged (see the assisted assembly documentation).
- Reference species name: guide species used to annotate the sample (if several, separated by commas) (here, `Homo_sapiens`)
- Path to a SE RNA-Seq file: `/path/to/my/SE/single.fastq` or `/path/to/my/SE/single.fasta` (for PE data, write `-`). The format is detected automatically from the extension, which must be `.fasta`, `.fa`, `.fastq` or `.fq`.
- Path to a left PE RNA-Seq file: `/path/to/my/PE/data.R1.fastq` or `/path/to/my/PE/data.R1.fasta` (for SE data, write `-`). The format is detected automatically from the extension, which must be `.fasta`, `.fa`, `.fastq` or `.fq`.
- Path to a right PE RNA-Seq file: `/path/to/my/PE/data.R2.fastq` or `/path/to/my/PE/data.R2.fasta` (for SE data, write `-`). The format is detected automatically from the extension, which must be `.fasta`, `.fa`, `.fastq` or `.fq`.
- Strand and type of the RNA-Seq run: `F`, `R`, `RF`, `FR`, `US` or `UP`
  - `US`/`UP`: unstranded SE/PE
  - `F`/`R`: SE stranded Forward/Reverse
  - `RF`/`FR`: PE stranded
- Run the standard pipeline on the data: `Yes` or `No`. If `Yes` (typical case), this builds a draft whole-transcriptome assembly (if not provided in column 10) and dispatches sequences into families according to sequence similarities (see CAARS overview).
- Path to a given draft assembly: `/path/to/my/fasta/draft/assembly.fa` (UTRs have to be removed, CDS only!)
- Run the assisted assembly on the data: `Yes` or `No`. If `Yes` (typical case), this builds an assisted assembly by gene family, using guide sequences from the given gene family as baits (see CAARS overview).
Column order is important and column names are required.
Here is an example table:
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens | - | /home/crey/shared/data/rna_seq/Mus_musculus_1.fq | /home/crey/shared/data/rna_seq/Mus_musculus_2.fq | UP | yes | - | yes |
You can copy this example table or get it from here and put it in `$DATA_DIR`.
Get the prebuilt data
cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.tsv
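Since column order and column names both matter, a hand-edited sample sheet can be checked before it is given to CAARS. The sketch below is not a CAARS tool: `check_sample_sheet` is a hypothetical helper, and the expected column names are simply copied from the example table above.

```shell
# Hypothetical helper (not part of CAARS): checks that every line of a sample
# sheet has 11 tab-delimited columns and that the header uses the column names
# of the example table.
check_sample_sheet() {
    awk -F '\t' '
        BEGIN {
            bad = 0
            n = split("id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram", names, " ")
        }
        NF != n {
            printf "line %d: expected %d tab-delimited columns, found %d\n", NR, n, NF
            bad = 1
        }
        NR == 1 {
            for (i = 1; i <= n; i++)
                if ($i != names[i]) {
                    printf "line 1: column %d should be named %s\n", i, names[i]
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1" >&2
}
```

Running `check_sample_sheet $DATA_DIR/sample_sheet.tsv` prints nothing and returns 0 when the layout matches; problems are reported on standard error.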
2.1.3. Prepare 2 directories (i) with gene family alignments and (ii) with species-sequence map files
Here we retrieve the existing gene families that will be used in CAARS. In this tutorial, we will use some gene families from the Ensembl Compara dataset (but in real life you can use other nucleotide-level multi-species alignments, from public or in-house databases). To speed up the process, in this tutorial we subset the Compara dataset to keep only 13 mammal representatives (we deliberately discarded the mouse).
Common name | Taxon ID | Scientific name |
---|---|---|
Armadillo | 9361 | Dasypus_novemcinctus |
Elephant | 9785 | Loxodonta_africana |
Pig | 9823 | Sus_scrofa |
Sheep | 9940 | Ovis_aries |
Microbat | 59463 | Myotis_lucifugus |
Cat | 9685 | Felis_catus |
Ferret | 9669 | Mustela_putorius_furo |
Bushbaby | 30611 | Otolemur_garnettii |
Marmoset | 9483 | Callithrix_jacchus |
Human | 9606 | Homo_sapiens |
Rabbit | 9986 | Oryctolagus_cuniculus |
Guinea Pig | 10141 | Cavia_porcellus |
Squirrel | 43179 | Ictidomys_tridecemlineatus |
Get the prebuilt data
The fastest way to get the subset of families with only sequences from the 13 species listed above (useful for this tutorial) is to run:
cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/MSA.tar.gz && tar xvzf MSA.tar.gz
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/Seq2SpTable.tar.gz && tar xvzf Seq2SpTable.tar.gz
Warning: Alignments must only contain "ATGCNUWSMKRYBDHV-" characters.
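The character restriction above can be checked before a long run. The following sketch assumes that the alignments are FASTA files (header lines starting with `>`), which this page does not state explicitly; `check_alignment_chars` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of CAARS). Assumption: the alignment is a
# FASTA file. Returns 1 if any sequence line contains a character outside
# the allowed set "ATGCNUWSMKRYBDHV-".
check_alignment_chars() {
    # The first grep drops the header lines; the second one looks for any
    # character that is not in the allowed set (the trailing "-" is literal).
    if grep -v '^>' "$1" | grep -q '[^ATGCNUWSMKRYBDHV-]'; then
        return 1
    fi
    return 0
}
```

For example, `for f in $DATA_DIR/MSA/*; do check_alignment_chars "$f" || echo "$f"; done` would list the files to inspect. Note that the check is case-sensitive, like the list given in the warning.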
Get the data yourself
If you want to use other gene families, you can also download them directly with the Ensembl API, using some scripts (see below).
Note that the Ensembl API is already installed in the `carinerey/caars_tuto` image but not in the `carinerey/caars` image.
(See here for more information on the Ensembl API.)
First, create a directory to keep the scripts.
export SCRIPTS_DIR=$SHARED_DIR/scripts/
mkdir -p $SCRIPTS_DIR
The script `Get_MSAs_EnsemblCompara.pl` is available here.
cd $SCRIPTS_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/src/Get_MSAs_EnsemblCompara.pl
# To get given families
cd $DATA_DIR
`docker_prep_data_cmd` perl $SCRIPTS_DIR/Get_MSAs_EnsemblCompara.pl -s "Dasypus_novemcinctus,Loxodonta_africana,Sus_scrofa,Ovis_aries,Myotis_lucifugus,Felis_catus,Mustela_putorius_furo,Otolemur_garnettii,Callithrix_jacchus,Homo_sapiens,Oryctolagus_cuniculus,Cavia_porcellus,Ictidomys_tridecemlineatus" -f ENSGT00550000074800,ENSGT00550000074846,ENSGT00870000136549 -g "Homo_sapiens" -o $DATA_DIR
# To get all families as in the paper
`docker_prep_data_cmd` perl $SCRIPTS_DIR/Get_MSAs_EnsemblCompara.pl -s "Dasypus_novemcinctus,Loxodonta_africana,Sus_scrofa,Ovis_aries,Myotis_lucifugus,Felis_catus,Mustela_putorius_furo,Otolemur_garnettii,Callithrix_jacchus,Homo_sapiens,Oryctolagus_cuniculus,Cavia_porcellus,Ictidomys_tridecemlineatus" -f all -g "Homo_sapiens" -o $DATA_DIR
After running this script, 2 directories are created that contain (i) the gene family alignments (in `$DATA_DIR/MSA`) and (ii) the sequence-species maps (in `$DATA_DIR/Seq2SpTable`).
We finally need one last input file for CAARS: the species tree file. Here, we take the Ensembl Compara species tree and keep only the 14 species of interest (the 13 species listed above plus mouse).
      /-Loxodonta_africana
   /-|
  |   \-Dasypus_novemcinctus
  |
  |            /-Mustela_putorius_furo
  |         /-|
  |      /-|   \-Felis_catus
--|     |  |
  |   /-|   \-Myotis_lucifugus
  |  |  |
  |  |  |   /-Ovis_aries
  |  |   \-|
  |  |      \-Sus_scrofa
  |  |
   \-|         /-Homo_sapiens
     |      /-|
     |   /-|   \-Callithrix_jacchus
     |  |  |
     |  |   \-Otolemur_garnettii
     |  |
      \-|         /-Mus_musculus
        |      /-|
        |   /-|   \-Ictidomys_tridecemlineatus
        |  |  |
         \-|   \-Cavia_porcellus
           |
            \-Oryctolagus_cuniculus
Get the prebuilt data
The species tree is available here; alternatively, run:
cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/species.tree
2.2. Running CAARS
Checking the input data first is useful: it avoids running CAARS for hours before hitting an error (e.g. because of a misspelled file name).
export RUN_DIR=$SHARED_DIR/CAARS_dir
export OUT_DIR=$SHARED_DIR/CAARS_dir/output
mkdir -p $RUN_DIR $OUT_DIR
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --just-parse-input
No error message should appear; if one does, it should guide you towards the problematic input file(s). If the problem persists, please contact me: carine.rey [at] ens-lyon.fr.
You will find some preliminary statistics files in the output directory:
DetectedFamilies.txt
FamilyMetadata.txt
SpeciesMetadata.txt
UsableFamilies.txt
UsedFamilies.txt
See the output description (2.2.2.1) and section 3.4 for the meaning and usage of these files.
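Misspelled file names can also be caught outside CAARS. The sketch below lists the files referenced in a sample sheet that do not exist; it relies only on the column layout described in section 2.1 (paths in columns 5, 6, 7 and 10, `-` meaning "no file"). `check_sample_paths` is a hypothetical helper, not a CAARS command.

```shell
# Hypothetical helper (not part of CAARS): reports every path of a sample
# sheet (columns 5, 6, 7 and 10) that does not point to an existing file.
check_sample_paths() {
    # Skip the header line, then print the four path columns, one per line.
    awk -F '\t' 'NR > 1 { print $5; print $6; print $7; print $10 }' "$1" |
    (
        status=0
        while IFS= read -r path; do
            [ "$path" = "-" ] && continue   # "-" means that no file is given
            if [ ! -f "$path" ]; then
                echo "missing file: $path" >&2
                status=1
            fi
        done
        exit "$status"   # leaves only this subshell; becomes the return status
    )
}
```

When Docker is used, the paths must be checked as they are seen from inside the container, which is why the tutorial keeps everything under `$SHARED_DIR`.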
You can now run CAARS, hurrah! \o/ This should take several minutes on your machine.
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50
`--mpast` is the most useful CAARS option. It stands for "minimum percentage of alignment on the sister taxon". If it is set to `50`, every sequence whose coverage is less than 50% of its sister sequence in the gene tree is discarded from the output. This option filters out low-quality sequences.
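To make the threshold concrete, here is a toy illustration of the comparison described above. It is not CAARS code and it does not claim to reproduce how CAARS measures coverage: `passes_mpast` and its three arguments (aligned length, sister sequence length, threshold) are assumptions made for the example.

```shell
# Toy illustration (not CAARS code): a sequence is kept when
# aligned_length / sister_length >= threshold / 100.
passes_mpast() {
    aligned_length=$1
    sister_length=$2
    threshold=$3
    # Cross-multiply to stay in integer arithmetic and avoid rounding.
    [ $((100 * aligned_length)) -ge $((threshold * sister_length)) ]
}

# 600 aligned positions against a 1000-position sister sequence: 60%.
if passes_mpast 600 1000 50; then echo "kept"; else echo "discarded"; fi
```

With a threshold of 50, a sequence covering 400 of 1000 positions (40%) would be discarded, while one covering exactly 500 positions would be kept, since only coverage below the threshold is filtered.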
For more information on CAARS options, run `caars -h`.
Note that CAARS "remembers" what has already been run. In the directory where you launched CAARS, a directory called `_caars` has been created. `_caars` contains all cache files, ordered and named by their ID. IDs are assigned automatically and are not "user-understandable" (for instance: 2dc0b68e18f6838b8969cee52d9726c4, 5f5d716bb48acf278015199e73914f5f ...). This data structure makes it possible to check whether a step was already completed in a previous run; if the corresponding cache file(s) exist, CAARS will not run the step(s) again.
CAARS results are stored in the directory specified by `--outdir`. All files in this output directory are in fact links to the corresponding files in the `_caars` directory (you don't need to access the `_caars` directory, but it must be in the directory shared between the Docker container and your local machine).
In the `outdir` you will find 4 directories and 7 files. All these directories are created at the very end of the CAARS process, so do not worry if no file appears in the output when launching CAARS. Only 3 of the output directories are of interest for the user.
- Files:
  - `all_fam.seq2sp.tsv`: map between sequence names and species (for all families). It can be used as input of a later CAARS run if you keep only the first 2 columns (`cut -f 1-2 all_fam.seq2sp.tsv > input_next_run.seq2sp.txt`). A tabular file with 4 columns:
    - a. Sequence name
    - b. Species name
    - c. Gene family name
    - d. Name of the biggest set of orthologs containing this sequence in this family
  - `all_fam.orthologs.tsv`: map between all sequence names and their orthologs. A tabular file with the following columns:
    - a. Sequence name
    - b. Closest orthologs of this sequence
  - `DetectedFamilies.txt`: all families detected in the `alignment-dir`.
  - `FamilyMetadata.txt`: metadata about each detected family (# of sequences and species, and whether it is usable in CAARS).
  - `SpeciesMetadata.txt`: metadata about each species (# of sequences and families where it is present)
  - `UsableFamilies.txt`: list of the families usable in CAARS (# of sequences >= 3 and # of species >= 3).
  - `UsedFamilies.txt`: all families used in this CAARS run (by default, all detected families, but you can reduce the list by using `--family-subset`)
- Directories:
  - `assembly_results_by_fam`: contains several subdirectories, each holding one type of result for every gene family:
    - a. `MSA_out`: final MSAs (by family)
    - b. `GeneTreeReconciled_out`: final reconciled gene trees (by family)
    - c. `GeneTree_out`: final gene trees before reconciliation (by family)
    - d. `Orthologs_out`: ortholog relationships defined at the reconciliation step (by family)
    - e. `DL_out`: duplication/loss events defined at the reconciliation step (by family)
    - f. `FilterSummary_out`: filtering summary (stats and sequences discarded by the `--mpast` option) (by family)
  - `draft_assemblies`: contains draft assemblies, i.e. raw sequences and CDS for each target species
  - `assembly_results_only_seq`: contains all sequences reconstructed by CAARS (concatenated in a single file for each species) and a table with the annotation by family
Note that another subdirectory is created in the `outdir` directory: `outdir/_files` (it contains links to files in the `_caars` directory). It is not useful for the user.
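The tabular outputs are easy to summarise with standard tools. As a sketch, the function below counts the sequences of each species in `all_fam.seq2sp.tsv`, using the column layout listed above (species name in column 2). It assumes that the file has no header line, which is not stated on this page; `count_sequences_per_species` is a hypothetical name.

```shell
# Hypothetical helper (not part of CAARS). Assumption: no header line.
# Prints "species<TAB>number of sequences", sorted by species name.
count_sequences_per_species() {
    awk -F '\t' '
        { n[$2]++ }                                          # column 2: species name
        END { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$1" | sort
}
```

A typical call would be `count_sequences_per_species $OUT_DIR/all_fam.seq2sp.tsv`.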
You've now finished the CAARS "basic" tutorial. You reconstructed a mouse transcriptome from RNA-Seq data; in addition, these transcripts have been assigned to gene family alignments that you can use directly for comparative phylogenomic analyses. Finally, gene trees integrating your newly assembled transcripts are available for each family.
2.2.2.3. I used Docker to run CAARS: can I exit the Docker container without losing my results? Where are the results on my local machine?
Don't worry, your results are on your machine too. They are in the directory specified by the option `--outdir` (which is either the directory shared between your computer and the Docker container, or one of its subdirectories). The `--outdir` directory does not contain the "real" results but rather links to the `_caars` directory (where the "real" results are stored). If you wish to copy the `--outdir` directory, you must first resolve these links. For example, use `cp -rL /home/crey/shared/outdir/ /home/crey/new/location/for/outdir`.
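The copy can be wrapped in a small function that also verifies that no link survived. This is a sketch with a hypothetical name (`copy_outdir`); it only combines `cp -rL` with a `find` check and assumes that the destination does not exist yet (otherwise `cp` copies the source inside it).

```shell
# Hypothetical helper (not part of CAARS): copies an output directory while
# replacing every symbolic link by the file it points to.
copy_outdir() {
    # -r copies recursively, -L follows symbolic links and copies their targets.
    cp -rL "$1" "$2" || return 1
    # Report any symbolic link left in the copy (there should be none).
    if [ -n "$(find "$2" -type l)" ]; then
        echo "links remain in $2" >&2
        return 1
    fi
    return 0
}
```

The copy is then independent of the `_caars` directory and can be moved to another machine.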
3. Going further with CAARS: playing around with CAARS possibilities
The "basic" tutorial above covers the typical use of CAARS, starting from RNA-Seq data only. However, you may already have an assembly or existing alignments on your local machine. Below are examples of CAARS usage in these cases. We will use the environment and the data of the "basic" tutorial above; we will only change `$OUT_DIR`:
export OUT_DIR=$SHARED_DIR/CAARS_dir/output_going_further
mkdir -p $OUT_DIR
Of note: you might already have some files on your computer that you would like to access from the Docker container. These files must be in the shared directory (such as `/home/crey/shared/`) or one of its subdirectories. For example, RNA-Seq data would be in `/home/crey/shared/rna_seq/`, the gene family alignments would be in the directory `/home/crey/shared/gene/family/alignments/`... If you run CAARS without Docker, this is not necessary.
(Remember that a CAARS command line needs at least 4 arguments that specify the input directories or files: `--sample-sheet`, `--species-tree`, `--alignment-dir` and `--seq2sp-dir`.)
3.1. I just have a transcriptome assembly and no RNA-Seq data: can I still use CAARS?
Yes, you can. If you don't have RNA-Seq data or simply don't want to use it, you can add your own transcriptome to the multiple-sequence alignments.
For demonstration purposes, you can get such a transcriptome here (small fasta file) and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/draft_transcriptome.fa
Note that the sequences in `draft_transcriptome.fa` must be CDSs. UTRs will not be trimmed by CAARS.
Here is the sample sheet associated with this transcriptome (you might have to adapt it to your data).
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens | - | - | - | - | yes | /home/crey/shared/data/draft_transcriptome.fa | no |
You can copy it or get it from here and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.only_draft_transcriptome.tsv
The next step is to run CAARS, and you're done! The CAARS output will be in `$OUT_DIR/only_draft_transcriptome`.
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/only_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.only_draft_transcriptome.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50
3.2. I want to use CAARS on my RNA-Seq data but I already have a transcriptome: can I use it?
Yes. Providing a transcriptome assembly along with the RNA-Seq data will speed up CAARS (compared to providing RNA-Seq data only): the draft assembly step, which is normally part of CAARS, is bypassed. CAARS will also refine your transcriptome with the help of the RNA-Seq data during the assisted assembly step.
For demonstration purposes, you can get such a transcriptome here (small fasta file) and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/draft_transcriptome.fa
If you want your assembly file to be readable in the Docker container, it must be in the shared directory (such as `/home/crey/shared/`) or one of its subdirectories. Note that the sequences in `draft_transcriptome.fa` must be CDSs. UTRs will not be trimmed by CAARS.
Here is the sample sheet associated with this transcriptome (you might have to adapt it to your data).
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens | - | /home/crey/shared/data/rna_seq/Mus_musculus_1.fq | /home/crey/shared/data/rna_seq/Mus_musculus_2.fq | UP | yes | /shared/directory/data/draft_transcriptome.fa | yes |
You can copy this table or get it from here and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.with_draft_transcriptome.tsv
The next step is to run CAARS, and you're done! The CAARS output will be in `$OUT_DIR/with_draft_transcriptome`.
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/with_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.with_draft_transcriptome.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50
3.3. I want to add new families to a previous caars run: is it possible?
This is possible only if you prepared your data accordingly before the first run.
If the content of the `alignment-dir` and/or the `seq2sp-dir` changes, CAARS will interpret it as new data.
So, to work around this behavior, you must:
- put all your families in the `alignment-dir` and the `seq2sp-dir` from the first run on
- use the option `--family-subset` with a file containing the names of the families of the first subset
- add the new family names to this file before the next run
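The last step of the list above can be scripted. The sketch below appends family names to a subset file without creating duplicates. It assumes that the file given to `--family-subset` lists one family name per line, which this page does not spell out; `add_families` is a hypothetical helper.

```shell
# Hypothetical helper (not part of CAARS). Assumption: the family subset file
# contains one family name per line.
# Usage: add_families SUBSET_FILE FAMILY [FAMILY...]
add_families() {
    subset_file=$1
    shift
    touch "$subset_file"   # create the file on first use
    for family in "$@"; do
        # -x: whole-line match, -F: literal string, -q: no output
        grep -qxF -e "$family" "$subset_file" || printf '%s\n' "$family" >> "$subset_file"
    done
}
```

For instance, `add_families $DATA_DIR/families.txt ENSGT00870000136549` would extend a first subset before relaunching CAARS with `--family-subset $DATA_DIR/families.txt` (the file name `families.txt` is only an example).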
3.4. I have families with less than 3 species: what can I do with them?
CAARS can't annotate sequences from a family with less than 3 species because the reconciliation step can't run with less than 3 species.
However, the sequences of these families carry information that is useful when checking which family the sequences from other families belong to.
To keep this information, leave these families in the CAARS sequence database but don't use them to assemble and annotate sequences. To do so, give these families as input to CAARS in the `alignment-dir` and the `seq2sp-dir`, but exclude them from the "Usable" families with the `--family-subset` option.
To get the "Usable" families of your `alignment-dir`, you can first run CAARS on your data with the option `--just-parse-input` and look at the file `UsableFamilies.txt` in the output directory.
Copy this file to your `$DATA_DIR` and use it in your complete CAARS command line with the option `--family-subset`.
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/with_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50 --family-subset $DATA_DIR/UsableFamilies.txt
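The usability rule quoted in the output description (at least 3 sequences and at least 3 species) can also be tested by hand on a single family. The sketch below assumes that a species-sequence map file has two tab-delimited columns, sequence name then species name, as suggested by the `cut -f 1-2` example above; `is_usable_family` is a hypothetical helper, and the file produced by `--just-parse-input` remains the reference.

```shell
# Hypothetical helper (not part of CAARS). Assumption: the map file has two
# tab-delimited columns, sequence name then species name.
# Succeeds when the family has >= 3 sequences and >= 3 distinct species.
is_usable_family() {
    awk -F '\t' '
        NF >= 2 { sequences++; species[$2] = 1 }
        END {
            n = 0
            for (s in species) n++          # number of distinct species
            exit !(sequences >= 3 && n >= 3)
        }
    ' "$1"
}
```

A family whose file lists three sequences from only two species would be reported as not usable.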
3.5. I want to add new samples in a previous CAARS run: is it possible?
Yes. If you want to add samples to a previous CAARS run, you have to add one new line per sample to the sample sheet.
For example, if you want to add sequences from another species (Meriones_auratus), the sample sheet will look like this:
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens,Cavia_porcellus | - | /home/crey/shared/data/rna_seq/Mus_musculus_1.fq | /home/crey/shared/data/rna_seq/Mus_musculus_2.fq | UP | yes | - | yes |
CMA | Meriones_auratus | g2 | Homo_sapiens,Cavia_porcellus | - | /home/crey/shared/data/rna_seq/Meriones_auratus_1.fq | /home/crey/shared/data/rna_seq/Meriones_auratus_2.fq | UP | yes | - | yes |
Note: the assisted assembly is run on each family for all samples that share the same guide species and the same group at the same time. So here, if CAARS has already been run on Mus_musculus and you don't want to run the assisted assembly on Mus_musculus again, you have to give Meriones_auratus a different group (`g2`) in order to dissociate the assisted assembly of these 2 samples. This way, CAARS will run the assisted assembly on Meriones_auratus independently and will reuse the output of the previous run for Mus_musculus.
3.6. I do not want to build a whole-transcriptome but just use the assisted assembler for specific families: is it possible?
This is possible; it will bypass the standard assembly process. You still need RNA-Seq data.
Here is the sample sheet that you must use (note the `no` in the `run_standard` column):
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens | - | /home/crey/shared/data/rna_seq/Mus_musculus_1.fq | /home/crey/shared/data/rna_seq/Mus_musculus_2.fq | UP | no | - | yes |
You can copy this table or get it from here and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.only_assisted.tsv
The next step is to run CAARS, and you're done! The CAARS output will be in `$OUT_DIR/only_assisted`.
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/only_assisted --sample-sheet $DATA_DIR/sample_sheet.only_assisted.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50
3.7. In CAARS documentation, it is said that we can use multiple guide species and RNA-Seq data coming from different species: how to do this?
Here is an example sample sheet that you can use. The `species` column refers to the RNA-Seq data, and the `ref_species` column to the guide species. To use RNA-Seq data from multiple species as input, you need one row per species. Guide species can be different for every input species. Each line is independent from the others, so the values of the `run_standard`, `path_assembly` and `run_apytram` columns can be combined freely.
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
CMM | Mus_musculus | g1 | Homo_sapiens,Cavia_porcellus | - | /home/crey/shared/data/rna_seq/Mus_musculus_1.fq | /home/crey/shared/data/rna_seq/Mus_musculus_2.fq | UP | yes | - | yes |
CTT | Tursiops_truncatus | g2 | Sus_scrofa | - | /home/crey/shared/data/rna_seq/Tursiops_truncatus_1.fq | /home/crey/shared/data/rna_seq/Tursiops_truncatus_2.fq | UP | no | - | yes |
3.8. I already have multiple-sequence alignments; I want to use CAARS tools to refine my alignments but I have neither RNA-Seq data nor transcriptome
You just have to feed CAARS with an empty sample sheet (only the column names) and run CAARS as usual (remember that your MSAs must be in the directory specified by the option `--alignment-dir`).
id | species | group | ref_species | path_fastx_single | path_fastx_left | path_fastx_right | orientation | run_standard | path_assembly | run_apytram |
---|---|---|---|---|---|---|---|---|---|---|
You can copy it or get it from here and put it in `$DATA_DIR`.
cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.nornaseq.tsv
You can then run CAARS:
cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/nornaseq --sample-sheet $DATA_DIR/sample_sheet.nornaseq.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5
3.9. I have an error with the normalization step.
Sometimes, when you use CAARS with several samples, the script running the normalization step may fail because of memory usage: normalization is one of the first steps of the pipeline, and all samples are processed at the same time.
-> Try to reduce the available memory and launch the CAARS command line again. Only the samples with an error will be run again.
3.10. I have an error.
First, if there is an error message and you can fix its cause, fix it and launch CAARS again with the same command line.
Otherwise, try to launch CAARS again with the same command line (sometimes the error comes from a conflict between some of the numerous steps running at the same time).
If the error persists and you don't understand it, don't hesitate to contact me at carine.rey [at] ens-lyon.fr.
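Relaunching the same command line can be automated with a small generic wrapper. The function below is a sketch and not a CAARS feature (`retry` is a hypothetical name): it runs a command until it succeeds or until a maximum number of attempts is reached. Because CAARS caches the steps that already completed, each new attempt should only redo the steps that failed.

```shell
# Hypothetical helper (not part of CAARS).
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
    max_attempts=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "still failing after $attempt attempt(s)" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "attempt $attempt of $max_attempts..." >&2
    done
    return 0
}
```

With the functions defined in section 1, a call could look like `retry 3 $(docker_caars_cmd) caars [options]`. If the same error comes back at every attempt, read the message rather than increasing the number of attempts.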