Tutorial

Carine edited this page Jan 29, 2018 · 103 revisions

With CAARS, you can assemble and annotate transcripts at the same time. The assembly is facilitated by guide taxa (i.e. sister species, which can be highly divergent). The transcripts are then inserted into user-provided multi-species alignments. Gene trees are subsequently inferred, and the annotation is performed using the phylogenetic information in these trees.

In this tutorial we will assemble and annotate a mouse test RNA-Seq dataset.

1. Installing CAARS

CAARS has many dependencies (such as BLAST, Trinity...). To avoid installing all of them on your machine, we suggest using Docker. Docker will create a local environment on your computer that contains all CAARS dependencies, packaged within the CAARS Docker image. Of course, you can also use CAARS without Docker.

1.1 Using Docker: the easy way (several seconds to a few minutes)

If you don't have Docker on your machine, get it here first. (Be aware that installation might differ if you're a Linux, a Mac or a Windows user.)

We will use the Docker image named carinerey/caars. This image will be run in a Docker container on your local machine (this container is a closed environment where CAARS and all its dependencies are already installed).

In order to interact with the Docker container environment, you first need to create a shared directory on your machine (that will be used for the interaction between the Docker container and your machine).

# ON YOUR MACHINE

mkdir /home/crey/shared/     # whatever directory
cd /home/crey/shared/ 

Download the latest versions of the images (or update them) by running:

# ON YOUR MACHINE
# the first image download can take several minutes (around 2 GB)
# later runs should load the image in a few seconds
docker pull carinerey/caars
docker pull carinerey/caars_tuto

We will now prepare the command lines used to run CAARS through its Docker image. The image carinerey/caars contains CAARS and its dependencies, and the image carinerey/caars_tuto contains 2 dependencies needed for this tutorial (sratoolkit and the Ensembl API).

# ON YOUR MACHINE

export SHARED_DIR=$PWD      # We will use the variable $SHARED_DIR as the path shared by your machine and the docker containers

function docker_prep_data_cmd { echo "docker run --rm -u $UID:$UID -v $SHARED_DIR:$SHARED_DIR -w `pwd` carinerey/caars_tuto_prep_data "
}

function docker_caars_cmd { echo "docker run --rm -e LOCAL_USER_ID=`id -u $USER` -v $SHARED_DIR:$SHARED_DIR -w `pwd` carinerey/caars "
}

# "-e LOCAL_USER_ID=`id -u $USER`"    will ensure that you have all permissions on files created in the `Docker` container
# "-e SHARED_DIR=$SHARED_DIR"         exports the variable $SHARED_DIR in Docker container (that we will use later)

Now, to use CAARS, we only need to type (once $SHARED_DIR has been defined):

# ON YOUR MACHINE
export SHARED_DIR=$PWD      # We will use the variable $SHARED_DIR as the path shared by your machine and the docker containers
`docker_caars_cmd` caars [options]

For example to get the help message:

# ON YOUR MACHINE
`docker_caars_cmd` caars  -h
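Because docker_caars_cmd only prints a command prefix, you can inspect the full command line before anything is executed. The sketch below redefines the same wrapper with $(...) instead of backticks (the two forms are equivalent here); preview_caars_cmd is a hypothetical helper name introduced for illustration, not part of CAARS.

```shell
# ON YOUR MACHINE
SHARED_DIR="${SHARED_DIR:-$PWD}"   # falls back to the current directory if unset

# Same wrapper as above, written with $(...) instead of backticks.
docker_caars_cmd() {
    echo "docker run --rm -e LOCAL_USER_ID=$(id -u) -v $SHARED_DIR:$SHARED_DIR -w $(pwd) carinerey/caars "
}

# Hypothetical helper: print the full command line without running it.
preview_caars_cmd() {
    echo "$(docker_caars_cmd)caars $*"
}

preview_caars_cmd -h
```

Checking the printed -v and -w values is a quick way to confirm that the container will see the directory you expect.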

If you get this error message: docker: Got permission denied while trying to connect to the Docker daemon socket [...] , you can solve it by running sudo docker run [...] as a temporary solution. To fix this issue permanently, see here.

Please note that SHARED_DIR must contain an absolute path, as it does here. CAARS builds links with absolute paths, and these links will be broken if you don't use the same directory tree.
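If you set SHARED_DIR by hand rather than from $PWD, a small guard can catch a relative path early. This is only a sketch: is_absolute_path is a hypothetical helper, not a CAARS command.

```shell
# Hypothetical helper (not part of CAARS): succeed only if the path is absolute.
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Fall back to the current directory if SHARED_DIR has not been exported yet.
SHARED_DIR="${SHARED_DIR:-$PWD}"

if is_absolute_path "$SHARED_DIR"; then
    echo "SHARED_DIR is absolute: $SHARED_DIR"
else
    echo "SHARED_DIR must be an absolute path, got: $SHARED_DIR" >&2
fi
```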

1.2. On your local machine (without Docker)

If you don't want to use Docker and prefer to install CAARS from source, follow these instructions.

To follow the tutorial after installing CAARS and its dependencies, you just have to export the environment variable $SHARED_DIR and define the functions docker_prep_data_cmd and docker_caars_cmd so that they print an empty string:

# ON YOUR MACHINE
mkdir /home/crey/working_dir/     # whatever directory
cd /home/crey/working_dir/ 
export SHARED_DIR=$PWD      # We will use the variable $SHARED_DIR as the path to your working directory
function docker_prep_data_cmd { echo ""
}
function docker_caars_cmd { echo ""
}

2. Running CAARS: example on mouse test data

Now that installation is complete, we can run CAARS. We first need to collect the input data. CAARS needs as input 5 items (full description of each item is below):

  1. RNA-Seq data from one or several species
  2. a sample sheet file describing RNA-Seq data
  3. a directory with gene family alignments
  4. a directory with species-sequence map files
  5. a species tree file

2.1. Data acquisition

For each item, you can either download the prebuilt data directly or get the data yourself (when possible).

2.1.1. Get RNA-Seq data

RNA-Seq data in CAARS can come from one or multiple model or non-model species and can be single- or paired-end (stranded or not). Input files must be in fastq or fasta format. In this tutorial, we will use a subset of a paired-end unstranded RNA-Seq library from Mus musculus kidney (available in the SRA database from Fushan AA et al. (2015)):

2.1.1.1. Download SRA archive in a directory named $DATA_DIR

Get the prebuilt data

To get the test data, simply run:

 # ON YOUR MACHINE

 export DATA_DIR=$SHARED_DIR/data/    # create a variable DATA_DIR which contains the path to all input data
 mkdir -p $DATA_DIR
 cd $DATA_DIR
 # download and unzip data in DATA_DIR
 `docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/rna_seq.tar.gz && tar xvzf rna_seq.tar.gz

This dataset is just a subset of the SRA SRR636918 archive. It is more than enough for this tutorial.

Next step

Get the data yourself

To get the SRA archives, we will use fastq-dump from sra-toolkit.

Note that sratoolkit is already installed in the carinerey/caars_tuto image but not in the carinerey/caars image.

a. Get a subset of SRA archives

If you want to download just a subset of an archive you can use the -X and -N options of fastq-dump. For example:

export DATA_DIR=$SHARED_DIR/data/
mkdir -p $DATA_DIR
# download data
cd $DATA_DIR
mkdir -p rna_seq
for SRR in SRR636918 SRR636917 SRR636916
do
## Download a subset of the archive $SRR
`docker_prep_data_cmd` fastq-dump --split-files -O rna_seq  --defline-seq  "@\$ac_\$si/\$ri" --defline-qual "+" -M 51 -X 100000 -N 200000 $SRR
done

Note that 2 files will be created for each archive.

b. Get full archive (if interested in reproducing the paper results)

If you want to retrieve the full archives to reproduce the analyses of the paper, you must also download SRR636917 and SRR636916 in addition to SRR636918.

Run:

export DATA_DIR=$SHARED_DIR/data/
mkdir -p $DATA_DIR
# download data
cd $DATA_DIR
mkdir -p rna_seq
for SRR in SRR636918 SRR636917 SRR636916
do
## Download the whole archive $SRR
`docker_prep_data_cmd` fastq-dump --split-files -O rna_seq  --defline-seq  "@\$ac_\$si/\$ri" --defline-qual "+" -M 51 $SRR
done

2.1.1.2. Rename the RNA-Seq library files

Now, we concatenate the reads of all downloaded archives into one file per read direction, named after the species:

cd $DATA_DIR
cat rna_seq/SRR*_1.fastq > rna_seq/Mus_musculus_1.fq
cat rna_seq/SRR*_2.fastq > rna_seq/Mus_musculus_2.fq
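After concatenation, the left and right files of a paired-end library should contain the same number of reads. The sketch below checks this on two tiny made-up files; count_fastq_reads is a hypothetical helper, and it assumes unwrapped FASTQ (exactly 4 lines per read).

```shell
# Hypothetical helper: number of reads in an unwrapped FASTQ file (4 lines per read).
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Toy demonstration on two tiny made-up files; in the tutorial you would pass
# rna_seq/Mus_musculus_1.fq and rna_seq/Mus_musculus_2.fq instead.
demo_dir="$(mktemp -d)"
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$demo_dir/left.fq"
printf '@r1/2\nACGT\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' > "$demo_dir/right.fq"

left=$(count_fastq_reads "$demo_dir/left.fq")
right=$(count_fastq_reads "$demo_dir/right.fq")
if [ "$left" -eq "$right" ]; then
    echo "OK: $left read pairs"
else
    echo "Mismatch: $left left reads vs $right right reads" >&2
fi
```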

2.1.2. Prepare the sample sheet file describing RNA-Seq data

This file contains the information describing RNA-Seq data and associates each sample file to some characteristics. This file is composed of 11 tab-delimited columns:

  1. Sample ID: a unique identifier describing the sample (such as CMM for CAARS Mus Musculus). This will be the prefix of all sequences.
  2. Sample species name (here, Mus_musculus)
  3. Sample group: the group the sample belongs to. (All samples of a group are processed at the same time in the assisted assembly; samples are not merged, see the assisted assembly documentation) (here, g1)
  4. Reference species name: Guide species to annotate the sample (if several, separated by comma) (here, Homo_sapiens)
  5. Path for a SE RNA-Seq file: /path/to/my/SE/single.fastq or /path/to/my/SE/single.fasta (if PE data write: -). The format is detected automatically from the file extension: .fasta, .fa, .fastq or .fq.
  6. Path for a left PE RNA-Seq file: /path/to/my/PE/data.R1.fastq or /path/to/my/PE/data.R1.fasta (if SE data write: -). The format is detected automatically from the file extension: .fasta, .fa, .fastq or .fq.
  7. Path for a right PE RNA-Seq file: /path/to/my/PE/data.R2.fastq or /path/to/my/PE/data.R2.fasta (if SE data write: -). The format is detected automatically from the file extension: .fasta, .fa, .fastq or .fq.
  8. Strand and type of the RNA-Seq run: F, R, RF, FR, US or UP
    • US/UP: unstranded SE/PE
    • F/R: SE stranded Forward/Reverse
    • RF/FR: PE stranded
  9. Run standard pipeline on data: Yes or No (if Yes (typical case), this builds a draft whole-transcriptome assembly (if not provided in column 10) and dispatches sequences in families according to sequence similarities. (see CAARS overview))
  10. Path to a given draft assembly: /path/to/my/fasta/draft/assembly.fa (UTRs have to be removed, CDS only!)
  11. Run assisted assembly on data: Yes or No (if Yes (typical case), this builds an assisted assembly by gene family using guide sequences from this given gene family as baits (see CAARS overview))

Column order is important and column names are required.

Here is an example table:

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens - /home/crey/shared/data/rna_seq/Mus_musculus_1.fq /home/crey/shared/data/rna_seq/Mus_musculus_2.fq UP yes - yes

You can copy this example table or get it from here and put it in $DATA_DIR.

Get the prebuilt data

cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.tsv
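If you write the sample sheet by hand, a common pitfall is ending up with spaces instead of tabs. The sketch below writes the example table with real tab characters and then checks its shape. check_sample_sheet is a hypothetical helper, not a CAARS command; accepting "-" in the orientation column is an assumption based on the draft-only example of section 3.1.

```shell
# Write the example sample sheet with real tab characters (printf expands \t).
sheet="$(mktemp)"   # stand-in for $DATA_DIR/sample_sheet.tsv
printf 'id\tspecies\tgroup\tref_species\tpath_fastx_single\tpath_fastx_left\tpath_fastx_right\torientation\trun_standard\tpath_assembly\trun_apytram\n' > "$sheet"
printf 'CMM\tMus_musculus\tg1\tHomo_sapiens\t-\t%s\t%s\tUP\tyes\t-\tyes\n' \
    /home/crey/shared/data/rna_seq/Mus_musculus_1.fq \
    /home/crey/shared/data/rna_seq/Mus_musculus_2.fq >> "$sheet"

# Hypothetical check: every line must have 11 tab-separated fields, and the
# orientation column (8) must hold a documented code (or "-", see assumption).
check_sample_sheet() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF != 11 { printf "line %d: %d columns instead of 11\n", NR, NF; bad = 1 }
        NR > 1 && NF == 11 && $8 !~ /^(F|R|RF|FR|US|UP|-)$/ {
            printf "line %d: unknown orientation %s\n", NR, $8; bad = 1
        }
        END { exit bad }
    ' "$1"
}

check_sample_sheet "$sheet" && echo "sample sheet looks well formed"
```

This only checks the layout of the file; the --just-parse-input run described in section 2.2.1.1 remains the reference check.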

2.1.3. Prepare 2 directories (i) with gene family alignments and (ii) with species-sequence map files

Here we retrieve the existing gene families that will be used in CAARS. In this tutorial, we will use some gene families from the Ensembl Compara dataset (but in real life you can use other nucleotide-level multi-species alignments, from public or in-house databases). To speed up the process, in this tutorial we subset the Compara dataset to keep only 13 mammal representatives (we deliberately discarded mouse).

| Common name | Taxon ID | Scientific name |
|---|---|---|
| Armadillo | 9361 | Dasypus_novemcinctus |
| Elephant | 9785 | Loxodonta_africana |
| Pig | 9823 | Sus_scrofa |
| Sheep | 9940 | Ovis_aries |
| Microbat | 59463 | Myotis_lucifugus |
| Cat | 9685 | Felis_catus |
| Ferret | 9669 | Mustela_putorius_furo |
| Bushbaby | 30611 | Otolemur_garnettii |
| Marmoset | 9483 | Callithrix_jacchus |
| Human | 9606 | Homo_sapiens |
| Rabbit | 9986 | Oryctolagus_cuniculus |
| Guinea Pig | 10141 | Cavia_porcellus |
| Squirrel | 43179 | Ictidomys_tridecemlineatus |

Get the prebuilt data

The fast way to get the subset of families with only sequences from the 13 species listed above (useful for this tutorial) is to run:

cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/MSA.tar.gz && tar xvzf MSA.tar.gz
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/Seq2SpTable.tar.gz && tar xvzf Seq2SpTable.tar.gz

Warning: Alignments must only contain "ATGCNUWSMKRYBDHV-" characters.
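A quick way to spot offending characters before running CAARS is to scan the sequence lines of each alignment. The sketch below assumes the alignments are FASTA files (header lines starting with ">"); check_alignment_chars is a hypothetical helper, not a CAARS command. It applies the character set exactly as listed above, so lowercase letters are reported too.

```shell
# Hypothetical helper: report sequence lines containing characters outside the
# allowed set; returns a non-zero status if any are found.
check_alignment_chars() {
    awk '
        BEGIN { bad = 0 }
        /^>/ { next }                       # skip FASTA header lines
        /[^ATGCNUWSMKRYBDHV-]/ {
            printf "%s:%d: %s\n", FILENAME, FNR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Toy files with made-up sequences.
demo_dir="$(mktemp -d)"
printf '>seq1\nATGC--NNRY\n>seq2\nATGCATGCAT\n' > "$demo_dir/good.fa"
printf '>seq1\nATGC**NNRY\n>seq2\natgcatgcat\n' > "$demo_dir/bad.fa"

check_alignment_chars "$demo_dir/good.fa" && echo "good.fa: OK"
check_alignment_chars "$demo_dir/bad.fa"  || echo "bad.fa: invalid characters found"
```

On the tutorial data, you would call the helper in a loop over the files of $DATA_DIR/MSA.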

Next step

Get the data yourself

If you want to use other gene families, you can also download them directly from the Ensembl API using some scripts (see below).

2.1.3.1. Gene family downloading from Ensembl API

Note that the Ensembl API is already installed in the carinerey/caars_tuto image but not in the carinerey/caars image.

(See here for more information on Ensembl API.)

First, create a directory to keep the scripts.

export SCRIPTS_DIR=$SHARED_DIR/scripts/
mkdir -p $SCRIPTS_DIR

The script Get_MSAs_EnsemblCompara.pl is available here.

cd $SCRIPTS_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/src/Get_MSAs_EnsemblCompara.pl

# To get given families
cd $DATA_DIR

`docker_prep_data_cmd` perl $SCRIPTS_DIR/Get_MSAs_EnsemblCompara.pl -s "Dasypus_novemcinctus,Loxodonta_africana,Sus_scrofa,Ovis_aries,Myotis_lucifugus,Felis_catus,Mustela_putorius_furo,Otolemur_garnettii,Callithrix_jacchus,Homo_sapiens,Oryctolagus_cuniculus,Cavia_porcellus,Ictidomys_tridecemlineatus"  -f ENSGT00550000074800,ENSGT00550000074846,ENSGT00870000136549  -g "Homo_sapiens" -o $DATA_DIR

# To get all families as in the paper

`docker_prep_data_cmd` perl $SCRIPTS_DIR/Get_MSAs_EnsemblCompara.pl -s "Dasypus_novemcinctus,Loxodonta_africana,Sus_scrofa,Ovis_aries,Myotis_lucifugus,Felis_catus,Mustela_putorius_furo,Otolemur_garnettii,Callithrix_jacchus,Homo_sapiens,Oryctolagus_cuniculus,Cavia_porcellus,Ictidomys_tridecemlineatus" -f all  -g "Homo_sapiens" -o $DATA_DIR

After running this script, 2 directories are created that contain (i) the gene family alignments (in $DATA_DIR/MSA) and (ii) the sequence-species maps (in $DATA_DIR/Seq2SpTable).

2.1.4. Prepare the species tree file

We need one last input file for CAARS: the species tree file. Here, we take the Ensembl Compara species tree and keep only the 13 species listed above plus mouse (14 species in total).

      /-Loxodonta_africana
   /-|
  |   \-Dasypus_novemcinctus
  |
  |            /-Mustela_putorius_furo
  |         /-|
  |      /-|   \-Felis_catus
--|     |  |
  |   /-|   \-Myotis_lucifugus
  |  |  |
  |  |  |   /-Ovis_aries
  |  |   \-|
  |  |      \-Sus_scrofa
  |  |
   \-|         /-Homo_sapiens
     |      /-|
     |   /-|   \-Callithrix_jacchus
     |  |  |
     |  |   \-Otolemur_garnettii
     |  |
      \-|         /-Mus_musculus
        |      /-|
        |   /-|   \-Ictidomys_tridecemlineatus
        |  |  |
         \-|   \-Cavia_porcellus
           |
            \-Oryctolagus_cuniculus

Get the prebuilt data

The species tree is available here or run:

cd $DATA_DIR
`docker_prep_data_cmd` wget https://github.com/CarineRey/caars/wiki/data/species.tree

2.2. Running CAARS

2.2.1. Most frequent usage

2.2.1.1. First step: check input files (OPTIONAL but highly recommended)

Verifying the data first avoids running CAARS for hours before hitting an error (e.g. because of a misspelled file name).

export RUN_DIR=$SHARED_DIR/CAARS_dir
export OUT_DIR=$SHARED_DIR/CAARS_dir/output
mkdir -p $RUN_DIR $OUT_DIR
cd $RUN_DIR
`docker_caars_cmd` caars  --outdir $OUT_DIR --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --just-parse-input

No error message should appear. If one does, it should point you to the problematic input file(s). If the problem persists, please contact me: carine.rey [at] ens-lyon.fr.
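A misspelled path in the sample sheet can also be caught locally, before starting any container. The sketch below reads the path columns of a sample sheet (5 to 7 for the reads, 10 for the optional draft assembly) and reports the files that do not exist. check_sheet_paths is a hypothetical helper, not a CAARS command, and the demonstration uses empty placeholder files.

```shell
# Hypothetical helper: fail if a file listed in the sample sheet does not exist.
check_sheet_paths() {
    # Skip the header line, then print the four path columns one per line.
    awk -F '\t' 'NR > 1 { print $5; print $6; print $7; print $10 }' "$1" |
    {
        missing=0
        while IFS= read -r path; do
            [ "$path" = "-" ] && continue   # "-" means "not provided"
            [ -z "$path" ] && continue
            if [ ! -e "$path" ]; then
                echo "missing file: $path" >&2
                missing=1
            fi
        done
        [ "$missing" -eq 0 ]
    }
}

# Toy demonstration with empty placeholder read files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/reads_1.fq" "$demo_dir/reads_2.fq"
printf 'id\tspecies\tgroup\tref_species\tpath_fastx_single\tpath_fastx_left\tpath_fastx_right\torientation\trun_standard\tpath_assembly\trun_apytram\n' > "$demo_dir/sheet.tsv"
printf 'CMM\tMus_musculus\tg1\tHomo_sapiens\t-\t%s\t%s\tUP\tyes\t-\tyes\n' \
    "$demo_dir/reads_1.fq" "$demo_dir/reads_2.fq" >> "$demo_dir/sheet.tsv"

check_sheet_paths "$demo_dir/sheet.tsv" && echo "all listed files exist"
```

Because the shared directory is mounted under the same absolute path inside the container, a path that exists on your machine under $SHARED_DIR is also visible to CAARS.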

You will find in the output directory some preliminary statistic files:

  1. DetectedFamilies.txt
  2. FamilyMetadata.txt
  3. SpeciesMetadata.txt
  4. UsableFamilies.txt
  5. UsedFamilies.txt

See the output description (section 2.2.2.1) and section 3.4 for the meaning and usage of these files.

2.2.1.2. Second step: run CAARS

You can now run CAARS, hurrah! \o/ This should take several minutes on your machine.

cd $RUN_DIR
`docker_caars_cmd` caars  --outdir $OUT_DIR --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50

--mpast is the most useful CAARS option. It stands for "minimum percentage of alignment on the sister taxon". If set to 50, all sequences that correspond to a coverage of less than 50% compared to the sister sequence in the gene tree won't be output. This option allows filtering of low quality sequences.

For more information on CAARS options, run caars -h.

Note that CAARS "remembers" what has already been run. In the directory where you launched CAARS, a directory called _caars has been built. _caars contains all cache files, ordered and named by their ID. IDs are assigned automatically and are not "user-understandable" (for instance: 2dc0b68e18f6838b8969cee52d9726c4, 5f5d716bb48acf278015199e73914f5f ...). This data structure allows CAARS to check whether a step has already been performed in a previous run; if the corresponding cache file(s) exist, the step(s) will not be run again.

2.2.2. You made it! Now, how to interpret and use CAARS output?

2.2.2.1. What are the output files?

CAARS results are stored in the directory specified by --outdir. All files contained in this output directory are in fact links to their corresponding files in the _caars directory (you don't need to access the _caars directory but it must be in the shared directory between the Docker container and your local machine).

In the outdir you will find 4 directories and 7 files. These directories are created at the very end of the CAARS process, so do not worry if they do not appear in the output directory right after launching CAARS. Only 3 of the output directories are of interest to the user.

  • Files:

    1. all_fam.seq2sp.tsv: map between sequence names and species (for all families). A tabular file with 4 columns (listed below). It can be used as input of a later CAARS run if you keep only the first 2 columns (cut -f 1-2 all_fam.seq2sp.tsv > input_next_run.seq2sp.txt)

      • a. Sequence name
      • b. Species name
      • c. Gene family name
      • d. Name of the biggest set of orthologs containing this sequence in this family
    2. all_fam.orthologs.tsv: map between all sequence names and their orthologs. A tabular file with 2 columns:

      • a. Sequence name
      • b. Closest orthologs of this sequence
    3. DetectedFamilies.txt : all families detected in the alignment-dir.

    4. FamilyMetadata.txt : Meta data about each detected family (# of sequences and species, and if it is usable in caars).

    5. SpeciesMetadata.txt : Meta data about each species (# of sequences and families where it is present)

    6. UsableFamilies.txt : List of families usable in CAARS (# of sequences >= 3 and # of species >= 3).

    7. UsedFamilies.txt: all families used in this CAARS run (By default, it is all detected families but you can reduce it by using --family-subset )

  • Directories:

    1. assembly_results_by_fam: contains several subdirectories, each containing the results for each gene family:

      • a. MSA_out: final MSAs (by family)
      • b. GeneTreeReconciled_out: final reconciled gene trees (by family)
      • c. GeneTree_out: final gene trees before reconciliation (by family)
      • d. Orthologs_out: ortholog relationship defined at the reconciliation step (by family)
      • e. DL_out: Duplication/Losses events defined at the reconciliation step (by family)
      • f. FilterSummary_out: Filtering summary (stats and sequences discarded by the --mpast option) (by family)
    2. draft_assemblies: contains draft assemblies, i. e. raw sequences and cds for each target species

    3. assembly_results_only_seq: contains all sequences reconstructed by CAARS (concatenated in a single file for each species) and a table with the annotation by family

Note that in the outdir directory another subdirectory is created: outdir/_files (contains links to files in the _caars directory). This is not useful for the user.
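As a first look at the results, you can count how many sequences each species has in all_fam.seq2sp.tsv (species name in the second column). The sketch below runs on a toy table with made-up sequence and family names. count_seqs_per_species is a hypothetical helper, and it assumes the file has no header line.

```shell
# Hypothetical helper: print "species<TAB>number of sequences" from a seq2sp table.
count_seqs_per_species() {
    cut -f 2 "$1" | LC_ALL=C sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Toy table with made-up sequence and family names (4 tab-separated columns).
demo="$(mktemp)"
printf 'CMM_0001\tMus_musculus\tfam1\tortho1\n'  > "$demo"
printf 'CMM_0002\tMus_musculus\tfam2\tortho2\n' >> "$demo"
printf 'HS_0001\tHomo_sapiens\tfam1\tortho1\n'  >> "$demo"

count_seqs_per_species "$demo"
```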

2.2.2.2. How to use CAARS results?

You've now finished the CAARS "basic" tutorial. You reconstructed a mouse transcriptome from RNA-Seq data; in addition, these transcripts have been assigned to gene family alignments that you can use directly for comparative phylogenomic analyses. Finally, gene trees integrating your newly assembled transcripts are available for each family.

2.2.2.3. I used Docker to run CAARS: can I exit the Docker container without losing my results? Where are the results on my local machine?

Don't worry, your results are on your machine too. They will be in the directory specified by the option --outdir (which is either the shared directory between your computer and the Docker container or one of its subdirectories). The --outdir directory does not contain the "real" results but rather links to the _caars directory (where the "real" results are stored). If you wish to copy the --outdir directory, you must first get rid of these links. For example, use cp -rL /home/crey/shared/outdir/ /home/crey/new/location/for/outdir.
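The effect of the -L option can be checked on a toy directory: the copy contains regular files instead of links. The directory layout and file names below are made up for the demonstration.

```shell
# Toy layout: one "real" file in a cache directory and a link to it in outdir.
workdir="$(mktemp -d)"
mkdir -p "$workdir/_caars" "$workdir/outdir"
echo "ACGT" > "$workdir/_caars/real_result.fa"
ln -s "$workdir/_caars/real_result.fa" "$workdir/outdir/result.fa"

# -r copies recursively, -L follows the links and copies the files they point to.
cp -rL "$workdir/outdir" "$workdir/outdir_copy"

[ -L "$workdir/outdir/result.fa" ] && echo "original entry is a symbolic link"
[ ! -L "$workdir/outdir_copy/result.fa" ] && echo "copied entry is a regular file"
```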



3. Going further with CAARS: playing around with CAARS possibilities

The "basic" tutorial above covers the typical use of CAARS, starting from RNA-Seq data only. However, you may already have an assembly or existing alignments on your local machine. Below are examples of CAARS usage in these cases. We will use the environment and the data of the "basic" tutorial above, and only change $OUT_DIR:

export OUT_DIR=$SHARED_DIR/CAARS_dir/output_going_further
mkdir -p $OUT_DIR

Of note: you might already have some files on your computer that you would like to access from the Docker container. These files must be in the shared directory (such as /home/crey/shared/) or in one of its subdirectories. For example, RNA-Seq data would be in /home/crey/shared/rna_seq/, the gene family alignments would be in the directory /home/crey/shared/gene/family/alignments/... If you run CAARS without Docker, this is not necessary.

(Remember that a CAARS command line needs at least 4 arguments that specify the input directories or files: --sample-sheet, --species-tree, --alignment-dir and --seq2sp-dir.)

3.1. I just have a transcriptome assembly and no RNA-Seq data: can I still use CAARS?

Yes, you can. If you don't have RNA-Seq data or simply don't want to use it, you can add your own transcriptome to multi-sequence alignments.

For demonstration purposes, you can get such a transcriptome here (small fasta file) and put it in $DATA_DIR.

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/draft_transcriptome.fa

Note that draft_transcriptome.fa sequences must be CDSs. UTRs will not be trimmed by CAARS.

Here is the sample sheet associated with this transcriptome (you might have to adapt it to your data).

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens - - - - yes /home/crey/shared/data/draft_transcriptome.fa no

You can copy it or get it from here and put it in $DATA_DIR.

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.only_draft_transcriptome.tsv

The next step is to run CAARS, and you're done! The CAARS output will be in $OUT_DIR/only_draft_transcriptome.

cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/only_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.only_draft_transcriptome.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50 

3.2. I want to use CAARS on my RNA-Seq data but I already have a transcriptome: can I use it?

Providing a transcriptome assembly as well as RNA-Seq data will speed up CAARS (compared to RNA-Seq data only): the draft assembly is normally built within CAARS, and this step will be bypassed. CAARS will also refine your transcriptome with the help of the RNA-Seq data during the assisted assembly step.

For demonstration purposes, you can get such a transcriptome here (small fasta file) and put it in $DATA_DIR.

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/draft_transcriptome.fa

If you want your assembly file to be readable in the Docker container, it must be in the shared directory (such as /home/crey/shared/) or in one of its subdirectories. Note that draft_transcriptome.fa sequences must be CDSs. UTRs will not be trimmed by CAARS.

Here is the sample sheet associated with this transcriptome (you might have to adapt it to your data).

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens - /home/crey/shared/data/rna_seq/Mus_musculus_1.fq /home/crey/shared/data/rna_seq/Mus_musculus_2.fq UP yes /shared/directory/data/draft_transcriptome.fa yes

You can copy this table or get it from here and put it in $DATA_DIR.

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.with_draft_transcriptome.tsv

The next step is to run CAARS, and you're done! The CAARS output will be in $OUT_DIR/with_draft_transcriptome.

cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/with_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.with_draft_transcriptome.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50

3.3. I want to add new families to a previous caars run: is it possible?

This is possible only if you prepared your data before the first run.

If the content of the alignment-dir and/or the seq2sp-dir changes, CAARS will interpret that as new data.

To avoid this behavior, you must:

  • put all your families in the alignment-dir and the seq2sp-dir that you use, from the first run onwards
  • use the option --family-subset with a file containing the names of the families of the first subset
  • add the new family names to this file for the following runs
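Adding names to the family subset file can be scripted. The sketch below assumes the file lists one family name per line (an assumption, the exact format is not described here) and appends new names while keeping the existing order and skipping duplicates. add_families and the file names are hypothetical; the family identifiers are the ones used earlier in this tutorial.

```shell
# Hypothetical helper: append family names from $2 to the subset file $1,
# keeping the existing order and skipping blank lines and duplicates.
add_families() {
    cat "$1" "$2" | awk 'NF && !seen[$0]++' > "$1.tmp" && mv "$1.tmp" "$1"
}

demo_dir="$(mktemp -d)"
printf 'ENSGT00550000074800\nENSGT00550000074846\n' > "$demo_dir/family_subset.txt"
printf 'ENSGT00550000074846\nENSGT00870000136549\n' > "$demo_dir/new_families.txt"

add_families "$demo_dir/family_subset.txt" "$demo_dir/new_families.txt"
cat "$demo_dir/family_subset.txt"
```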

3.4. I have families with less than 3 species: what can I do with them?

CAARS can't annotate sequences from a family with fewer than 3 species because the reconciliation step can't run with fewer than 3 species.

However, the sequences in these families carry information that is useful when checking the family membership of sequences from other families.

To keep this information, keep these families in the CAARS sequence database but don't use them to assemble and annotate sequences. To do so, provide these families as CAARS input in the alignment-dir and the seq2sp-dir, but exclude them from the "Usable" families with the --family-subset option.

To get the "Usable" families of your alignment-dir, you can first run CAARS on your data with the option --just-parse-input and look at the file "UsableFamilies.txt" in the output directory.

Copy this file to your $DATA_DIR and pass it to your complete CAARS command line with the option --family-subset.

cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/with_draft_transcriptome --sample-sheet $DATA_DIR/sample_sheet.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50 --family-subset $DATA_DIR/UsableFamilies.txt
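To see which families will be kept in the database without being annotated, you can compare DetectedFamilies.txt with UsableFamilies.txt after a --just-parse-input run. The sketch below assumes both files list one family name per line (an assumption). list_unusable_families is a hypothetical helper, and the demonstration uses toy files in which one family is arbitrarily left out of the usable list.

```shell
# Hypothetical helper: print the families present in $1 (detected) but not in $2 (usable).
list_unusable_families() {
    detected_sorted="$(mktemp)"
    usable_sorted="$(mktemp)"
    LC_ALL=C sort "$1" > "$detected_sorted"
    LC_ALL=C sort "$2" > "$usable_sorted"
    # comm -23 keeps the lines found only in the first file
    LC_ALL=C comm -23 "$detected_sorted" "$usable_sorted"
    rm -f "$detected_sorted" "$usable_sorted"
}

# Toy files: which family is "unusable" here is arbitrary.
demo_dir="$(mktemp -d)"
printf 'ENSGT00550000074800\nENSGT00550000074846\nENSGT00870000136549\n' > "$demo_dir/DetectedFamilies.txt"
printf 'ENSGT00550000074800\nENSGT00870000136549\n' > "$demo_dir/UsableFamilies.txt"

list_unusable_families "$demo_dir/DetectedFamilies.txt" "$demo_dir/UsableFamilies.txt"
```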

3.5. I want to add new samples in a previous CAARS run: is it possible?

If you want to add samples to a previous CAARS run, you have to add one new line per sample to the sample sheet.

For example, if you want to add sequences from another species (Meriones_auratus), the sample sheet will look like this:

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens,Cavia_porcellus - /home/crey/shared/data/rna_seq/Mus_musculus_1.fq /home/crey/shared/data/rna_seq/Mus_musculus_2.fq UP yes - yes
CMA Meriones_auratus g2 Homo_sapiens,Cavia_porcellus - /home/crey/shared/data/rna_seq/Meriones_auratus_1.fq /home/crey/shared/data/rna_seq/Meriones_auratus_2.fq UP yes - yes

Note: the assisted assembly is run on each family for all samples sharing the same guide species and the same group at the same time. So here, if Mus_musculus has already been processed and you don't want to run the assisted assembly on it again, you have to assign a different group to Meriones_auratus (g2) in order to dissociate the assisted assembly of these 2 samples. This way, CAARS will run the assisted assembly on Meriones_auratus independently and will reuse the output of the previous run for Mus_musculus.

3.6. I do not want to build a whole-transcriptome but just use the assisted assembler for specific families: is it possible?

This is possible; it will bypass the standard assembly process. You still need RNA-Seq data.

Here is the sample sheet that you must use (note a no in the run_standard column)

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens - /home/crey/shared/data/rna_seq/Mus_musculus_1.fq /home/crey/shared/data/rna_seq/Mus_musculus_2.fq UP no - yes

You can copy it or get this table from here and put it in $DATA_DIR

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.only_assisted.tsv

The next step is to run CAARS, and you're done! The CAARS output will be in $OUT_DIR/only_assisted.

cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/only_assisted --sample-sheet $DATA_DIR/sample_sheet.only_assisted.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5 --mpast 50

3.7. In CAARS documentation, it is said that we can use multiple guide species and RNA-Seq data coming from different species: how to do this?

Here is an example sample sheet that you can use. The species column refers to the RNA-Seq data, and the ref_species column to the guide species. To use RNA-Seq data from multiple species as input, you need one row per species. Guide species can be different for every input species. Each line is independent from the others, so the values of the run_standard, path_assembly and run_apytram columns can be combined freely.

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram
CMM Mus_musculus g1 Homo_sapiens,Cavia_porcellus - /home/crey/shared/data/rna_seq/Mus_musculus_1.fq /home/crey/shared/data/rna_seq/Mus_musculus_2.fq UP yes - yes
CTT Tursiops_truncatus g2 Sus_scrofa - /home/crey/shared/data/rna_seq/Tursiops_truncatus_1.fq /home/crey/shared/data/rna_seq/Tursiops_truncatus_2.fq UP no - yes

3.8. I already have multiple-sequence alignments; I want to use CAARS tools to refine my alignments but I have neither RNA-Seq data nor transcriptome

You just have to feed CAARS with an empty sample sheet (only column names) and run CAARS as usual (remember that your MSAs must be in the directory specified by the option --alignment-dir).

id species group ref_species path_fastx_single path_fastx_left path_fastx_right orientation run_standard path_assembly run_apytram

You can copy it or get it from here and put it in $DATA_DIR

cd $DATA_DIR
`docker_caars_cmd` wget https://github.com/CarineRey/caars/wiki/data/sample_sheet.nornaseq.tsv

You can then run CAARS:

cd $RUN_DIR
`docker_caars_cmd` caars --outdir $OUT_DIR/nornaseq --sample-sheet $DATA_DIR/sample_sheet.nornaseq.tsv --species-tree $DATA_DIR/species.tree --alignment-dir $DATA_DIR/MSA --seq2sp-dir $DATA_DIR/Seq2SpTable --np 2 --memory 5

3.9. I have an error with the normalization step.

Sometimes, when you use CAARS with several samples, the script running the normalization step may fail because of memory usage: normalization is one of the first steps of the pipeline, so all samples are processed at the same time.

-> Try reducing the available memory and launch the CAARS command line again. Only the samples that failed will be run again.

3.10. I have an error.

First, if there is an error message and you can fix its cause, fix it and launch CAARS again with the same command line.

Otherwise, try launching CAARS again with the same command line (sometimes the error comes from a conflict between some of the numerous steps running at the same time).

If the error persists and you don't understand it, don't hesitate to contact me at carine.rey [at] ens-lyon.fr.