Paper Appendix

Kristoffer edited this page Jan 3, 2019 · 2 revisions

De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm (RECOMB 2019). Kristoffer Sahlin, Paul Medvedev, Pennsylvania State University

Alignment parameters

For Smith-Waterman alignment of a read r to a representative c, isONclust uses the following parameters: match = 2, mismatch = -2, gapExt = -1. gapOpen is set as a function of the combined error rate of r and c, denoted by f1 = f2 + f3 (the sum of the error rates of r and c): gapOpen = 2 for f1 > 0.1, gapOpen = 3 for f4, gapOpen = 4 for f5, and gapOpen = 5 for f6.
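The rule above is a step function of the combined error rate, which can be sketched as follows. Only the first step (gapOpen = 2 above 0.1) is stated explicitly here. The function name `gap_open_penalty`, the caller-supplied `cutoffs` argument, the assumption that gapOpen rises as the combined error rate falls, and the choice of which boundary is inclusive are all illustrative and are not taken from the isONclust source.

```python
def gap_open_penalty(err_r, err_c, cutoffs):
    """Map the combined error rate of read r and representative c to gapOpen.

    cutoffs is a hypothetical pair (c1, c2) with 0.1 > c1 > c2 >= 0 that
    separates the three lower ranges; the actual values are not given in
    the text above, so the caller has to supply them.
    """
    c1, c2 = cutoffs
    if not 0.1 > c1 > c2 >= 0:
        raise ValueError("expected 0.1 > c1 > c2 >= 0")

    combined = err_r + err_c
    if combined > 0.1:
        return 2  # stated in the text: gapOpen = 2 above 0.1
    if combined > c1:
        return 3  # assumed ordering: lower error rate, higher gapOpen
    if combined > c2:
        return 4
    return 5
```

A higher gap-open cost for low-error pairs makes the aligner less willing to explain differences by indels when the reads are expected to be accurate.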

Commands used for running tools

CARNAC-LR

For CARNAC-LR we used version a5d8271d1bc503bcac00b615ee0673537ff99468 (git commit ID) and command line:

$ minimap -Sw2 -L100 -t8 {input.flnc_fasta} {input.flnc_fasta} > {minimap_out}
$ python paf_to_CARNAC.py {minimap_out} {input.flnc_fasta} {carnac_input}
$ CARNAC-LR -f {carnac_input} -t 8 -o {carnac_output}

We used the same minimap parameters as the CARNAC-LR authors used in their experiments.

isoseq3-cluster

For isoseq3, we used version sierra 0.7.1 (commit v0.4.0-121-g22a3096*) and command line:

$ isoseq3 cluster --num-threads 8 {input.ccs} {output.consensus} 

isONclust

For isONclust with Iso-Seq data, we used:

$ isonclust.py --t 8 --flnc {input.flnc} \
            --ccs {input.ccs} --outfolder {outfolder}

For the ONT data we ran isONclust as

$ isonclust.py --k 13 --w 20 --t 8 --fastq {input.fastq} \
    --outfolder {outfolder}

isONclust sets the following parameters by default:

--mapped_threshold=0.7 
--aligned_threshold=0.4
--min_prob_no_hits=0.1

Linclust

For LINCLUST, we used version 822c8b57bb3ded9f37540b7cc2c9b97cf319d6e8 (the git commit ID) and command line:

$ mmseqs easy-linclust --seq-id-mode 1 --cov-mode 1  --threads 8
                       --kmer-per-seq [21, 100, 1000, 100000] 
                       -c [0.0, 0.4, 0.5, 0.6, 0.8] -e [0.1, 0.001] 
                       {input.flnc_fasta} {linclust_output} {tmp_dir}

We ran linclust with default parameters, as well as with --seq-id-mode 1 --cov-mode 1 and various combinations of --kmer-per-seq, -c, and -e, after personal communication with the author about suitable parameters for this type of data. In general, we observed conservative results across all combinations. We present the results for the parameter setting that gave the most permissive clustering, --seq-id-mode 1 --cov-mode 1 --kmer-per-seq 10000 -c 0.0 -e 0.1, as it in general gave the best V-measure, completeness, and percentage of non-trivially clustered reads without notably sacrificing homogeneity. The runtime and memory usage for this parameter setting were significantly higher than for the other parameter settings.
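For readers unfamiliar with the metrics named above, here is a minimal sketch of homogeneity, completeness, and V-measure. It assumes the standard entropy-based definitions are the ones meant, with V-measure as the harmonic mean of the other two. The function names are illustrative, and `classes` and `clusters` are assumed to be equal-length lists giving each read's true class label and assigned cluster.

```python
from collections import Counter
from math import log


def _entropy(counts, n):
    """Shannon entropy of a partition given its part sizes."""
    return -sum(c / n * log(c / n) for c in counts if c)


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))

    h_c = _entropy(class_sizes.values(), n)    # H(C)
    h_k = _entropy(cluster_sizes.values(), n)  # H(K)
    # Conditional entropies H(C|K) and H(K|C) from the joint counts.
    h_c_given_k = -sum(
        n_ck / n * log(n_ck / cluster_sizes[k]) for (c, k), n_ck in joint.items()
    )
    h_k_given_c = -sum(
        n_ck / n * log(n_ck / class_sizes[c]) for (c, k), n_ck in joint.items()
    )

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v
```

Under these definitions, a conservative clustering that leaves every read in its own cluster is perfectly homogeneous but has low completeness, which is why a more permissive setting can raise V-measure as long as homogeneity holds up.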

qCluster

With qCluster we used the version available on the website as of 10/23/2018 (no version number available) and ran

$ qCluster -d e -c [20000, 1000, 100] -k [15,31] {input.reads} 
    >  {output_file}

We tried the parameter combinations within brackets, but qCluster returned segmentation faults and seemed to occupy more than 264Gb of memory for all combinations on our two smallest datasets, SIM-100k and RC0.
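The bracket notation used in these command lines can be expanded into concrete commands as sketched below. The sketch assumes that every bracketed value was combined with every other (a full cross product). The function name `qcluster_commands` and the output file naming scheme are illustrative only; the flags and values are copied from the qCluster command above.

```python
from itertools import product


def qcluster_commands(reads, out_prefix):
    """Build one qCluster command line per (-c, -k) combination.

    The output file naming (out_prefix plus parameter values) is a
    hypothetical convention chosen so that runs do not overwrite each other.
    """
    c_values = [20000, 1000, 100]  # bracketed -c values
    k_values = [15, 31]            # bracketed -k values
    commands = []
    for c, k in product(c_values, k_values):
        out = f"{out_prefix}_c{c}_k{k}.txt"
        commands.append(f"qCluster -d e -c {c} -k {k} {reads} > {out}")
    return commands


if __name__ == "__main__":
    for command in qcluster_commands("reads.fq", "qcluster"):
        print(command)
```

With three values for -c and two for -k, this yields six command lines. The same pattern applies to the bracketed values in the other command templates on this page.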

MeShClust

With MeShClust we ran version 1.0.0 as follows

$ meshclust {input.reads} --id [0.80,0.90] --threads 8 
    --output {output_file}

We tried the parameter combinations within brackets, but encountered a runtime error for every combination that we tested on SIM-100k and RC0 (issue submitted).

DNACLUST

With DNACLUST we ran release 3 as follows

$ dnaclust {input.reads}  -t 8 [--left_gaps_allowed] 
    -s [0.8, 0.85,0.9,0.95]  -k 5 --approximate-filter > {output_file}

We tried the parameter combinations within brackets, but encountered a segmentation fault 5 minutes into the run on the smallest simulated dataset, SIM-100k, and 3 hours into the run on RC0, although DNACLUST did not occupy more memory than was available (264Gb).

Cogent

We ran Cogent version 3.3 according to the tutorial on how to cluster large datasets. As the tutorial suggested, we ran precluster first, and then ran Cogent separately on each cluster created by precluster. With Cogent installed through conda, we ran:

$ run_preCluster.py --cpus=8   # generates a folder "precluster_out"
$ generate_batch_cmd_for_Cogent_family_finding.py --cpus=8 \
    --cmd_filename=cmd_file_{dataset_name} \
    preCluster.cluster_info.csv precluster_out \
    {dataset_name}_cogent_out
$ chmod +x cmd_file_{dataset_name}
$ ./cmd_file_{dataset_name}

Cogent's algorithm is, however, not suitable for the large clusters generated by precluster, and will stall due to its runtime complexity (personal communication with the author). On the RC0 dataset, we observed that one of the pre-clusters contained over 80,000 sequences. While we observed Cogent making progress on smaller clusters (<100 sequences), it stalled on the larger cluster (we killed the program after 72 hours).

IsoCon

We ran IsoCon v0.3.2.

$ IsoCon pipeline -fl_reads <flnc.fastq> -outfolder </path/to/output>

IsoCon is not designed for nontargeted Iso-Seq or ONT data. The tool relies on exact start and end positions in the transcripts, which come from the primer pairs designed for a targeted dataset.

minimap2

To align Iso-Seq reads to a reference genome we ran minimap2 with the following suggested parameters:

$ minimap2 -t 8 -ax splice -uf -C5 {ref} {output.fastq} > {output.alignment}

To align ONT reads to a reference genome we ran minimap2 with the following suggested parameters:

$ minimap2 -t 8 -ax splice -uf -k14 {ref} {input.reads} > {output.alignment}