Paper Appendix
De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm (RECOMB 2019). Kristoffer Sahlin, Paul Medvedev, Pennsylvania State University
For Smith-Waterman alignment of a read r to a representative c, isONclust uses the following parameters: match=2, mismatch=-2, gapExt=-1. GapOpen is set as a function of the combined error rate of r and c, denoted by ε = ε_r + ε_c: gapOpen=2 for ε > 0.1, with gapOpen=3, gapOpen=4, and gapOpen=5 each assigned to a further range of ε.
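To illustrate how these scoring parameters enter a local alignment, the following is a minimal sketch of a Smith-Waterman score computation with affine gaps (Gotoh's recurrences). It is not isONclust's implementation. The function name is hypothetical, and the sketch assumes that gap penalties are given as positive magnitudes and that a gap of length L costs gap_open + (L - 1) * gap_ext; the text above does not state either convention.

```python
def local_alignment_score(r, c, match=2, mismatch=-2, gap_open=2, gap_ext=1):
    """Best local alignment score of r against c with affine gap penalties.

    Assumption (not from the appendix): penalties are positive magnitudes and
    a gap of length L costs gap_open + (L - 1) * gap_ext.
    """
    neg_inf = float("-inf")
    m = len(c)
    h_prev = [0] * (m + 1)        # best scores ending in the previous row
    f_prev = [neg_inf] * (m + 1)  # previous row, alignments ending in a vertical gap
    best = 0
    for i in range(1, len(r) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # current row, alignments ending in a horizontal gap
        for j in range(1, m + 1):
            # Open a new gap from the best score, or extend an existing gap.
            e = max(h_cur[j - 1] - gap_open, e - gap_ext)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_ext)
            sub = match if r[i - 1] == c[j - 1] else mismatch
            # Local alignment: scores never drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# A read with one deleted base relative to the representative.
read, representative = "ACGTACGT", "ACGTTACGT"
low_penalty = local_alignment_score(read, representative, gap_open=2)
high_penalty = local_alignment_score(read, representative, gap_open=5)
```

Under these assumptions, a lower gapOpen makes it cheaper to bridge an indel, which is consistent with choosing a lower gapOpen when the combined error rate is high.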
For CARNAC-LR we used version a5d8271d1bc503bcac00b615ee0673537ff99468 (git commit ID) and command line:
$ minimap -Sw2 -L100 -t8 {input.flnc_fasta} {input.flnc_fasta}
$ python paf_to_CARNAC.py {minimap_out} {input.flnc_fasta} {carnac_input}
$ CARNAC-LR -f {carnac_input} -t 8 -o {carnac_output}
We used the same minimap parameters as the CARNAC-LR authors used in their experiments.
For isoseq3, we used version sierra 0.7.1 (commit v0.4.0-121-g22a3096*) and command line:
$ isoseq3 cluster --num-threads 8 {input.ccs} {output.consensus}
For isONclust with Iso-Seq data, we used:
$ isonclust.py --t 8 --flnc {input.flnc} \
    --ccs {input.ccs} --outfolder {outfolder}
For the ONT data we ran isONclust as
$ isonclust.py --k 13 --w 20 --t 8 --fastq {input.fastq} \
    --outfolder {outfolder}
isONclust uses the following default parameters:
--mapped_threshold=0.7
--aligned_threshold=0.4
--min_prob_no_hits=0.1
For LINCLUST, we used version 822c8b57bb3ded9f37540b7cc2c9b97cf319d6e8 (the git commit ID) and command line:
$ mmseqs easy-linclust --seq-id-mode 1 --cov-mode 1 --threads 8 \
    --kmer-per-seq [21, 100, 1000, 100000] \
    -c [0.0, 0.4, 0.5, 0.6, 0.8] -e [0.1, 0.001] \
    {input.flnc_fasta} {linclust_output} {tmp_dir}
We ran linclust with default parameters, as well as with --seq-id-mode 1 --cov-mode 1 and various combinations of --kmer-per-seq, -c, and -e, after personal communication with the author about suitable parameters for this type of data. In general, we observed conservative results across all combinations. We present the results for the parameter setting that gave the most permissive clustering, --seq-id-mode 1 --cov-mode 1 --kmer-per-seq 10000 -c 0.0 -e 0.1, as it in general gave the best V-measure, completeness, and percentage of non-trivially clustered reads without notably sacrificing homogeneity. The runtime and memory usage for this parameter setting were significantly higher than for the other parameter settings.
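The bracket notation in the command above lists the values tried for each option. As a minimal sketch of how such a parameter sweep could be enumerated, the following builds one argument list per combination. It assumes the full Cartesian product of the bracketed values, which the text does not confirm. The function name, the output naming scheme, and the file paths in the usage lines are hypothetical; the flags and values are taken from the command above.

```python
from itertools import product

# Values listed in brackets in the linclust command above.
KMER_PER_SEQ = [21, 100, 1000, 100000]
MIN_COVERAGE = [0.0, 0.4, 0.5, 0.6, 0.8]
E_VALUES = [0.1, 0.001]


def linclust_commands(flnc_fasta, out_prefix, tmp_dir, threads=8):
    """Return one mmseqs easy-linclust argument list per parameter combination."""
    commands = []
    for kmers, cov, evalue in product(KMER_PER_SEQ, MIN_COVERAGE, E_VALUES):
        # Hypothetical naming scheme so that runs do not overwrite each other.
        output = f"{out_prefix}_k{kmers}_c{cov}_e{evalue}"
        commands.append([
            "mmseqs", "easy-linclust",
            "--seq-id-mode", "1", "--cov-mode", "1",
            "--threads", str(threads),
            "--kmer-per-seq", str(kmers),
            "-c", str(cov), "-e", str(evalue),
            flnc_fasta, output, tmp_dir,
        ])
    return commands


# Hypothetical paths, for illustration only.
commands = linclust_commands("flnc.fasta", "linclust_out", "tmp")
for cmd in commands[:2]:
    print(" ".join(cmd))
```

Each argument list can be passed to subprocess.run to execute the corresponding run. The same pattern would apply to the bracketed options of the other tools in this appendix.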
With qCluster we used the version available on the website as of 10/23/2018 (no version number available) and ran
$ qCluster -d e -c [20000, 1000, 100] -k [15, 31] {input.reads} \
    > {output_file}
We tried the parameter combinations within brackets, but qCluster returned segmentation faults and seemed to occupy more than 264 GB of memory for all the combinations on our two smallest datasets, SIM-100k and RC0.
With MeShClust we ran version 1.0.0 as follows
$ meshclust {input.reads} --id [0.80, 0.90] --threads 8 \
    --output {output_file}
We tried the parameter combinations within brackets but we encountered a runtime error for all combinations that we tested on SIM-100k and RC0 (issue submitted).
With DNACLUST we ran release 3 as follows
$ dnaclust {input.reads} -t 8 [--left_gaps_allowed] \
    -s [0.8, 0.85, 0.9, 0.95] -k 5 --approximate-filter > {output_file}
We tried the parameter combinations within brackets but encountered a segmentation fault 5 minutes into the run on the smallest simulated dataset, SIM-100k, and 3 hours into the run on RC0, although DNACLUST never occupied more memory than was available (264 GB).
We ran Cogent version 3.3 following the Cogent tutorial on how to cluster large datasets. As the tutorial suggested, we ran precluster first, and then ran Cogent separately on each cluster created by precluster. With Cogent installed through conda, we ran:
$ run_preCluster.py --cpus=8  # generates a folder "precluster_out"
$ generate_batch_cmd_for_Cogent_family_finding.py --cpus=8 \
    --cmd_filename=cmd_file_{dataset_name} \
    preCluster.cluster_info.csv precluster_out \
    {dataset_name}_cogent_out
$ chmod +x cmd_file_{dataset_name}
$ ./cmd_file_{dataset_name}
Cogent's algorithm is, however, not suitable for the large clusters generated by precluster, and will halt due to runtime complexity (personal communication with author). On the RC0 dataset we observed that one of the pre-clusters generated contained over 80,000 sequences. While we observed Cogent making progress on smaller clusters (
We ran IsoCon v0.3.2 as follows
$ IsoCon pipeline -fl_reads <flnc.fastq> -outfolder </path/to/output>
IsoCon is not designed for nontargeted Iso-Seq or ONT data. The tool relies on exact start and end positions in the transcripts, which come from the primer pairs designed for a targeted dataset.
To align Iso-Seq reads to a reference genome we ran minimap2 with the following suggested parameters:
$ minimap2 -t 8 -ax splice -uf -C5 {ref} {output.fastq} > {output.alignment}
To align ONT reads to a reference genome we ran minimap2 with the following suggested parameters:
$ minimap2 -t 8 -ax splice -uf -k14 {ref} {input.reads} > {output.alignment}