Question on job parallelization #42

tamuanand · 2021-08-12T02:51:52Z

Hello Osamu

First off, thanks a lot for the great program and continued support/enhancement on this.

I have a question on job parallelization. Assume I have a protein query file of 15K sequences and I use one of two approaches:

Approach 1 - all 15K query sequences in 1 file - goes into Job 1a followed by sortgrcd to get gff3 files
Approach 2 - split the 15K sequences into 3 files, each file containing 5K sequences - goes into Job 2a, 2b, 2c followed by sortgrcd
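For illustration, the splitting step in Approach 2 could be sketched as below. This is a minimal sketch and not part of spaln: the function names `read_fasta_records` and `split_fasta` are hypothetical, the input is assumed to be a plain FASTA file small enough to hold in memory, and the output naming (`Query_1`, `Query_2`, ...) simply follows the file names used in this post.

```python
from pathlib import Path


def read_fasta_records(path):
    """Return a list of FASTA records, each a list of lines (header line first)."""
    records = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                records.append([line])       # a header starts a new record
            elif records:
                records[-1].append(line)     # sequence line of the current record
    return records


def split_fasta(path, n_chunks, out_dir="."):
    """Split a FASTA file into at most n_chunks files of contiguous records.

    Output files are named <stem>_1<suffix>, <stem>_2<suffix>, ... (an assumed
    convention). Records are never split across files.
    """
    records = read_fasta_records(path)
    src = Path(path)
    chunk_size = -(-len(records) // n_chunks)  # ceiling division
    out_paths = []
    for i in range(n_chunks):
        chunk = records[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            break  # fewer records than chunks: stop early
        out_path = Path(out_dir) / f"{src.stem}_{i + 1}{src.suffix}"
        with open(out_path, "w") as out:
            for record in chunk:
                out.writelines(record)
        out_paths.append(out_path)
    return out_paths
```

With 15K sequences and `n_chunks=3`, this would give three files of 5K sequences each, in the original order.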

In both cases, spaln was called appropriately after formatting the database:

spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query (for Job 1a)
spaln -t20 -Q7 -O12 -M1 [other options] -dDatabase Query_1[2,3] (for Jobs 2a, 2b, 2c, each with its own query file)

The question: Will there be any major differences between the two outputs?

Output of Approach 1 - sortgrcd -P40 -C50 -O0 Query.grd > spaln_single_job.gff3
Output of Approach 2 - sortgrcd -P40 -C50 -O0 Query_1.grd Query_2.grd Query_3.grd > spaln_multi_job.gff3 (this is run after ensuring that all the relevant *.{erd, qrd} files are in the same directory, and that the *.{ent, idx, grp, seq} files of the database are also present in the directory where the sortgrcd job is running)
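To make the structure of Approach 2 explicit, here is a rough sketch that only assembles the command lines quoted above as argument lists. The helper names are hypothetical, the flags are copied verbatim from this post, the "[other options]" placeholder is left for the caller to supply, and the assumption that each job produces `<query>.grd` is taken from the sortgrcd command above rather than from the spaln documentation.

```python
# Flags copied from the commands quoted in this post; nothing else is assumed.
SPALN_FLAGS = ["-t20", "-Q7", "-O12", "-M1"]
SORTGRCD_FLAGS = ["-P40", "-C50", "-O0"]


def spaln_command(database, query, extra_options=()):
    """Argument list for one spaln job; extra_options stands in for '[other options]'."""
    return ["spaln", *SPALN_FLAGS, *extra_options, f"-d{database}", query]


def sortgrcd_command(grd_files):
    """Argument list for the single sortgrcd run that merges all .grd files."""
    return ["sortgrcd", *SORTGRCD_FLAGS, *grd_files]


def plan_jobs(database, queries, extra_options=()):
    """Return (list of spaln commands, one sortgrcd command) for the given queries.

    Assumes each spaln job writes <query>.grd, as in the command quoted above.
    """
    spaln_jobs = [spaln_command(database, q, extra_options) for q in queries]
    merge = sortgrcd_command([f"{q}.grd" for q in queries])
    return spaln_jobs, merge
```

Calling `plan_jobs("Database", ["Query_1", "Query_2", "Query_3"])` gives the three independent spaln commands (Jobs 2a, 2b, 2c) and the one sortgrcd command. Each list could be handed to `subprocess.run` or a cluster scheduler, with the standard output of sortgrcd redirected to the GFF3 file.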

I looked through both outputs in many different ways and could not find any differences. I am going to productionize a pipeline, so I wanted to ask whether there are any specific caveats I should be aware of if I use Approach 2.
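One way such a comparison could be done is an order-insensitive check of the feature lines in the two GFF3 files. The sketch below is only an illustration, not necessarily how the outputs were compared here. The function names are hypothetical, and the `ignore_attributes` switch is an assumption on my part, for the case where column 9 carries identifiers that depend on the run.

```python
from collections import Counter


def gff3_feature_counts(path, ignore_attributes=False):
    """Count feature lines in a GFF3 file, skipping blank lines and '#' lines."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if ignore_attributes:
                fields = fields[:8]  # drop column 9 (attributes)
            counts[tuple(fields)] += 1
    return counts


def compare_gff3(path_a, path_b, ignore_attributes=False):
    """Return (only_in_a, only_in_b): features present more often in one file."""
    a = gff3_feature_counts(path_a, ignore_attributes)
    b = gff3_feature_counts(path_b, ignore_attributes)
    return a - b, b - a  # Counter subtraction keeps only positive counts
```

If both returned counters are empty for `spaln_single_job.gff3` and `spaln_multi_job.gff3`, the two files contain the same feature lines regardless of order.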

Thanks in advance,


ogotoh · 2021-08-16T09:49:32Z

Thank you for your interest in Spaln. Frankly speaking, I have not used spaln and sortgrcd in the way you describe since spaln began supporting multi-threaded operation, and in my environment I cannot easily use cluster machines. So you probably know better than I do how the combined use of spaln and sortgrcd performs in a multi-machine environment.

However, please wait a few days before you start your large-scale calculation. I have found a few bugs that can cause segmentation faults in rare situations (see issue #41). I have fixed them and am now testing the modified version on real data. I will announce it here when I release the fixed version.

Osamu,

tamuanand · 2021-08-17T01:06:05Z

Thanks a lot Osamu.

I would like to wait for your new/modified version of spaln.

ogotoh · 2021-09-14T02:31:24Z

Although it took unexpectedly long, I have finished modifying spaln. Tested on more than 100 pairs of genomic and assembled transcript DNA sequences from the DDBJ database, covering various levels of sequence similarity, the new version (Ver.2.4.6) runs without segmentation faults. For protein queries, I have not tested in the same detail, but it works fine on a few examples. I therefore did not want to delay the release of this version any further.

I thank you for your patience. If you encounter any problems with this or previous versions of spaln, please let me know at your convenience.

Osamu,
