Add optional input support #9

asishallab · 2021-12-03T18:01:41Z

In main.rs, the following optional inputs should be added. If the user supplies one of them, its value replaces the respective default set in default.rs.

Implementation

All optional (non-mandatory) options should take their default value from default.rs. In main, overwrite a default value if and only if the user provides a custom value on the command line.
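A minimal sketch of this override rule, assuming the command line has already been parsed into an `Option` per argument. The constant `DEFAULT_EXAMPLE_OPTION` and the function `resolve_option` are hypothetical names used only for illustration; they are not taken from default.rs or main.rs.

```rust
/// Placeholder for a constant that would live in default.rs.
/// Both the name and the value are hypothetical.
const DEFAULT_EXAMPLE_OPTION: &str = "built-in-default";

/// Returns the user's value if one was given on the command line,
/// otherwise falls back to the supplied default.
fn resolve_option(user_value: Option<&str>, default: &str) -> String {
    user_value.unwrap_or(default).to_string()
}
```

Calling `resolve_option(None, DEFAULT_EXAMPLE_OPTION)` yields the default, while `resolve_option(Some("custom"), DEFAULT_EXAMPLE_OPTION)` yields the custom value.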

Multiple vs single options

Please note that prot-scriber uses clap, which supports passing certain command-line options multiple times. In prot-scriber, seq-sim-table is one such option, among others. This means that a user can call prot-scriber with multiple Blast result tables like this:

prot-scriber --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt --seq-sim-table my_proteins_vs_trEMBL_blast_tableout.txt ...

Inside prot-scriber, the order of appearance of these repeated options is preserved. Because we also want to allow options that are specific to a given input, it is important to note that several input options can relate to each other by their order of appearance. See blacklist regex list below for a clear example.

Allow keyword `default` for all non mandatory options

Any optional option should accept the keyword default as its value. This enables any combination of position-sensitive custom and default options; see "Multiple vs single options" above for more details. So make sure that if a command-line option is given the value default, the default value from default.rs is used for that specific option.
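The keyword handling could be sketched as follows, assuming each option value arrives as an `Option<&str>` after parsing. `DEFAULT_KEYWORD` and `resolve_with_keyword` are hypothetical names, and the default passed in stands for whatever default.rs defines for that option.

```rust
/// The literal value a user passes to request the built-in default.
const DEFAULT_KEYWORD: &str = "default";

/// Resolves one option value: an absent value or the keyword `default`
/// both select the built-in default, anything else is taken as custom.
fn resolve_with_keyword(user_value: Option<&str>, default: &str) -> String {
    match user_value {
        None => default.to_string(),
        Some(value) if value == DEFAULT_KEYWORD => default.to_string(),
        Some(custom) => custom.to_string(),
    }
}
```

With this, a user can write the keyword at one position and a custom value at the next, so that only the second input deviates from the defaults.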

regex to split genes in gene families

This option should be named family-gene-separator-regex; the default has been discussed in issue #6.

regex to split gene identifier from gene-list in gene family input files

This option should be named family-id-gene-list-separator; the default has been discussed in issue #6. Enable the user to provide her/his own regular expression.

blacklist regex list

Imagine you want to provide a custom blacklist for a blast input table that you produced by searching your query proteins against a non-standard database, e.g. a genome that does not adhere to the Uniprot stitle line standards. To provide a blacklist that applies only to the blast input table at the same position as the blacklist, the user would do this:

prot-scriber  --seq-sim-table my_proteins_vs_alien_genome.txt --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt \
  --blacklist-regex-list alien_genome_blacklist_regexs.txt

Because --blacklist-regex-list appears only once while --seq-sim-table appears twice, the single --blacklist-regex-list argument is applied to the first --seq-sim-table, my_proteins_vs_alien_genome.txt, but not to the second, my_proteins_vs_Swissprot_blast_tableout.txt.
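The positional pairing described here can be sketched like this, assuming the values of both options have already been collected in order of appearance. `pair_by_position` is a hypothetical helper, not a function from prot-scriber; a missing blacklist at a given position is represented as `None`, meaning the default would be used for that table.

```rust
/// Pairs each --seq-sim-table value with the --blacklist-regex-list value
/// given at the same position. Tables without a blacklist at their
/// position get `None`, i.e. they fall back to the default blacklist.
fn pair_by_position<'a>(
    tables: &'a [String],
    blacklists: &'a [String],
) -> Vec<(&'a str, Option<&'a str>)> {
    tables
        .iter()
        .enumerate()
        .map(|(i, table)| (table.as_str(), blacklists.get(i).map(String::as_str)))
        .collect()
}
```

For the call above, the first table is paired with alien_genome_blacklist_regexs.txt and the second with `None`.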

filter regex list

Enable custom lists of regular expressions with which the stitles of Blast hits are filtered. Remember that filtering cuts out undesired parts of stitle strings. See filter_stitle in model_funs.rs for more details.

Note that this option can be given multiple times and is position sensitive, like blacklist-regex-list above. Depending on its position, i.e. on which occurrence of the option a value (a valid file path) is given for, it applies to the corresponding --seq-sim-table.

informative regex list

A method to distinguish informative from un-informative words is currently being implemented as specified here. Un-informative words are not removed from phrases, but they are not scored either. They are detected by applying a list of regular expressions in sequence to each word; if any regex matches, the word is considered un-informative. Enable users to supply their own regex lists of un-informative words. Again, this option is position sensitive: its order of appearance ties it to the --seq-sim-table input at the same position (see the examples above for more details).
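A rough sketch of the detection step, using only the standard library. Plain predicate functions stand in for compiled regular expressions here, which is an assumption made to keep the example free of external crates; with the regex crate, each predicate would wrap a call to `Regex::is_match`. `WordMatcher` and `flag_informative` are hypothetical names.

```rust
/// Stand-in for a compiled regex: any function that tests a single word.
type WordMatcher = Box<dyn Fn(&str) -> bool>;

/// Splits a phrase into words and flags each one. The flag is `true`
/// when the word is informative, i.e. when no matcher in the list
/// matches it. Words are kept either way, so un-informative words stay
/// in the phrase and can simply be skipped during scoring.
fn flag_informative<'a>(
    phrase: &'a str,
    uninformative: &[WordMatcher],
) -> Vec<(&'a str, bool)> {
    phrase
        .split_whitespace()
        .map(|word| {
            let is_uninformative = uninformative.iter().any(|matches| matches(word));
            (word, !is_uninformative)
        })
        .collect()
}
```

Which words count as un-informative by default is left open here; any matcher list passed in is the caller's choice.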

Atemia · 2021-12-22T13:51:09Z

Since the informative regex list "is position context-sensitive" and it's optional for the user to provide a list of uninformative words, then what should be in the defaults? Should it be empty or are there obvious uninformative word examples that could be included?

asishallab · 2022-04-13T10:40:05Z

Mostly (where applicable) done months ago by myself

asishallab added the enhancement New feature or request label Dec 3, 2021

asishallab assigned wunderbarr, coeit and Atemia Dec 3, 2021

asishallab closed this as completed Apr 13, 2022
