Add optional input support #9

Closed
asishallab opened this issue Dec 3, 2021 · 2 comments

@asishallab

In main.rs the following optional inputs should be added and, if given by the user, replace the respective default set in default.rs.

Implementation

For all optional (non-mandatory) options, use the default value from default.rs. In main, overwrite these defaults if and only if the user provides a custom value on the command line.
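The override rule above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the actual prot-scriber code: the constant `DEFAULT_SEPARATOR`, its placeholder value, and the helper `option_or_default` are hypothetical names, and the user's value is modelled as an `Option` as an argument parser would typically hand it over.

```rust
/// Hypothetical placeholder for a default that would live in default.rs.
/// The real default values are not shown in this issue.
const DEFAULT_SEPARATOR: &str = "<default from default.rs>";

/// Returns the user's value if one was given on the command line,
/// otherwise falls back to the supplied default.
fn option_or_default(user_value: Option<&str>, default: &str) -> String {
    user_value.unwrap_or(default).to_string()
}
```

With this shape, `option_or_default(None, DEFAULT_SEPARATOR)` yields the default, while `option_or_default(Some(","), DEFAULT_SEPARATOR)` yields the user's value.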

Multiple vs single options

Please note that prot-scriber uses clap, which supports passing certain command-line options multiple times. In prot-scriber, seq-sim-table is one such option, among others. This means that a user can call prot-scriber with multiple Blast result tables like this:

prot-scriber --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt --seq-sim-table my_proteins_vs_trEMBL_blast_tableout.txt ...

Inside prot-scriber the order of appearance of these repeated options is preserved. Because we also want to allow options that apply to one specific input, note that several input options can relate to each other by their order of appearance. See "blacklist regex list" below for a clear example.

Allow keyword default for all non mandatory options

Any non-required option should accept the keyword default as its value. This enables any combination of position-sensitive custom and default options; see "Multiple vs single options" above for details. So make sure that if a command-line option is given the value default, the default from default.rs is used for that specific option.
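One possible way to handle the default keyword is sketched below. It assumes the raw values of a repeated option have already been collected in order of appearance; the names `DEFAULT_KEYWORD`, `resolve_value` and `resolve_all` are hypothetical and not taken from the prot-scriber source.

```rust
/// The keyword a user passes to request the built-in default.
const DEFAULT_KEYWORD: &str = "default";

/// Maps one raw command-line value to its effective value:
/// the keyword is swapped for the default, anything else is kept.
fn resolve_value(raw: &str, default: &str) -> String {
    if raw == DEFAULT_KEYWORD {
        default.to_string()
    } else {
        raw.to_string()
    }
}

/// Resolves every occurrence of a repeated option, preserving order
/// so that positions still line up with the corresponding inputs.
fn resolve_all(raw_values: &[&str], default: &str) -> Vec<String> {
    raw_values
        .iter()
        .map(|&raw| resolve_value(raw, default))
        .collect()
}
```

Because the keyword occupies a position of its own, a user can request the default for the first input and a custom value for the second without the positions shifting.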

regex to split genes in gene families

This option should be named family-gene-separator-regex; the default has been discussed in another issue, #6.

regex to split gene identifier from gene-list in gene family input files

This option should be named family-id-gene-list-separator, the default has been discussed in the respective issue #6. Enable the user to provide her/his custom regular expression.

blacklist regex list

Imagine you want to provide a custom blacklist for a Blast input table produced by searching your query proteins against a non-standard database, e.g. a genome that does not adhere to the Uniprot stitle line standards. To provide a blacklist that applies only to the Blast input table at the same position as the blacklist, the user would do this:

prot-scriber  --seq-sim-table my_proteins_vs_alien_genome.txt --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt \
  --blacklist-regex-list alien_genome_blacklist_regexs.txt

Because the argument --blacklist-regex-list appears only once while the argument --seq-sim-table appears twice, the first --blacklist-regex-list argument will be applied to the first --seq-sim-table my_proteins_vs_alien_genome.txt, but not to the second --seq-sim-table my_proteins_vs_Swissprot_blast_tableout.txt.
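The positional pairing described above might look like this in code. It is a sketch under the assumption that both options have been collected into slices in order of appearance; `pair_by_position` is a hypothetical helper, and how a missing list is treated afterwards is left open here.

```rust
/// Pairs each seq-sim-table with the blacklist given at the same position.
/// `None` means no blacklist was supplied for that position.
fn pair_by_position<'a>(
    tables: &[&'a str],
    blacklists: &[&'a str],
) -> Vec<(&'a str, Option<&'a str>)> {
    tables
        .iter()
        .enumerate()
        .map(|(i, &table)| (table, blacklists.get(i).copied()))
        .collect()
}
```

For the call shown above, the first table would be paired with `Some("alien_genome_blacklist_regexs.txt")` and the second with `None`.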

filter regex list

Enable custom lists of regular expressions with which the stitles of Blast hits are filtered. Remember that filtering cuts out undesired parts of stitle strings. See filter_stitle in model_funs.rs for more details.

Note that this option can be given multiple times and is position-sensitive, like blacklist-regex-list above. So, depending on its position, i.e. how many times the option has been provided with a value (a valid file path) so far, it applies to the corresponding --seq-sim-table.

informative regex list

A method to distinguish informative from un-informative words is currently being implemented as specified here. Un-informative words are not removed from phrases, but they are not scored either. They are detected by applying a list of regular expressions in sequence to each word; if any regex matches, the word is considered un-informative. Enable users to supply their own regex lists of un-informative words. Again, this option is position-sensitive: its order of appearance ties it to the --seq-sim-table input at the same position (see the examples above for details).
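The detection rule can be illustrated with a small sketch. To keep it dependency-free, plain functions stand in for compiled regular expressions; the `Matcher` type, the two example matchers and `tag_words` are all hypothetical and only meant to show the "any match means un-informative, but the word stays in the phrase" logic.

```rust
/// Stand-in for a compiled regular expression: true if the word matches.
type Matcher = fn(&str) -> bool;

/// Example matcher (assumption): words consisting only of ASCII digits.
fn is_number(word: &str) -> bool {
    !word.is_empty() && word.chars().all(|c| c.is_ascii_digit())
}

/// Example matcher (assumption): words shorter than three bytes.
fn is_short(word: &str) -> bool {
    word.len() < 3
}

/// A word is informative only if none of the matchers fires.
fn is_informative(word: &str, uninformative: &[Matcher]) -> bool {
    !uninformative.iter().any(|m| m(word))
}

/// Keeps every word of the phrase and flags whether it should be scored.
fn tag_words<'a>(phrase: &'a str, uninformative: &[Matcher]) -> Vec<(&'a str, bool)> {
    phrase
        .split_whitespace()
        .map(|word| (word, is_informative(word, uninformative)))
        .collect()
}
```

A scoring step would then skip the words flagged `false` while still leaving them in the phrase.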

@asishallab asishallab added the enhancement New feature or request label Dec 3, 2021
@Atemia

Atemia commented Dec 22, 2021

Since the informative regex list "is position context-sensitive" and it's optional for the user to provide a list of uninformative words, then what should be in the defaults? Should it be empty or are there obvious uninformative word examples that could be included?

@asishallab

Mostly (where applicable) done months ago by myself
