char conversion to dna4 is misbehaving #1864

alphahmed · 2020-05-30T12:26:12Z

When using assign_char_to and assign_char to assign a char to dna4, the normal behavior should result in implicit conversion of any character other than 'C', 'G', 'T' into 'A'; however, it is not really behaving as expected for all characters.

For example, the 'S' character is converted into 'C'... NOT 'A'

The text was updated successfully, but these errors were encountered:

marehr · 2020-05-31T11:30:20Z

Thank you for your input!

TLDR; S is a iupac symbol which can mean either C or G, we decided for the lexicographical smallest. So C in your case. This seems to be a documentation error, because we don't mention it anywhere.

Can you provide us your sequence? Is it publicly available?
Do you think that our assumption of an iupac alphabet is wrong?
Do you need a fast and efficient implementation that has a different conversion strategy? Everything except A, C, G, T should convert to N = A for dna4?

Longer Explanation:

We use the following conversion table:

seqan3/include/seqan3/alphabet/nucleotide/dna4.hpp

Lines 108 to 121 in 4fcbeb8

    
           // set U equal to T 
        
           ret['U'] = ret['T']; ret['u'] = ret['t']; 
        
           // iupac characters get special treatment, because there is no N 
        
           ret['R'] = ret['A']; ret['r'] = ret['A']; // or G 
        
           ret['Y'] = ret['C']; ret['y'] = ret['C']; // or T 
        
           ret['S'] = ret['C']; ret['s'] = ret['C']; // or G 
        
           ret['W'] = ret['A']; ret['w'] = ret['A']; // or T 
        
           ret['K'] = ret['G']; ret['k'] = ret['G']; // or T 
        
           ret['M'] = ret['A']; ret['m'] = ret['A']; // or T 
        
           ret['B'] = ret['C']; ret['b'] = ret['C']; // or G or T 
        
           ret['D'] = ret['A']; ret['d'] = ret['A']; // or G or T 
        
           ret['H'] = ret['A']; ret['h'] = ret['A']; // or C or T 
        
           ret['V'] = ret['A']; ret['v'] = ret['A']; // or C or G

We once had the behaviour of converting anything except ACGT to unknown = A, but we changed this to handle iupac symbols. See #55

I guess that this is a documentation issue and that we forgot to adapt the documentation years ago.

Our documentation does not say anything about the iupac conversion and falsely states

seqan3/test/snippet/alphabet/nucleotide/dna4.cpp

Lines 1 to 16 in 4fcbeb8

    
           #include <seqan3/alphabet/nucleotide/dna4.hpp> 
        
           #include <seqan3/core/debug_stream.hpp> 
        
           int main() 
        
           { 
        
               using seqan3::operator""_dna4; 
        
               seqan3::dna4 my_letter{'A'_dna4}; 
        
               my_letter.assign_char('C'); 
        
               my_letter.assign_char('F'); // unknown characters are implicitly converted to A. 
        
               if (my_letter.to_char() == 'A') 
        
                   seqan3::debug_stream << "yeah\n"; // "yeah"; 
        
           }

See https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1dna4.html#details

alphahmed · 2020-05-31T13:10:07Z

Thank you for the clarification!

Unfortunately, my data is not available for public sharing yet.
Alternative implementation of dna5 would be the option to pursue if 'N' is required instead of IUPAC equivilants; however, I think dna5 occupies larger memory.
Is the bitcompressed vector of dna5 less efficient than a dna4 bitcompressed vector? If so, would it be possible to have a second version of assign_char( ) for dna4 that converts to 'A' instead of the least of IUPAC?! Using IUPAC-based conversion is good for an IUPAC-intended sequence, but it would provide more randomness (one for each IUPAC symbol) if the text was intended to be ACTG-specific and there were errors, this way we can safely assume that there is bias for 'A' only.

marehr · 2020-05-31T18:18:35Z

Thank you for the quick reply.

Unfortunately, my data is not available for public sharing yet.

Okay, but can you give us a general idea which kind of data you have? Does similar data already exists? I'm not so much interested in the actual sequence, I'm more interested in the format you use. Why are there S characters in the sequence data? If it is not IUPAC-intended sequence, what is your data about? What is your domain?

Alternative implementation of dna5 would be the option to pursue if 'N' is required instead of IUPAC equivalents; however, I think dna5 occupies larger memory.

Oh you are right, I skimmed over the fact that dna5 assigns IUPAC symbols to N.

seqan3/include/seqan3/alphabet/nucleotide/dna5.hpp

Lines 101 to 118 in 4fcbeb8

    
           std::array<rank_type, 256> ret{}; 
        
           // initialize with UNKNOWN (std::array::fill unfortunately not constexpr) 
        
           for (auto & c : ret) 
        
               c = 3; // == 'N' 
        
           // reverse mapping for characters and their lowercase 
        
           for (size_t rnk = 0u; rnk < alphabet_size; ++rnk) 
        
           { 
        
               ret[         rank_to_char[rnk] ] = rnk; 
        
               ret[to_lower(rank_to_char[rnk])] = rnk; 
        
           } 
        
           // set U equal to T 
        
           ret['U'] = ret['T']; ret['u'] = ret['t']; 
        
           // iupac characters are implicitly "UNKNOWN" 
        
           return ret;

Uncompressed, dna4 and dna5 (and all "simple" alphabets) take as much space as a char (= 1byte). A bitcompressed vector would need log_2(4) = 2 bit for dna4 and log_2(5) = 3bit for dna5 per character. Our sequences are per default uncompressed.

Is the bitcompressed vector of dna5 less efficient than a dna4 bitcompressed vector?

Compression wise: dna5 (=3bit) is less efficient than dna4 (=2bit). Iterating over an uncompressed sequence will always be faster, because you need no time for the (de-)compression of the sequence. So it depends on your use case. If you are tight on memory you might profit from a compression factor of 4 for dna4 (i.e. 4 symbols * 2bit = 1char), but if you have no such constraints, I would suggest to use dna5.

Using IUPAC-based conversion is good for an IUPAC-intended sequence, but it would provide more randomness (one for each IUPAC symbol) if the text was intended to be ACTG-specific and there were errors, this way we can safely assume that there is bias for 'A' only.

I think this is an interesting point and I have never thought about this in that way. Out of curiosity, can you use this fact in a statistical way? Do you use some methods to deal with this bias? Or to quantify it?

alphahmed · 2020-06-01T10:05:22Z

I am working on comparing sequences and subsequences, which could be part of evolutionary genetics. I came across the dna4 behavior through trying random texts to implement my algorithms, since texts can provide better comprehension than an alphabet_size of 4 while testing.

The dna5 will do I guess, but processing huge amounts of data can benefit from a smaller alphabet_size.

GC-ratio is one of the common measures for comparing genomes, if there were a reduction in the randomness of wrong-nucleotide assignment, the ratio won't suffer as much.

smehringer · 2020-06-26T05:02:59Z

Hi @alphahmed

In case you are still interested, we now provide a cookbook entry of How to write a custom dna4 alphabet that converts unknown characters to A. It is very simple and you can just copy'n'paste the code.

I'm closing this issue hoping our solutions give you all you need. Feel free to reopen this issue anytime if there is something more!

alphahmed added the bug faulty or wrong behaviour of code label May 30, 2020

marehr closed this as completed May 31, 2020

marehr reopened this May 31, 2020

marehr added the response needed stalled since more user information is needed label May 31, 2020

marehr removed the response needed stalled since more user information is needed label Jun 2, 2020

smehringer mentioned this issue Jun 5, 2020

Cookbook: A custom dna4 alphabet #1875

Merged

marehr mentioned this issue Jun 24, 2020

character conversion to dna4 is misbehaving seqan/seqan#2432

Closed

smehringer closed this as completed Jun 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

char conversion to dna4 is misbehaving #1864

char conversion to dna4 is misbehaving #1864

alphahmed commented May 30, 2020

marehr commented May 31, 2020 •

edited

Loading

alphahmed commented May 31, 2020 •

edited

Loading

marehr commented May 31, 2020 •

edited

Loading

alphahmed commented Jun 1, 2020

smehringer commented Jun 26, 2020

char conversion to dna4 is misbehaving #1864

char conversion to dna4 is misbehaving #1864

Comments

alphahmed commented May 30, 2020

marehr commented May 31, 2020 • edited Loading

alphahmed commented May 31, 2020 • edited Loading

marehr commented May 31, 2020 • edited Loading

alphahmed commented Jun 1, 2020

smehringer commented Jun 26, 2020

marehr commented May 31, 2020 •

edited

Loading

alphahmed commented May 31, 2020 •

edited

Loading

marehr commented May 31, 2020 •

edited

Loading