Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

char conversion to dna4 is misbehaving #1864

Closed
alphahmed opened this issue May 30, 2020 · 5 comments
Closed

char conversion to dna4 is misbehaving #1864

alphahmed opened this issue May 30, 2020 · 5 comments
Labels
bug faulty or wrong behaviour of code

Comments

@alphahmed
Copy link

When using assign_char_to and assign_char to assign a char to dna4, the normal behavior should result in implicit conversion of any character other than 'C', 'G', 'T' into 'A'; however, it is not really behaving as expected for all characters.

For example, the 'S' character is converted into 'C'... NOT 'A'

@alphahmed alphahmed added the bug faulty or wrong behaviour of code label May 30, 2020
@marehr
Copy link
Member

marehr commented May 31, 2020

Thank you for your input!

TLDR; S is a iupac symbol which can mean either C or G, we decided for the lexicographical smallest. So C in your case. This seems to be a documentation error, because we don't mention it anywhere.


  • Can you provide us your sequence? Is it publicly available?
  • Do you think that our assumption of an iupac alphabet is wrong?
  • Do you need a fast and efficient implementation that has a different conversion strategy? Everything except A, C, G, T should convert to N = A for dna4?

Longer Explanation:

We use the following conversion table:

// set U equal to T
ret['U'] = ret['T']; ret['u'] = ret['t'];
// iupac characters get special treatment, because there is no N
ret['R'] = ret['A']; ret['r'] = ret['A']; // or G
ret['Y'] = ret['C']; ret['y'] = ret['C']; // or T
ret['S'] = ret['C']; ret['s'] = ret['C']; // or G
ret['W'] = ret['A']; ret['w'] = ret['A']; // or T
ret['K'] = ret['G']; ret['k'] = ret['G']; // or T
ret['M'] = ret['A']; ret['m'] = ret['A']; // or T
ret['B'] = ret['C']; ret['b'] = ret['C']; // or G or T
ret['D'] = ret['A']; ret['d'] = ret['A']; // or G or T
ret['H'] = ret['A']; ret['h'] = ret['A']; // or C or T
ret['V'] = ret['A']; ret['v'] = ret['A']; // or C or G

We once had the behaviour of converting anything except ACGT to unknown = A, but we changed this to handle iupac symbols. See #55

I guess that this is a documentation issue and that we forgot to adapt the documentation years ago.

Our documentation does not say anything about the iupac conversion and falsely states

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
int main()
{
using seqan3::operator""_dna4;
seqan3::dna4 my_letter{'A'_dna4};
my_letter.assign_char('C');
my_letter.assign_char('F'); // unknown characters are implicitly converted to A.
if (my_letter.to_char() == 'A')
seqan3::debug_stream << "yeah\n"; // "yeah";
}

See https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1dna4.html#details

@marehr marehr closed this as completed May 31, 2020
@marehr marehr reopened this May 31, 2020
@marehr marehr added the response needed stalled since more user information is needed label May 31, 2020
@alphahmed
Copy link
Author

alphahmed commented May 31, 2020

Thank you for the clarification!

  • Unfortunately, my data is not available for public sharing yet.
  • Alternative implementation of dna5 would be the option to pursue if 'N' is required instead of IUPAC equivilants; however, I think dna5 occupies larger memory.
  • Is the bitcompressed vector of dna5 less efficient than a dna4 bitcompressed vector? If so, would it be possible to have a second version of assign_char( ) for dna4 that converts to 'A' instead of the least of IUPAC?! Using IUPAC-based conversion is good for an IUPAC-intended sequence, but it would provide more randomness (one for each IUPAC symbol) if the text was intended to be ACTG-specific and there were errors, this way we can safely assume that there is bias for 'A' only.

@marehr
Copy link
Member

marehr commented May 31, 2020

Thank you for the quick reply.

Unfortunately, my data is not available for public sharing yet.

Okay, but can you give us a general idea which kind of data you have? Does similar data already exists? I'm not so much interested in the actual sequence, I'm more interested in the format you use. Why are there S characters in the sequence data? If it is not IUPAC-intended sequence, what is your data about? What is your domain?

Alternative implementation of dna5 would be the option to pursue if 'N' is required instead of IUPAC equivalents; however, I think dna5 occupies larger memory.

Oh you are right, I skimmed over the fact that dna5 assigns IUPAC symbols to N.

std::array<rank_type, 256> ret{};
// initialize with UNKNOWN (std::array::fill unfortunately not constexpr)
for (auto & c : ret)
c = 3; // == 'N'
// reverse mapping for characters and their lowercase
for (size_t rnk = 0u; rnk < alphabet_size; ++rnk)
{
ret[ rank_to_char[rnk] ] = rnk;
ret[to_lower(rank_to_char[rnk])] = rnk;
}
// set U equal to T
ret['U'] = ret['T']; ret['u'] = ret['t'];
// iupac characters are implicitly "UNKNOWN"
return ret;

Uncompressed, dna4 and dna5 (and all "simple" alphabets) take as much space as a char (= 1byte). A bitcompressed vector would need log_2(4) = 2 bit for dna4 and log_2(5) = 3bit for dna5 per character. Our sequences are per default uncompressed.

Is the bitcompressed vector of dna5 less efficient than a dna4 bitcompressed vector?

Compression wise: dna5 (=3bit) is less efficient than dna4 (=2bit). Iterating over an uncompressed sequence will always be faster, because you need no time for the (de-)compression of the sequence. So it depends on your use case. If you are tight on memory you might profit from a compression factor of 4 for dna4 (i.e. 4 symbols * 2bit = 1char), but if you have no such constraints, I would suggest to use dna5.

Using IUPAC-based conversion is good for an IUPAC-intended sequence, but it would provide more randomness (one for each IUPAC symbol) if the text was intended to be ACTG-specific and there were errors, this way we can safely assume that there is bias for 'A' only.

I think this is an interesting point and I have never thought about this in that way. Out of curiosity, can you use this fact in a statistical way? Do you use some methods to deal with this bias? Or to quantify it?

@alphahmed
Copy link
Author

I am working on comparing sequences and subsequences, which could be part of evolutionary genetics. I came across the dna4 behavior through trying random texts to implement my algorithms, since texts can provide better comprehension than an alphabet_size of 4 while testing.

The dna5 will do I guess, but processing huge amounts of data can benefit from a smaller alphabet_size.

GC-ratio is one of the common measures for comparing genomes, if there were a reduction in the randomness of wrong-nucleotide assignment, the ratio won't suffer as much.

@marehr marehr removed the response needed stalled since more user information is needed label Jun 2, 2020
@smehringer
Copy link
Member

Hi @alphahmed

In case you are still interested, we now provide a cookbook entry of How to write a custom dna4 alphabet that converts unknown characters to A. It is very simple and you can just copy'n'paste the code.

I'm closing this issue hoping our solutions give you all you need. Feel free to reopen this issue anytime if there is something more!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug faulty or wrong behaviour of code
Projects
None yet
Development

No branches or pull requests

3 participants