char conversion to dna4 is misbehaving #1864
Comments
Thank you for your input! TLDR;
Longer Explanation: We use the following conversion table: seqan3/include/seqan3/alphabet/nucleotide/dna4.hpp Lines 108 to 121 in 4fcbeb8
We once had the behaviour of converting anything except I guess that this is a documentation issue and that we forgot to adapt the documentation years ago. Our documentation does not say anything about the iupac conversion and falsely states seqan3/test/snippet/alphabet/nucleotide/dna4.cpp Lines 1 to 16 in 4fcbeb8
See https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1dna4.html#details |
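To make the table-based conversion described above concrete, here is a minimal standalone sketch. It is not the seqan3 source: the names `rank_to_char`, `make_char_to_rank`, and `convert` are hypothetical, lowercase handling is left out, and the only IUPAC entry included is the `S` to `C` mapping reported in this issue. Any other special cases in the real table are omitted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4-letter nucleotide alphabet; not seqan3 code.
const std::array<char, 4> rank_to_char{{'A', 'C', 'G', 'T'}};

// Build a 256-entry lookup table. Every byte maps to rank 0 ('A') by default.
inline std::array<unsigned char, 256> make_char_to_rank()
{
    std::array<unsigned char, 256> table{}; // value-initialised: all entries are 0, i.e. 'A'

    // The four alphabet letters map to their own rank.
    for (std::size_t rank = 0; rank < rank_to_char.size(); ++rank)
        table[static_cast<unsigned char>(rank_to_char[rank])] = static_cast<unsigned char>(rank);

    // The one IUPAC special case confirmed in this thread: 'S' (C or G) becomes 'C'.
    table['S'] = table['C'];

    return table;
}

const std::array<unsigned char, 256> char_to_rank = make_char_to_rank();

// Convert an arbitrary char to one of 'A', 'C', 'G', 'T' via the table.
inline char convert(char c)
{
    return rank_to_char[char_to_rank[static_cast<unsigned char>(c)]];
}
```

With this table, `convert('S')` yields `'C'`, while a character without a special entry, such as `'X'`, still falls back to `'A'`.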
Thank you for the clarification!
|
Thank you for the quick reply.
Okay, but can you give us a general idea of what kind of data you have? Does similar data already exist? I'm not so much interested in the actual sequence; I'm more interested in the format you use. Why are there S characters in the sequence data? If it is not an IUPAC-intended sequence, what is your data about? What is your domain?
Oh you are right, I skimmed over the fact that dna5 assigns IUPAC symbols to seqan3/include/seqan3/alphabet/nucleotide/dna5.hpp Lines 101 to 118 in 4fcbeb8
Uncompressed,
Compression-wise: dna5 (= 3 bit) is less efficient than dna4 (= 2 bit). Iterating over an uncompressed sequence will always be faster, because no time is spent on (de-)compressing the sequence. So it depends on your use case. If you are tight on memory, you might profit from a compression factor of
I think this is an interesting point and I have never thought about this in that way. Out of curiosity, can you use this fact in a statistical way? Do you use some methods to deal with this bias? Or to quantify it? |
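The 2-bit versus 3-bit comparison above can be turned into a quick back-of-the-envelope calculation. The helper below is a hypothetical sketch that assumes ideal tight packing and an 8-bit `char` for the uncompressed case. It does not model how any particular seqan3 container actually stores its data.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed to store `symbols` letters at `bits_per_symbol` bits each,
// rounded up to whole bytes. Assumes ideal packing with no padding or overhead.
inline std::size_t packed_bytes(std::size_t symbols, std::size_t bits_per_symbol)
{
    return (symbols * bits_per_symbol + 7) / 8;
}
```

Under these assumptions, 1000 symbols take 250 bytes at 2 bits, 375 bytes at 3 bits, and 1000 bytes as plain `char`.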
I am working on comparing sequences and subsequences, which could be part of evolutionary genetics. I came across the dna4 behavior while trying random texts to test my algorithms, since texts are easier to follow than an alphabet of size 4 during testing. dna5 will do, I guess, but processing huge amounts of data can benefit from a smaller alphabet_size. GC-ratio is one of the common measures for comparing genomes; if the wrong-nucleotide assignment were less random, the ratio would not suffer as much. |
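To illustrate the GC-ratio concern, here is a small sketch of how the ratio could be computed on an already-converted sequence. The function name `gc_ratio` is hypothetical and the code works on a plain `std::string` rather than a seqan3 alphabet container.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fraction of 'G' and 'C' letters in an already-converted sequence.
// Returns 0 for an empty sequence to avoid dividing by zero.
inline double gc_ratio(std::string const & seq)
{
    if (seq.empty())
        return 0.0;

    std::size_t gc = 0;
    for (char c : seq)
        if (c == 'G' || c == 'C')
            ++gc;

    return static_cast<double>(gc) / static_cast<double>(seq.size());
}
```

For the input text `ASGT`, a conversion that maps `S` to `C` gives `ACGT` with a ratio of 0.5, whereas mapping `S` to `A` gives `AAGT` with a ratio of 0.25, so the choice of fallback letter shifts the measure.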
Hi @alphahmed, in case you are still interested: we now provide a cookbook entry, How to write a custom dna4 alphabet that converts unknown characters to A. It is very simple and you can just copy'n'paste the code. I'm closing this issue, hoping our solutions give you all you need. Feel free to reopen this issue anytime if there is something more! |
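For readers who only need the behaviour and not a full alphabet type, the idea can be sketched without seqan3 at all. This is not the cookbook code: `to_strict_dna4` and `convert_strict` are hypothetical helper names, and the sketch is case-sensitive, so lowercase letters are also treated as unknown.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not the seqan3 cookbook code):
// keep 'C', 'G', 'T' and map every other character, including 'A' itself, to 'A'.
inline char to_strict_dna4(char c)
{
    switch (c)
    {
        case 'C':
        case 'G':
        case 'T':
            return c;
        default:
            return 'A';
    }
}

// Apply the strict conversion to every character of a text.
inline std::string convert_strict(std::string const & text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text)
        out.push_back(to_strict_dna4(c));
    return out;
}
```

Here `convert_strict("GATTACS")` returns `"GATTACA"`, because `S` has no special treatment and falls through to `'A'`.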
When using assign_char_to and assign_char to assign a char to dna4, the normal behavior should result in implicit conversion of any character other than 'C', 'G', 'T' into 'A'; however, it is not really behaving as expected for all characters. For example, the 'S' character is converted into 'C'... NOT 'A'