-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Alphabet] dna4 complement and SIMD #1970
Comments
Thank you for investigating this! The templates shouldn't matter too much. The compiler is pretty good at resolving them into normal for loops, but saying that views are not completely cost free. The real bottleneck is that we internally use look-up tables to convert I never had the time to validate those claims, because we are currently stabilizing the public interfaces. So it is fascinating that you are validating my gut feeling :) Our internal rank representation is currently It would be nice if you could try this patch out: diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
}
//!\}
+ constexpr dna4 & assign_char(char_type const c) noexcept
+ {
+ char_type upper_char = c & 0b0101'1111; // to_upper
+ assign_rank((upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C'));
+ return *this;
+ }
+
+ constexpr dna4 complement() const noexcept
+ {
+ return dna4{}.assign_rank(to_rank() ^ 0b11);
+ }
+
protected:
//!\privatesection (This won't do any look-ups and will only use arithmetic operations, which should be auto-vectorisable) This is your benchmark, right? auto s = std::string{forward};
auto s_as_dna = s | seqan3::views::char_to<seqan3::dna4>;
for (auto _ : state) {
auto revcomp =
s_as_dna | std::views::reverse | seqan3::views::complement;
// force evaluation of view
for (auto foo : revcomp)
benchmark::DoNotOptimize(foo);
} Can you try this variant for me (after you applied the patch)?: auto s = std::string{forward};
for (auto _ : state) {
auto revcomp =
s | std::views::reverse | seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement;
// force evaluation of view
for (auto foo : revcomp)
benchmark::DoNotOptimize(foo);
} For cache behaviour I would first reverse and when do the char transformation. These are all lazy operations and the order shouldn't matter too much, but I would expect that doing it in this order should be tiny bit faster. |
That is the issue here. Even with my patch applied the code is compiled into a normal, scaler loop, not a (auto-)vectorized one despite full optimizations.
On x86_64 you can use
That makes it much slower than my patch and your original version. (Changing the benchmarking code makes no big difference.)
|
Ooops. Turns out my benchmarking code was wrong. Because I DoNotOptimize(foo) I basically force byte-wise generation. Storing the result in a vector is properly vectorized.
So your new patch is really really fast. In fact it is much faster than I would expect it to be as it does a lot per loop iteration. |
Phew. I thought I'm stupid xD A lot in a loop does not mean much, if it can pipeline a lot. (Which should be the case here) Thank you for bringing this to our attention and I think it is extremely valuable to have this kind of constructive feedback. I was disconnected form internet today: I wanted to try this patch next diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..4c224419f 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -47,11 +47,11 @@ class rna4;
* If the special char conversion of IUPAC characters is not your desired behavior, refer to our cookbook for an
* example of \ref cookbook_custom_dna4_alphabet to change the conversion behavior.
*/
-class dna4 : public nucleotide_base<dna4, 4>
+class dna4 : public nucleotide_base<dna4, 256>
{
private:
//!\brief The base class.
- using base_t = nucleotide_base<dna4, 4>;
+ using base_t = nucleotide_base<dna4, 256>;
//!\brief Befriend seqan3::nucleotide_base.
friend base_t;
@@ -82,6 +82,44 @@ public:
}
//!\}
+ static uint8_t constexpr alphabet_size = 4;
+
+ constexpr dna4 & assign_rank(rank_type const c) noexcept
+ {
+ base_t::assign_rank(c == 0 ? 'A' : (c == 1 ? 'C' : (c == 2 ? 'G' : 'T')));
+ return *this;
+ }
+
+ constexpr dna4 & assign_char(char_type const c) noexcept
+ {
+ base_t::assign_rank(c);
+ return *this;
+ }
+
+ constexpr char_type to_char() const noexcept
+ {
+ return base_t::to_rank();
+ }
+
+ constexpr rank_type to_rank() const noexcept
+ {
+ char_type c{to_char()};
+ char_type upper_char = c & 0b0101'1111; // to_upper
+ rank_type rank{};
+ rank = (upper_char == 'T') * 3 + (upper_char == 'G') * 2 + (upper_char == 'C');
+ return rank;
+ }
+
+ constexpr dna4 complement() const noexcept
+ {
+ char_type rank{to_char()};
+ rank ^= rank % 2 ? ('C' ^ 'G') : ('A' ^ 'T');
+
+ dna4 ret{};
+ static_cast<base_t &>(ret).assign_rank(rank);
+ return ret;
+ }
+
protected:
//!\privatesection
which is basically the same as not converting a char when reading it in and doing your pretty nifty trick in the complement with the conditional xor. :) |
@kloetzl Can you tell me the tool which you used to visualise the assembler / opcode? Thank you! :) |
Those are the Linux perf tools. Just |
Btw, next time you tag a release please add the (EDIT marehr: This will be tracked at seqan/product_backlog#159) |
Found a bug in my code. Here are some new numbers using this patch.
|
I recomputed the table above with the latest master on my old laptop.
|
Thank you for reminding :) I guess those numbers are without the proposed patch. Maybe I add some progress report: We started to change our alphabet_base implementation to allow any kind of One of the major problems we are facing right now, is how to provide a sane interface that provides the best implementation in both the generic and simd-aware use cases. Some benchmarks showed that switching from look-up tables to arithmetic operations can have a drastic performance penality with SIMD disabled. Also use cases that only call It seems like enabling SIMD computation will need some special care, in the past I thought that adding a special view std::vector<char> sequence{};
auto view = sequence | std::views::reverse | seqan3::views::simdify(seqan3::views::char_to<seqan3::dna4> | seqan3::views::complement); might be a viable solution, but I'm not completely sure if that is still a good idea. Right now I think changing the way how the API is called might be a good first step, e.g. // single conversion, non-SIMDifiable
// always uses lookup-tables
dna4::char_to_rank('A') == 0;
using rank_type = seqan3::alphabet_rank_t<seqan3::dna4>;
std::vector<char> sequence{'A', 'C', 'G', 'T'};
std::vector<rank_type> ranks(4, 0); // same size as sequence
// std::span is contiguous memory, multiple conversions, SIMDifiable
// uses lookup-tables without any optimisations, uses SIMD-operations when using SSE4, ... etc.
dna4::char_to_rank(std::span{ranks} /*out-param*/, std::span{sequence} /*in-param*/);
std::vector<rank_type> complement_ranks(4, 0);
dna4::rank_complement(std::span{complement_ranks} /*out-param*/, std::span{ranks});
std::vector<char> complement_sequence(4, 'A');
dna4::char_complement(std::span{complement_sequence} /*out-param*/, std::span{sequence}); This would allow to provide a more "C" like interface to work directly on contiguous, raw-pointers and I guess would be a good building-block to provide a high-level interface such as |
2022 update, numbers still look the same:
|
If the small patch by @marehr up there (#1970 (comment)) already improves performance so much, can we just apply that, even if we don't have a more general solution for the other alphabets, yet? edit: ah, I see, it slows things down for the non-vectorised use-case, right? May be just `#ifdef' around it? |
Although from @marehr's last comment it doesn't feel like a good solution to just patch it, I would have to look deeper into this, which |
I would just check for AVX2: #ifdef __AVX2__ But before adding, it would probably be good to have a benchmark included, as well. What was the latest benchmark code you were using, @kloetzl ? |
I wouldn't add this as a My patch is just good for a certain use case and without carefully crafting your code that makes use of this patch (mainly auto-vectorizable loops on contiguous memory), you will have a performance degeneration for all other use cases (most user code will NOT be in this form). I think the more appropriate option would be to add special alphabets for exactly this use case: seqan3::simd::dna4_rank dna4_1; // (rank based implementation, my patch)
seqan3::simd::dna4_char dna4_2; // (char based implementation, kloetzl implementation) This would just be one part, the second part would be the question of how to efficiently have sequence operations/algorithms, like e.g. "reverse_complement". But, lacking an application that uses this, I don't know what the best API would be. As I wrote in #1970 (comment), my gut feeling is that having "algorithmic" functions (like in |
I agree with @marehr that we should not aim for an easy-patch solution when we do not have a use case at hand. (I don't count a specific benchmark as use case). @h-2 An Regarding the extra alphabet, we could maybe add a cookbook entry for now that basically provides an efficient dna4 simd alphabet with @marehr s patch applied. Or maybe a documentation entry in the alphabt module page or even an how-to if someone has the time. With a note that the user interested in it should report back to us if he/she has a specific use case. |
Re-reading this thread, kloetzl proposed a "better" complement implementation for dna4 / rna4. He stated that
Currently, we use lookup tables for reverse complement, seqan3/include/seqan3/alphabet/nucleotide/dna4.hpp Lines 160 to 176 in e5c63f9
and kloetzl's patch public:
constexpr dna4 complement() const noexcept
{
return dna4{}.assign_rank(3 - to_rank());
} or my patch diff --git a/include/seqan3/alphabet/nucleotide/dna4.hpp b/include/seqan3/alphabet/nucleotide/dna4.hpp
index ae86fc609..9f60668ca 100644
--- a/include/seqan3/alphabet/nucleotide/dna4.hpp
+++ b/include/seqan3/alphabet/nucleotide/dna4.hpp
@@ -82,6 +82,18 @@ public:
}
//!\}
+ constexpr dna4 complement() const noexcept
+ {
+ return dna4{}.assign_rank(to_rank() ^ 0b11);
+ } would use arithmetic operations instead. As both are "single" instruction solutions, they should always be at least as good as the lookup table (for all use cases). See https://godbolt.org/z/TKKoEv1bc for generated instructions. So I would suggest changing dna4's complement implementation from lookup to xor. (We/I already did something similar for the phred implementation, fdf35e5)
Adding a cookbook entry how to write an alphabet that works in auto-vectorized loops would be fine for me with that remark. |
Here you go: https://github.com/kloetzl/libdna/blob/master/bench2/revcomp.cxx
I would argue that computing the reverse_complement is more common than just the complement on its own.
One more problem with an ifdef is portability. If an application using seqan would be compiled as a debian package on a machine supporting AVX2 it would not run on a users machine without AVX2 support.
Yes, xor and subtraction are near-optimal solutions. That should be good enough for most use cases. |
Alright, to me it looks like the following tasks would be open then:
What I don't see happening but is part of the discussion:
I'll propose this in the next core meeting on Monday :) |
I think for dna5 some potential arithmetic versions could be: uint8_t dna5_complement_subtract(uint8_t rank)
{
return min(3 - rank, 4); // using underflow
}
uint8_t dna5_complement_xor_is_dna5(uint8_t rank)
{
bool is_n = rank > 3;
return is_n ? rank : (rank ^ 0b11); // using conditional assignment
}
uint8_t dna5_complement_xor_min(uint8_t rank)
{
rank = rank ^ 0b11;
return min(rank, 4); // correcting reverse complement of N to be N
} Assembler suggests that |
That is very similar to how bwa does it. |
Core Meeting 2022-04-25We agreed on the tasks outlined in comment #1970 (comment). Writing the benchmark has highest priority. |
We merged #3026 which adds an example for an auto-vectorizable dna4 alphabet and improves the performance of the dna4-complement and also adds a benchmark. Thanks again a lot for your work, @kloetzl ! You are welcome to open another issue if you have other suggestions, we are excited to hear about them :) |
That's great to hear! Pleased I could help and we managed to produce such a nice speed up. 👍 |
Platform
Description
Adding the following function to dna4 makes complementing it about 16% faster. Patch at kloetzl@c43427f
Now, you might stop here and think this is ugly and 16% is not worth it. Or you go all in for performance and merge this patch. However, there is more to this.
I am currently looking into fast ways to complement dna strings. Comparing my own implementation with BWA and Seqan3, I noticed that the instructions generated for seqan are suboptimal. (Note that I am not an expert on seqan, I just used what was recommended in the tutorial.) Digging into the code I found that it can be simplified and boils down to a subtraction. Here are the runtimes of computing the reverse complement of a megabase:
Even with the patch seqan is almost nine time slower than BWA which also uses a subtraction. However, the BWA code gets vectorized by the compiler, whereas seqan does not. So there is something in the template magic preventing that optimization. Here is the assembly that GCC generates:
Interestingly even this assembly is kinda weird. Note how the $0x3 is stored in a register %esi, and that is then moved on each iteration of the loop. There might be something fishy going on here because rank_type is unsigned char.
Just wanted to share what I found in the hope you can make sense of it. ☺
Best,
Fabian
The text was updated successfully, but these errors were encountered: