-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] seqan3::align_multiple for MSA dependent on seqan2 #1990
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1990 +/- ##
==========================================
+ Coverage 97.86% 97.88% +0.01%
==========================================
Files 267 269 +2
Lines 10055 10145 +90
==========================================
+ Hits 9840 9930 +90
Misses 215 215
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you go through the documentation and seqan2
->SeqAn2
, seqan3
->SeqAn3
, seqan
->SeqAn
, where not used as part of a type.
I'm also pretty sure, that we don't want to explicitly say SeqAn3 because of semantic versioning, but I don't see how to omit the 3 and not get entirely confused :)
Also: msa
->MSA
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration) | ||
{ | ||
#if !SEQAN3_HAS_SEQAN2 // multiple sequence alignment is only enabled with seqan2 | ||
static_assert(false, "You need to have seqan 2.x"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static_assert(false, "You need to have seqan 2.x"); | |
static_assert(false, "You need to have SeqAn >= 2.4"); |
Actually we even require the develop branch, so you could even put the version of the next release (The-SeqAn3-Compatibility-Release) here....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we release that? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shortly before 3.1 I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the next SeqAn 2 release should come shortly after 3.0.2 :)
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
test/unit/alignment/multiple/align_multiple_seqan2_adaptation_test.cpp
Outdated
Show resolved
Hide resolved
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
Done. (also not if used as a variable name etc. just docs)
Well, in the public API I did not mention either SeqAn3 nor SeqAn2 but just in the detail documentation that actually uses both libraries. I think removing that would be too confusing and since it is hidden from the user I hope it is fine?
Done. (I think it was only once?) |
Co-authored-by: Simon Sasse <simonsasse97@gmail.com> Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
Co-authored-by: Simon Sasse <simonsasse97@gmail.com> Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
Co-authored-by: Simon Sasse <simonsasse97@gmail.com> Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are my first thoughts on the thing.
This really really great and a very good starting point.
Some things that are just in my mind for follow up things:
- Make the return type an actual alignment_result and store the alignment etc. inside of it.
- We might be able to build the alignment graph over origanal sequences, since the gaps are merely stored as position offsets. Then we could avoid copying everything again.
- Make the input generic for ranges.
#if SEQAN3_HAS_SEQAN2 | ||
#include <seqan/graph_msa.h> | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put this below the seqan3 includes? Then they are not spread
* Computes a multiple sequence alignment from the given input sequences, using a consistency-based progressive | ||
* alignment algorithm on a graph of sequence segments. You can use the configuration object to | ||
* specify various parameters, like gap scores, alignment scores and band constraints. The return type is | ||
* `std::vector<std::vector<gapped_alphabet_type>>`, with the inner type derived from the input sequence type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please also add the paper reference here? Because no one knows what consistency-based progressive alignment on graph sequence segments
is supposed to mean.
* `std::vector<std::vector<gapped_alphabet_type>>`, with the inner type derived from the input sequence type. | ||
*/ | ||
template <std::ranges::forward_range range_t, typename config_t = decltype(align_cfg::msa_default_configuration)> | ||
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration) | |
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration) |
Was this intended, to only allow std::vector here? So I can't have a view over a collection of sequences.
It must not be changed now. It is ok to at least track this in a backlog item for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it should definitely change in the future. I just didn't want to make the first PR to big with the additional interfaces :)
include/seqan3/alignment/multiple/detail/align_multiple_seqan2_adaptation.hpp
Outdated
Show resolved
Hide resolved
template <typename candidate_t> | ||
struct is_invalid_msa_config : | ||
std::negation<std::disjunction<std::is_same<candidate_t, seqan3::align_cfg::band_fixed_size>, | ||
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap>, | ||
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>>>{}; | ||
|
||
/*!\brief Validate the given MSA configuration. | ||
* \tparam configs_t The specified configuration elements. | ||
*/ | ||
template <typename ... configs_t> | ||
void validate_configuration(seqan3::configuration<configs_t...> const &) | ||
{ | ||
static_assert(seqan3::pack_traits::find_if<is_invalid_msa_config, configs_t...> == -1, | ||
"The given MSA configuration is not valid."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
template <typename candidate_t> | |
struct is_invalid_msa_config : | |
std::negation<std::disjunction<std::is_same<candidate_t, seqan3::align_cfg::band_fixed_size>, | |
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap>, | |
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>>>{}; | |
/*!\brief Validate the given MSA configuration. | |
* \tparam configs_t The specified configuration elements. | |
*/ | |
template <typename ... configs_t> | |
void validate_configuration(seqan3::configuration<configs_t...> const &) | |
{ | |
static_assert(seqan3::pack_traits::find_if<is_invalid_msa_config, configs_t...> == -1, | |
"The given MSA configuration is not valid."); | |
} | |
template <typename candidate_t> | |
static constexpr bool is_valid_msa_config = | |
std::is_same_v<candidate_t, seqan3::align_cfg::band_fixed_size> || | |
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap> || | |
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>; | |
/*!\brief Validate the given MSA configuration. | |
* \tparam configs_t The specified configuration elements. | |
*/ | |
template <typename ... configs_t> | |
void validate_configuration(seqan3::configuration<configs_t...> const &) | |
{ | |
static_assert(!(is_valid_msa_config<configs_t> && ...), "The given MSA configuration is not valid."); | |
} |
That would be solution one, which is I think less difficult to understand.
And before we do this. Is that really necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And before we do this. Is that really necessary?
Well, we could also just ignore all other configuration types but I thought it might be good to know for the user which are allowed instead of just silently ignoring them 🤷
using seqan3_types = type_list<dna4, dna5, dna15, rna4, rna5, aa27, aa10murphy, aa10li>; | ||
//!\brief The corresponding list of SeqAn2 types (the order must be the same!). | ||
using seqan2_types = type_list<seqan::Dna, | ||
seqan::Dna5, | ||
seqan::Iupac, | ||
seqan::Rna, | ||
seqan::Rna5, | ||
seqan::AminoAcid, | ||
seqan::ReducedAminoAcid<seqan::Murphy10>, | ||
seqan::ReducedAminoAcid<seqan::Li10>>; | ||
//!\brief The index of the seqan3_alphabet_type in the list of seqan3_types. | ||
static constexpr auto index = list_traits::find<seqan3_alphabet_type, seqan3_types>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not completely convinced, that this is the correct approach. It doesn't scale nor does it make sense, because it destroys the entire idea of being generic. I know it is a first draft and I am willing to keep this for now, to also get along with other things but since we are going to work more with SeqAn2 in the future we should generally think about the compatibility.
Brain Dump
Basically there are two ways to go along:
- Add a seqan::SimpleType<char, SeqAn3AlphabetAdaptor> that reduces any alphabet we pass to its char value
- Make all seqan::SimpleType interface functions available for types that model seqan3::[writable_]alphabet.
The first approach has the advantage, that it we basically only need to define this type and overload BitsPerValue and ValueSize. I think all the remaining functions would resolve to defaults for char.
The second approach has the advantage, that we could more fine grained control concepts like nucleotide_alphabet etc. But it is much more verbose and we might not need this finer control anyway.
In the optimal case, if we only use the iterator interfaces inside of the seqan algorithms, we should be able to also pass along our sequence with a transform view to the adaptor alphabet. We probably have to do some work to make the meta functions in SeqAn compatible though. But that should be in general possible. If we would then (which I have on my list of things to do anyway) also make the seqan::SimpleType model seqan3::alphabet etc. we would/could basically forward the output to the seqan3 world without transforming it before. Of course there are more bits to it and this has to be evaluated. But that would be the way to go I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, sounds like a good plan. But it also sounds like this might take some time because we can't estimate how easy it will be to adapt the interfaces. I would take this on as a follow up that we can discuss and schedule afterwards if that is fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a PR against your repo branch a while ago. There it is already removed and done via char
only. Have you seen it?
|
||
seqan::Blosum62 expected_matrix{}; | ||
for (size_t i = 0; i < static_cast<size_t>(seqan::Blosum62::TAB_SIZE); ++i) | ||
EXPECT_EQ(msaOpt.sc.data_tab[i], expected_matrix.data_tab[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, the check should not be on the internals but on the public interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if I get you right here, why should we not test the detail interfaces?
Note that the public interface also tests the BLOSUM matrix for the correct MSA (see align_multiple_test).
* See seqan3::detail::align_multiple_seqan2_adaptation::create_msa_configuration for more details on the configuration. | ||
* \endif | ||
*/ | ||
constexpr configuration msa_default_configuration = gap{gap_scheme{gap_score{-1}, gap_open_score{-13}}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not required. The adaptor already selects the defaults and we can simply use an empty seqan3::configration as the optinal argument if that is ok with the algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this makes the default (that I can configure) more explicit but sure I can move this into the adaptions.
Co-authored-by: rrahn <rene.rahn@fu-berlin.de>
Yes, already a follow up!
I tried to use std::vectors and seqan3 alphabets but I could several deep down errors that I couldn't understand (although I did not put much effort into it). This would be a follow up that needs to be investigated also considering you suggestions to use a simple char type as adaptation.
Yes. |
the MSA is not used right now. Before having an actual use case we will not pull in the MSA module |
Part of seqan/product_backlog#104
Note: Changing the result type as discussed into a
seqan3::multiple_sequence_alignment_result
type that models a range over aligned sequences will follow in a separate PR.