Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] seqan3::align_multiple for MSA dependent on seqan2 #1990

Closed
wants to merge 7 commits into from

Conversation

smehringer
Copy link
Member

@smehringer smehringer commented Jul 20, 2020

Part of seqan/product_backlog#104

Note: Changing the result type as discussed into a seqan3::multiple_sequence_alignment_result type that models a range over aligned sequences will follow in a separate PR.

@codecov
Copy link

codecov bot commented Sep 1, 2020

Codecov Report

Merging #1990 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1990      +/-   ##
==========================================
+ Coverage   97.86%   97.88%   +0.01%     
==========================================
  Files         267      269       +2     
  Lines       10055    10145      +90     
==========================================
+ Hits         9840     9930      +90     
  Misses        215      215              
Impacted Files Coverage Δ
...clude/seqan3/alignment/multiple/align_multiple.hpp 100.00% <100.00%> (ø)
...ltiple/detail/align_multiple_seqan2_adaptation.hpp 100.00% <100.00%> (ø)
include/seqan3/io/stream/iterator.hpp 98.52% <0.00%> (-0.03%) ⬇️
include/seqan3/search/search.hpp 100.00% <0.00%> (ø)
include/seqan3/range/views/detail.hpp 100.00% <0.00%> (ø)
include/seqan3/search/search_result.hpp 100.00% <0.00%> (ø)
include/seqan3/search/fm_index/fm_index_cursor.hpp 100.00% <0.00%> (ø)
...lude/seqan3/search/fm_index/bi_fm_index_cursor.hpp 100.00% <0.00%> (ø)
...de/seqan3/argument_parser/detail/version_check.hpp 92.74% <0.00%> (ø)
...qan3/alignment/pairwise/alignment_configurator.hpp 100.00% <0.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a02af9...f7e06a8. Read the comment docs.

@smehringer smehringer changed the title First draft to peek onto seqan3::align_multiple [FEATURE] seqan3::align_multiple for MSA dependent on seqan2 Sep 1, 2020
@smehringer smehringer marked this pull request as ready for review September 1, 2020 07:19
@smehringer smehringer requested a review from eseiler September 1, 2020 07:19
Copy link
Member

@eseiler eseiler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you go through the documentation and seqan2->SeqAn2, seqan3->SeqAn3, seqan->SeqAn, where not used as part of a type.

I'm also pretty sure, that we don't want to explicitly say SeqAn3 because of semantic versioning, but I don't see how to omit the 3 and not get entirely confused :)

Also: msa->MSA

include/seqan3/alignment/multiple/align_multiple.hpp Outdated Show resolved Hide resolved
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration)
{
#if !SEQAN3_HAS_SEQAN2 // multiple sequence alignment is only enabled with seqan2
static_assert(false, "You need to have seqan 2.x");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static_assert(false, "You need to have seqan 2.x");
static_assert(false, "You need to have SeqAn >= 2.4");

Actually we even require the develop branch, so you could even put the version of the next release (The-SeqAn3-Compatibility-Release) here....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we release that? :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shortly before 3.1 I guess

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the next SeqAn 2 release should come shortly after 3.0.2 :)

test/unit/alignment/multiple/align_multiple_test.cpp Outdated Show resolved Hide resolved
test/unit/alignment/multiple/align_multiple_test.cpp Outdated Show resolved Hide resolved
@smehringer
Copy link
Member Author

Can you go through the documentation and seqan2->SeqAn2, seqan3->SeqAn3, seqan->SeqAn, where not used as part of a type.

Done. (also not if used as a variable name etc. just docs)

I'm also pretty sure, that we don't want to explicitly say SeqAn3 because of semantic versioning, but I don't see how to omit the 3 and not get entirely confused :)

Well, in the public API I did not mention either SeqAn3 nor SeqAn2 but just in the detail documentation that actually uses both libraries. I think removing that would be too confusing and since it is hidden from the user I hope it is fine?

Also: msa->MSA

Done. (I think it was only once?)

@smehringer smehringer requested a review from eseiler September 2, 2020 14:20
include/seqan3/alignment/multiple/align_multiple.hpp Outdated Show resolved Hide resolved
include/seqan3/core/platform.hpp Outdated Show resolved Hide resolved
smehringer and others added 6 commits September 17, 2020 08:09
Co-authored-by: Simon Sasse <simonsasse97@gmail.com>
Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
Co-authored-by: Simon Sasse <simonsasse97@gmail.com>
Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
Co-authored-by: Simon Sasse <simonsasse97@gmail.com>
Co-authored-by: Jörg Winkler <j.winkler@fu-berlin.de>
Copy link
Contributor

@rrahn rrahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are my first thoughts on the thing.
This really really great and a very good starting point.
Some things that are just in my mind for follow up things:

  • Make the return type an actual alignment_result and store the alignment etc. inside of it.
  • We might be able to build the alignment graph over origanal sequences, since the gaps are merely stored as position offsets. Then we could avoid copying everything again.
  • Make the input generic for ranges.

Comment on lines +22 to +24
#if SEQAN3_HAS_SEQAN2
#include <seqan/graph_msa.h>
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put this below the seqan3 includes? Then they are not spread

include/seqan3/alignment/multiple/align_multiple.hpp Outdated Show resolved Hide resolved
include/seqan3/alignment/multiple/align_multiple.hpp Outdated Show resolved Hide resolved
* Computes a multiple sequence alignment from the given input sequences, using a consistency-based progressive
* alignment algorithm on a graph of sequence segments. You can use the configuration object to
* specify various parameters, like gap scores, alignment scores and band constraints. The return type is
* `std::vector<std::vector<gapped_alphabet_type>>`, with the inner type derived from the input sequence type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please also add the paper reference here? Because no one knows what consistency-based progressive alignment on graph sequence segments is supposed to mean.

* `std::vector<std::vector<gapped_alphabet_type>>`, with the inner type derived from the input sequence type.
*/
template <std::ranges::forward_range range_t, typename config_t = decltype(align_cfg::msa_default_configuration)>
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration)
auto align_multiple(std::vector<range_t> const & input, config_t config = align_cfg::msa_default_configuration)

Was this intended, to only allow std::vector here? So I can't have a view over a collection of sequences.
It must not be changed now. It is ok to at least track this in a backlog item for the future.

Copy link
Member Author

@smehringer smehringer Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it should definitely change in the future. I just didn't want to make the first PR to big with the additional interfaces :)

Comment on lines +187 to +201
template <typename candidate_t>
struct is_invalid_msa_config :
std::negation<std::disjunction<std::is_same<candidate_t, seqan3::align_cfg::band_fixed_size>,
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap>,
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>>>{};

/*!\brief Validate the given MSA configuration.
* \tparam configs_t The specified configuration elements.
*/
template <typename ... configs_t>
void validate_configuration(seqan3::configuration<configs_t...> const &)
{
static_assert(seqan3::pack_traits::find_if<is_invalid_msa_config, configs_t...> == -1,
"The given MSA configuration is not valid.");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template <typename candidate_t>
struct is_invalid_msa_config :
std::negation<std::disjunction<std::is_same<candidate_t, seqan3::align_cfg::band_fixed_size>,
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap>,
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>>>{};
/*!\brief Validate the given MSA configuration.
* \tparam configs_t The specified configuration elements.
*/
template <typename ... configs_t>
void validate_configuration(seqan3::configuration<configs_t...> const &)
{
static_assert(seqan3::pack_traits::find_if<is_invalid_msa_config, configs_t...> == -1,
"The given MSA configuration is not valid.");
}
template <typename candidate_t>
static constexpr bool is_valid_msa_config =
std::is_same_v<candidate_t, seqan3::align_cfg::band_fixed_size> ||
is_type_specialisation_of<candidate_t, seqan3::align_cfg::gap> ||
is_type_specialisation_of<candidate_t, seqan3::align_cfg::scoring_scheme>;
/*!\brief Validate the given MSA configuration.
* \tparam configs_t The specified configuration elements.
*/
template <typename ... configs_t>
void validate_configuration(seqan3::configuration<configs_t...> const &)
{
static_assert(!(is_valid_msa_config<configs_t> && ...), "The given MSA configuration is not valid.");
}

That would be solution one, which is I think less difficult to understand.

And before we do this. Is that really necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And before we do this. Is that really necessary?

Well, we could also just ignore all other configuration types but I thought it might be good to know for the user which are allowed instead of just silently ignoring them 🤷

Comment on lines +59 to +70
using seqan3_types = type_list<dna4, dna5, dna15, rna4, rna5, aa27, aa10murphy, aa10li>;
//!\brief The corresponding list of SeqAn2 types (the order must be the same!).
using seqan2_types = type_list<seqan::Dna,
seqan::Dna5,
seqan::Iupac,
seqan::Rna,
seqan::Rna5,
seqan::AminoAcid,
seqan::ReducedAminoAcid<seqan::Murphy10>,
seqan::ReducedAminoAcid<seqan::Li10>>;
//!\brief The index of the seqan3_alphabet_type in the list of seqan3_types.
static constexpr auto index = list_traits::find<seqan3_alphabet_type, seqan3_types>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not completely convinced, that this is the correct approach. It doesn't scale nor does it make sense, because it destroys the entire idea of being generic. I know it is a first draft and I am willing to keep this for now, to also get along with other things but since we are going to work more with SeqAn2 in the future we should generally think about the compatibility.

Brain Dump

Basically there are two ways to go along:

  1. Add a seqan::SimpleType<char, SeqAn3AlphabetAdaptor> that reduces any alphabet we pass to its char value
  2. Make all seqan::SimpleType interface functions available for types that model seqan3::[writable_]alphabet.

The first approach has the advantage, that it we basically only need to define this type and overload BitsPerValue and ValueSize. I think all the remaining functions would resolve to defaults for char.
The second approach has the advantage, that we could more fine grained control concepts like nucleotide_alphabet etc. But it is much more verbose and we might not need this finer control anyway.
In the optimal case, if we only use the iterator interfaces inside of the seqan algorithms, we should be able to also pass along our sequence with a transform view to the adaptor alphabet. We probably have to do some work to make the meta functions in SeqAn compatible though. But that should be in general possible. If we would then (which I have on my list of things to do anyway) also make the seqan::SimpleType model seqan3::alphabet etc. we would/could basically forward the output to the seqan3 world without transforming it before. Of course there are more bits to it and this has to be evaluated. But that would be the way to go I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, sounds like a good plan. But it also sounds like this might take some time because we can't estimate how easy it will be to adapt the interfaces. I would take this on as a follow up that we can discuss and schedule afterwards if that is fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR against your repo branch a while ago. There it is already removed and done via char only. Have you seen it?


seqan::Blosum62 expected_matrix{};
for (size_t i = 0; i < static_cast<size_t>(seqan::Blosum62::TAB_SIZE); ++i)
EXPECT_EQ(msaOpt.sc.data_tab[i], expected_matrix.data_tab[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the check should not be on the internals but on the public interfaces.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I get you right here, why should we not test the detail interfaces?
Note that the public interface also tests the BLOSUM matrix for the correct MSA (see align_multiple_test).

* See seqan3::detail::align_multiple_seqan2_adaptation::create_msa_configuration for more details on the configuration.
* \endif
*/
constexpr configuration msa_default_configuration = gap{gap_scheme{gap_score{-1}, gap_open_score{-13}}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not required. The adaptor already selects the defaults and we can simply use an empty seqan3::configration as the optinal argument if that is ok with the algorithm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this makes the default (that I can configure) more explicit but sure I can move this into the adaptions.

Co-authored-by: rrahn <rene.rahn@fu-berlin.de>
@smehringer
Copy link
Member Author

* Make the return type an actual alignment_result and store the alignment etc. inside of it.

Yes, already a follow up!

* We might be able to build the alignment graph over original sequences, since the gaps are merely stored as position offsets. Then we could avoid copying everything again.

I tried to use std::vectors and seqan3 alphabets but I could several deep down errors that I couldn't understand (although I did not put much effort into it). This would be a follow up that needs to be investigated also considering you suggestions to use a simple char type as adaptation.

* Make the input generic for ranges.

Yes.

@smehringer
Copy link
Member Author

the MSA is not used right now. Before having an actual use case we will not pull in the MSA module

@smehringer smehringer closed this Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants