Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Adds new format GenomeDataDirectoryFormat with genome_dict function #345

Merged
merged 6 commits into from
Oct 8, 2024

Conversation

VinzentRisch
Copy link
Contributor

@VinzentRisch VinzentRisch commented Oct 4, 2024

closes #344

For easier handling of GenesDirectoryFormat, ProteinsDirectoryFormat, LociDirectoryFormat and DNASequencesDirectoryFormat in q2-amrfinderplus I created a new DirFormat called GenomeDataDirectoryFormat that they all inherit from. This new dirformat has a function called genome_dict.

  • It will return a dict of file names as keys and full paths to files as values.
  • These functions will work similarly to sample_dict of MultiMAGSequencesDirFmt or feature_dict of MAGSequencesDirFmt.
  • GenesDirectoryFormat and ProteinsDirectoryFormat can contain files in per sample directories or not. So depending on the directory structure the output of genome_dict will be similar to sample_dict of MultiMAGSequencesDirFmt or feature_dict of MAGSequencesDirFmt.

@VinzentRisch
Copy link
Contributor Author

VinzentRisch commented Oct 4, 2024

Hi @misialq
In sample_dict of MultiFASTADirectoryFormat you do sample_id = d.name.rsplit('/', 1)[0]
The d.name seems to be enough why did you add the rsplit? What case does it cover?

Also in the same action there are these checks if the files match the pathspec:

                if not mags_pattern.match(path.name):
                    continue

So it checks if the files end in fasta or fa. But isn't this redundant because in the validation of the format only files with those extensions are allowed anyway? Or am I missing something?

I am asking because I copied that sample_dict function and modified it.

@misialq
Copy link
Collaborator

misialq commented Oct 7, 2024

Hey @VinzentRisch,

So for the d.name question I'm not entirely sure as, tbh, I copied this method from somewhere already 😅 When I look at it now I must agree with you: I cannot think of a use case for that split. That code looks very suggestive of the name attribute containing a full path rather than just the last piece of it, in which case we'd need to do a rsplit. If that's not the case, then I guess you can just use the name directly.

Re: matching the pathspec: that is also a good question - technically, it should not be required since, as you wrote, only those files are allowed. I guess that was more of a failsafe but you could probably skip it.

Copy link
Collaborator

@misialq misialq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @VinzentRisch - so how about being a bit more generous and adding those methods to the remaining directory formats here? Loci and Genomes could also make use of those and that's what was also described in the original issue. 🙏

q2_types/genome_data/_formats.py Outdated Show resolved Hide resolved
@VinzentRisch
Copy link
Contributor Author

Hi @misialq
I created a new format called GenomeDataDirectoryFormat that the four dir formats can inherit from.

@VinzentRisch VinzentRisch requested a review from misialq October 7, 2024 09:13
@VinzentRisch VinzentRisch changed the title ENH: Adds genome_dict function to GenesDirectoryFormat and ProteinsDirectoryFormat ENH: Adds new format GenomeDataDirectoryFormat with genome_dict function. Oct 7, 2024
Copy link
Collaborator

@misialq misialq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @VinzentRisch, looks good, thanks! 🚀

@lizgehret would you like to look this through?

Copy link
Member

@lizgehret lizgehret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks reasonable to me, thanks @VinzentRisch!

@lizgehret lizgehret merged commit 4b9509a into qiime2:dev Oct 8, 2024
4 checks passed
@lizgehret lizgehret changed the title ENH: Adds new format GenomeDataDirectoryFormat with genome_dict function. ENH: Adds new format GenomeDataDirectoryFormat with genome_dict function Oct 8, 2024
@lizgehret lizgehret self-assigned this Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

ENH: Add genome_dict functions to GenesDirectoryFormat and ProteinsDirectoryFormat.
3 participants