Support for GFF+FAA files? #10
Currently clinker uses BioPython for parsing files, which cannot yet parse GFF. In the future I may swap over to the parsing library I wrote for cblaster, which can handle either, but it would take a pretty big reworking, so it's not planned at the moment.
May I suggest the gffutils module for parsing GFF files? It's fairly straightforward and has worked great for me. http://daler.github.io/gffutils/ It appears they intend to integrate this into BioPython anyway:
Oh cool, I'll look into it. Thanks!
I've added an initial attempt at GFF3 parsing using gffutils in the E: Note that, as with GenBank files, GFF files with multiple regions are treated as gene clusters with multiple loci and will be drawn on the same line in the visualisation.
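To illustrate the "multiple regions become multiple loci" behaviour described above, here is a minimal standard-library sketch that groups CDS features by the seqid column of a GFF3 file. It is not clinker's parser and does not use gffutils; the function name `group_cds_by_region` and the example contig names are hypothetical.

```python
from collections import defaultdict


def group_cds_by_region(gff_text):
    """Group CDS features by GFF3 seqid (column 1); each seqid is one locus."""
    loci = defaultdict(list)
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no features follow it
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed nine-column feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = fields
        if ftype == "CDS":
            loci[seqid].append((int(start), int(end), strand))
    return dict(loci)


# Hypothetical two-contig annotation: both contigs end up in the same cluster,
# each as its own locus.
example = "\n".join([
    "##gff-version 3",
    "contig_1\tprokka\tCDS\t100\t400\t.\t+\t0\tID=cds1",
    "contig_1\tprokka\tCDS\t500\t900\t.\t-\t0\tID=cds2",
    "contig_2\tprokka\tCDS\t50\t350\t.\t+\t0\tID=cds3",
])
loci = group_cds_by_region(example)
```

Here `loci` has one key per region, so a file with two contigs yields a cluster with two loci.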
Gosh that was fast. Can't wait to try it!
It appears to have processed the GFF successfully at least: [21:23:18] INFO - Generating results summary...
I have the same issue as @marade: "ValueError: Distance matrix 'X' must be symmetric." I'm running clinker on a server with .gbk files, and this error happens every time I run it. It could be the formatting of the .gbk files, as in other people's issues, but I haven't identified the cause yet.
@kforcone: Could you open a new issue and upload the files causing you the error?
@marade Sorry for taking so long on this; I only got around to reworking it today. I was having issues with GFF/FASTA files of specific regions downloaded from NCBI with their graphic viewer: the start/end of features in those GFF files are relative to the entire parent scaffold, not the specific extracted region, so the parser now accounts for that too. Anyway, I've merged the GFF+FASTA parser into master now if you'd like to try it out and see if you have any issues. The distance matrix issue should also already be fixed by 4f4c53d.
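For anyone hitting the same coordinate mismatch, a rough sketch of the adjustment is below. It assumes the start of the extracted region on the parent scaffold is known, for example from a GFF3 `##sequence-region` pragma (whether a given NCBI download carries the region bounds there is an assumption). The function names and the example coordinates are hypothetical, and this is not the code merged into clinker.

```python
def parse_sequence_region(pragma):
    """Parse '##sequence-region <seqid> <start> <end>' into (seqid, start, end)."""
    _tag, seqid, start, end = pragma.split()
    return seqid, int(start), int(end)


def to_region_coords(start, end, region_start):
    """Shift 1-based, inclusive scaffold coordinates onto the extracted region."""
    offset = region_start - 1
    return start - offset, end - offset


# Hypothetical region covering scaffold positions 10001..20000.
seqid, region_start, region_end = parse_sequence_region(
    "##sequence-region scaffold_1 10001 20000"
)

# A feature annotated at 10101..10400 on the scaffold sits at 101..400
# in the extracted FASTA sequence.
local = to_region_coords(10101, 10400, region_start)
```

Because GFF3 coordinates are 1-based and inclusive, the offset is `region_start - 1`, so the first base of the region maps to position 1.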
Cool, I will try this out as soon as I can.
This is now added in clinker v0.0.10 so I'll close the issue - if you run into any bugs feel free to reopen it.
Getting back to this... It looks like it processes GFF gene+CDS features just fine, but it chokes on gene+tRNA, etc., with an error like this: ValueError: Found no CDS features in gnl|Prokka|blahpB1A1_76 [../trim-assemble2/blahpB1A1/prokka/blahpB1A1.gff] So it probably needs some logic to deal with situations where you don't get gene+CDS, because there can be many of those.
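One possible shape for that logic, as a sketch only: skip regions that contain no CDS features with a warning, and raise only when nothing usable remains in the whole file. The function `keep_regions_with_cds`, the dict-based feature records, and the contig names are all hypothetical and do not reflect clinker's internals.

```python
import logging

LOG = logging.getLogger("gff_sketch")


def keep_regions_with_cds(features_by_region):
    """Return only regions with at least one CDS; warn about the rest."""
    kept = {}
    for region, features in features_by_region.items():
        cds = [f for f in features if f["type"] == "CDS"]
        if not cds:
            # e.g. a contig annotated only with gene+tRNA or gene+rRNA pairs
            LOG.warning("No CDS features in %s; skipping region", region)
            continue
        kept[region] = cds
    if not kept:
        raise ValueError("Found no CDS features in any region")
    return kept


# Hypothetical annotation: contig_2 carries only a tRNA gene.
features = {
    "contig_1": [
        {"type": "gene", "start": 100, "end": 400},
        {"type": "CDS", "start": 100, "end": 400},
    ],
    "contig_2": [
        {"type": "gene", "start": 10, "end": 85},
        {"type": "tRNA", "start": 10, "end": 85},
    ],
}
usable = keep_regions_with_cds(features)
```

With this approach a tRNA-only contig is dropped from the cluster instead of aborting the run, while a file with no CDS features at all still fails loudly.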
Since GenBank format isn't particularly user-friendly, please consider adding support for alternate input using GFF + FAA files. Your work on this tool is much appreciated.