Support for GFF+FAA files? #10

Closed
marade opened this issue Nov 11, 2020 · 12 comments
Labels
enhancement New feature or request

Comments

@marade

marade commented Nov 11, 2020

Since GenBank format isn't particularly user-friendly, please consider adding support for alternate input using GFF + FAA files. Your work on this tool is much appreciated.

@gamcil
Owner

gamcil commented Nov 12, 2020

Currently clinker uses BioPython for parsing files, which does not yet have the ability to parse GFF. I may eventually swap over to the parsing library I wrote for cblaster, which can handle either format, but that would take a pretty big rework, so it's not planned at the moment.

@marade
Author

marade commented Nov 12, 2020

May I suggest the gffutils module for parsing GFF files? It's fairly straightforward and has worked great for me.

http://daler.github.io/gffutils/

It appears they intend to integrate this into BioPython anyway:

https://biopython.org/wiki/GFF_Parsing

@gamcil
Owner

gamcil commented Nov 12, 2020

Oh cool, I'll look into it. Thanks!

@gamcil gamcil added the enhancement New feature or request label Nov 12, 2020
@gamcil
Owner

gamcil commented Nov 12, 2020

I've added an initial attempt at GFF3 parsing using gffutils in the gff3 branch if you want to try that out. It looks for GFF files (extensions .gtf, .gff, .gff3) as well as GenBank files, and for each GFF file it looks for a corresponding FASTA file of the same name (extensions .fa, .fsa, .fna, .fasta, .faa).

Edit: Note that, as with GenBank files, GFF files with multiple regions are treated as gene clusters with multiple loci, which are drawn on the same line in the visualisation.
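
The file lookup described in this comment could be sketched roughly as follows. This is a minimal illustration using only the standard library, not clinker's actual code: the function name `find_fasta_for_gff` and the order in which extensions are tried are assumptions; only the extension lists come from the comment above.

```python
from pathlib import Path
from typing import Optional

# Extension lists as given in the comment above.
GFF_SUFFIXES = {".gtf", ".gff", ".gff3"}
FASTA_SUFFIXES = (".fa", ".fsa", ".fna", ".fasta", ".faa")


def find_fasta_for_gff(gff_path: Path) -> Optional[Path]:
    """Return the first FASTA file sharing the GFF file's base name, if any.

    Hypothetical helper: the search order is an assumption.
    """
    if gff_path.suffix.lower() not in GFF_SUFFIXES:
        raise ValueError(f"Not a recognised GFF extension: {gff_path.name}")
    for suffix in FASTA_SUFFIXES:
        # Swap the GFF extension for each FASTA extension in turn.
        candidate = gff_path.with_suffix(suffix)
        if candidate.exists():
            return candidate
    return None
```

For example, given `cluster.gff3`, this would return `cluster.fna` if that file sits in the same directory, and `None` if no FASTA file with a matching base name exists.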

@marade
Author

marade commented Nov 12, 2020

Gosh that was fast. Can't wait to try it!

@marade
Author

marade commented Nov 12, 2020

It appears to have processed the GFF successfully at least:

[21:23:18] INFO - Generating results summary...
[21:23:18] INFO - Writing alignments to output
[21:23:18] INFO - Building clustermap.js visualisation
[21:23:18] INFO - Writing to: plot
/usr/local/lib/python3.6/dist-packages/clinker/align.py:356: RuntimeWarning: invalid value encountered in true_divide
  matrix /= matrix.max()
Traceback (most recent call last):
  File "/usr/local/bin/clinker", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/clinker/main.py", line 153, in main
    hide_alignment_headers=args.hide_aln_headers,
  File "/usr/local/lib/python3.6/dist-packages/clinker/main.py", line 77, in clinker
    plot_clusters(globaligner, output=None if plot is True else plot)
  File "/usr/local/lib/python3.6/dist-packages/clinker/plot.py", line 114, in plot_clusters
    data = clusters.to_data()
  File "/usr/local/lib/python3.6/dist-packages/clinker/align.py", line 201, in to_data
    for i in self.order(i=i, method=method)
  File "/usr/local/lib/python3.6/dist-packages/clinker/align.py", line 371, in order
    linkage = hierarchy.linkage(squareform(matrix), method=method)
  File "/usr/local/lib/python3.6/dist-packages/scipy/spatial/distance.py", line 2184, in squareform
    is_valid_dm(X, throw=True, name='X')
  File "/usr/local/lib/python3.6/dist-packages/scipy/spatial/distance.py", line 2260, in is_valid_dm
    'symmetric.') % name)
ValueError: Distance matrix 'X' must be symmetric.

@kforcone

I have the same issue as @marade: "ValueError: Distance matrix 'X' must be symmetric." I'm running clinker on a server with .gbk files, and this error happens every time I run it. It could be the formatting of the .gbk files, as in other people's issues, but I haven't identified the cause yet.

@gamcil
Owner

gamcil commented Nov 17, 2020

@kforcone: Could you open a new issue and upload the files causing you the error?

@gamcil
Owner

gamcil commented Dec 10, 2020

@marade Sorry for taking so long on this; I only got around to reworking it today. I was having issues with GFF/FASTA files of specific regions downloaded from NCBI with their graphic viewer: the start/end of features in those GFF files are relative to the entire parent scaffold, not the specific extracted region. The parser now accounts for that too.
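
The coordinate problem described here can be illustrated with a small sketch. It is not the parser's actual implementation: the `Feature` type and `rebase_features` function are hypothetical, and the sketch assumes the region's start on the parent scaffold is already known (for instance from a GFF3 `##sequence-region` line, if the file has one).

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical minimal feature record; GFF coordinates are 1-based, inclusive."""
    seqid: str
    kind: str
    start: int
    end: int
    strand: str


def rebase_features(features: Iterable[Feature], region_start: int) -> List[Feature]:
    """Shift scaffold-relative coordinates so the extracted region starts at 1."""
    offset = region_start - 1
    rebased = []
    for feature in features:
        if feature.start < region_start:
            # Assumption: features beginning before the region are rejected
            # rather than clipped.
            raise ValueError(f"{feature.kind} at {feature.start} starts before the region")
        rebased.append(
            feature._replace(start=feature.start - offset, end=feature.end - offset)
        )
    return rebased
```

With a region starting at scaffold position 10001, a CDS annotated at 10001..10500 becomes 1..500, which matches the extracted FASTA sequence.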

Anyway, I've merged the GFF+FASTA parser into master now if you'd like to try it out and see if you have any issues.

The issue about the distance matrix should also have been fixed already by 4f4c53d.
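
The thread does not show what that commit changed. As one possible reading of the traceback only: the RuntimeWarning at `matrix /= matrix.max()` is what NumPy emits for operations such as 0/0, which yield NaN, and because NaN never compares equal to itself, a matrix containing NaN fails an equality-based symmetry check. The pure-Python sketch below illustrates that reasoning; `normalise` and `is_symmetric` are hypothetical names, and the zero-maximum guard is an assumption, not the actual fix. (Plain Python floats raise ZeroDivisionError on 0.0/0.0 instead of returning NaN, so NaN is constructed directly here.)

```python
from typing import List

Matrix = List[List[float]]


def is_symmetric(matrix: Matrix) -> bool:
    """Equality-based symmetry check; any NaN entry makes it fail."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def normalise(matrix: Matrix) -> Matrix:
    """Scale entries by the maximum, leaving an all-zero matrix unchanged."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # Guard against dividing zero by zero (assumed remedy).
        return [row[:] for row in matrix]
    return [[value / peak for value in row] for row in matrix]


nan = float("nan")
print(is_symmetric([[0.0, 1.0], [1.0, 0.0]]))  # True
print(is_symmetric([[nan, nan], [nan, nan]]))  # False
```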

@marade
Author

marade commented Dec 10, 2020

Cool, I will try this out as soon as I can.

@gamcil
Owner

gamcil commented Dec 14, 2020

This is now added in clinker v0.0.10 so I'll close the issue - if you run into any bugs feel free to reopen it.

@gamcil gamcil closed this as completed Dec 14, 2020
@marade
Author

marade commented Dec 28, 2020

Getting back to this... It looks like it processes GFF gene+CDS features just fine, but it chokes on gene+tRNA, etc., with an error like this:

ValueError: Found no CDS features in gnl|Prokka|blahpB1A1_76 [../trim-assemble2/blahpB1A1/prokka/blahpB1A1.gff]

So it probably needs some logic to deal with situations where you don't get gene+CDS, because there can be many of those.
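
One way such logic might look, as a standard-library sketch rather than a patch for clinker: collect CDS features per contig and simply leave out contigs that have none (for example, contigs carrying only tRNA genes) instead of raising an error. The function name `cds_by_contig` is hypothetical, and the sketch assumes a tab-separated, nine-column GFF3 file that may end with an embedded `##FASTA` section, as Prokka output does.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CdsList = List[Tuple[int, int, str]]  # (start, end, strand)


def cds_by_contig(gff_lines: Iterable[str]) -> Dict[str, CdsList]:
    """Group CDS coordinates by contig; contigs without any CDS are omitted."""
    contigs: Dict[str, CdsList] = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # Sequence data follows; no more annotation lines.
        if not line.strip() or line.startswith("#"):
            continue  # Skip blank lines, comments and directives.
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # Not a well-formed feature line.
        seqid, _source, kind, start, end, _score, strand = columns[:7]
        if kind == "CDS":
            contigs[seqid].append((int(start), int(end), strand))
    return dict(contigs)
```

A caller could then warn about, rather than fail on, any contig that is present in the file but missing from the returned dictionary.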
