ete3: allow taxids in species tree input #163

Open
wants to merge 5 commits into
base: master
Changes from 2 commits
38 changes: 22 additions & 16 deletions tools/ete/ete_species_tree_generator.py
@@ -6,7 +6,7 @@
 
 parser = optparse.OptionParser()
 parser.add_option('-s', '--species', dest="input_species_filename",
-                  help='Species list in text format one species in each line')
+                  help='List of species names or taxids in text format one species in each line')
 parser.add_option('-d', '--database', dest="database", default=None,
                   help='ETE sqlite data base to use (default: ~/.etetoolkit/taxa.sqlite)')
 parser.add_option('-o', '--output', dest="output", help='output file name (default: stdout)')
@@ -19,26 +19,32 @@
     parser.error("-s option must be specified, Species list in text format one species in each line")
 
 ncbi = NCBITaxa(dbfile=options.database)
-with open(options.input_species_filename) as f:
-    species_name = [_.strip().replace('_', ' ') for _ in f.readlines()]
-
-name2taxid = ncbi.get_name_translator(species_name)
-
-taxid = [name2taxid[_][0] for _ in species_name]
-
-tree = ncbi.get_topology(taxid)
+# determine taxids and species names in the input file
+names = []
+taxids = []
+with open(options.input_species_filename) as f:
+    for species in f:
+        species = species.strip().replace('_', ' ')
+        try:
+            taxids.append(int(species))
+        except ValueError:
+            names.append(species)
+# translate all species names to taxids
+name2taxid = ncbi.get_name_translator(names)
+taxids += {name2taxid[n][0] for n in names}
+
+# get topology and set the scientific name as output
+tree = ncbi.get_topology(taxids)
+for isleaf, node in tree.iter_prepostorder():
+    node.name = node.sci_name
 
 if options.treebest == "yes":
-    inv_map = {str(v[0]): k.replace(" ", "") + "*" for k, v in name2taxid.items()}
-else:
-    inv_map = {str(v[0]): k for k, v in name2taxid.items()}
-
-
-for leaf in tree:
-    leaf.name = inv_map[leaf.name]
+    for leaf in tree:
+        leaf.name = leaf.name.replace(" ", "") + "*"
 
 newickTree = tree.write(format=int(options.format))
 
+# print(type(tree))
 if options.treebest == "yes":
     newickTree = newickTree.rstrip(';')
     newickTree = newickTree + "root;"
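The input handling in this file's diff can be tried without a taxonomy database. What follows is a minimal sketch under stated assumptions, not the PR's code: `split_taxids_and_names` and `merge_taxids` are hypothetical helper names, and a plain dict stands in for the result of `ncbi.get_name_translator`. Unlike the diff, the sketch skips blank lines, reports unknown names explicitly, and keeps input order when dropping duplicates (the set comprehension in the diff gives no ordering guarantee).

```python
def split_taxids_and_names(lines):
    """Split raw input lines into integer taxids and species names."""
    taxids, names = [], []
    for line in lines:
        # Same normalisation as the diff: trim, then turn '_' into ' '.
        entry = line.strip().replace("_", " ")
        if not entry:
            continue  # assumption: blank lines are ignored (the diff does not do this)
        try:
            taxids.append(int(entry))
        except ValueError:
            names.append(entry)
    return taxids, names


def merge_taxids(taxids, names, name2taxid):
    """Append the first taxid of each name, keeping order and dropping duplicates."""
    merged = list(taxids)
    for name in names:
        hits = name2taxid.get(name)
        if not hits:
            raise ValueError(f"unknown species name: {name!r}")
        merged.append(hits[0])
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys(merged))


if __name__ == "__main__":
    lines = ["Pan_troglodytes\n", "9606\n", "\n", "Sorex araneus\n"]
    taxids, names = split_taxids_and_names(lines)
    # Hypothetical stand-in for ncbi.get_name_translator(names).
    fake_translator = {"Pan troglodytes": [9598], "Sorex araneus": [42254]}
    print(merge_taxids(taxids, names, fake_translator))
```

Note that taxids given directly come first in the merged list, followed by translated names, which mirrors the order of operations in the diff.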
4 changes: 2 additions & 2 deletions tools/ete/ete_species_tree_generator.xml
@@ -1,4 +1,4 @@
-<tool id="ete_species_tree_generator" name="ETE species tree generator" version="@VERSION@">
+<tool id="ete_species_tree_generator" name="ETE species tree generator" version="@VERSION@+galaxy1">
     <description>from a list of species using the ETE Toolkit</description>
     <macros>
         <import>ete_macros.xml</import>
@@ -21,7 +21,7 @@ python '$__tool_directory__/ete_species_tree_generator.py'
 -t $output_format.treebest
     ]]></command>
     <inputs>
-        <param name="speciesFile" type="data" format="txt" label="Species file" help="List with one species per line" />
+        <param name="speciesFile" type="data" format="txt" label="Species file" help="List with one species name or taxid per line" />
         <param name="database" type="data" format="sqlite" label="(ETE3) Taxonomy Database" help="The sqlite formatted Taxonomy used by ETE3 (which is derived from NCBI taxonomy)" />
         <conditional name="output_format">
             <param name="treebest" type="select" label="Use in TreeBest" help="Select yes if specie tree to be used in TreeBest">
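The `treebest` option passed via `-t` changes two things in the script: leaf labels lose their spaces and gain a trailing `*`, and the final `;` of the Newick string is replaced with `root;`. The sketch below isolates just those string operations so they can be checked by hand; `treebest_label` and `treebest_newick` are hypothetical names introduced for illustration, and no ETE tree object is involved.

```python
def treebest_label(sci_name):
    """Format a leaf name as the script does when treebest == "yes"."""
    return sci_name.replace(" ", "") + "*"


def treebest_newick(newick):
    """Replace the trailing ';' of a Newick string with 'root;'."""
    return newick.rstrip(";") + "root;"


if __name__ == "__main__":
    leaves = [treebest_label(n) for n in ("Homo sapiens", "Pan troglodytes")]
    # Hand-built Newick string; in the script this comes from tree.write().
    print(treebest_newick("(" + ",".join(leaves) + ");"))
```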
2 changes: 1 addition & 1 deletion tools/ete/test-data/lineage-compress-lower.txt
@@ -29,7 +29,7 @@ Nomascus leucogenys Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Noma
 Pongo abelii Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pongo Pongo abelii
 Gorilla gorilla gorilla Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Gorilla Gorilla gorilla
 Pan troglodytes Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pan Pan troglodytes
-Homo sapiens Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
+9606 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
Member

I'm not sure it's good that this and the following output files are changed. Perhaps ete_lineage_generator.py should be modified to print the species name instead of the input taxid in the first column?

Contributor Author

I also noticed this. The output changes because the first column is just a copy of the input. I liked this because it gives a mapping between the original inputs and the ete outputs.

Should be easy to change if you like. But we can also document it better.

In most use cases people will input either taxids or species names.

Member

@anilthanki Any opinion on this?

Contributor Author

Hi @nsoranzo, can we interpret no answer as positive feedback?

Member

@anilthanki is back from holidays, ping?

Contributor Author

Hi @anilthanki: any chance to get feedback? If not, can we just have a decision on how to continue here?

Member

anilthanki, Jun 30, 2023

Hi @bernt-matthias, I did have a look before and tried to remember what it is supposed to do, but TBH it's been a long time since I worked on it, so I couldn't really comment.

I will try to review it again this weekend and comment by Monday.

Apologies for this.

 Sorex araneus Eukaryota Chordata Mammalia Laurasiatheria Soricidae Sorex Sorex araneus
 Erinaceus europaeus Eukaryota Chordata Mammalia Laurasiatheria Erinaceidae Erinaceus Erinaceus europaeus
 Pteropus vampyrus Eukaryota Chordata Mammalia Laurasiatheria Pteropodidae Pteropus Pteropus vampyrus
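The review thread on this file weighs two behaviours for the first output column: echoing the input as given (what the changed test data shows) or printing the scientific name. A rough sketch of how both could sit behind one switch follows. It is illustrative only: `first_column` and its `echo_input` flag are hypothetical and not part of ete_lineage_generator.py, and `taxid2name` is a plain dict standing in for the taxid-to-name mapping that ETE's `NCBITaxa.get_taxid_translator` returns.

```python
def first_column(entry, taxid2name, echo_input=True):
    """Return the value for the first output column of one input entry."""
    if echo_input:
        # Keeps a direct mapping between the original input and the output row.
        return entry
    try:
        taxid = int(entry)
    except ValueError:
        return entry  # the entry is already a species name
    # Fall back to the raw entry if the taxid is not in the mapping.
    return taxid2name.get(taxid, entry)


if __name__ == "__main__":
    names = {9606: "Homo sapiens"}  # hypothetical lookup result
    for entry in ("Pan troglodytes", "9606"):
        print(first_column(entry, names), "|", first_column(entry, names, echo_input=False))
```

With `echo_input=False` the test outputs would stay as they were before this PR, at the cost of losing the direct link back to the input line.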
2 changes: 1 addition & 1 deletion tools/ete/test-data/lineage-compress.txt
@@ -29,7 +29,7 @@ Nomascus leucogenys Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Noma
 Pongo abelii Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pongo Pongo abelii
 Gorilla gorilla gorilla Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Gorilla Gorilla gorilla
 Pan troglodytes Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pan Pan troglodytes
-Homo sapiens Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
+9606 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
 Sorex araneus Eukaryota Chordata Mammalia Laurasiatheria Soricidae Sorex Sorex araneus
 Erinaceus europaeus Eukaryota Chordata Mammalia Laurasiatheria Erinaceidae Erinaceus Erinaceus europaeus
 Pteropus vampyrus Eukaryota Chordata Mammalia Laurasiatheria Pteropodidae Pteropus Pteropus vampyrus
2 changes: 1 addition & 1 deletion tools/ete/test-data/lineage-full.txt
@@ -29,7 +29,7 @@ Nomascus leucogenys Metazoa Hylobatidae
 Pongo abelii Metazoa Hominidae
 Gorilla gorilla gorilla Metazoa Hominidae
 Pan troglodytes Metazoa Hominidae
-Homo sapiens Metazoa Hominidae
+9606 Metazoa Hominidae
 Sorex araneus Metazoa Soricidae
 Erinaceus europaeus Metazoa Erinaceidae
 Pteropus vampyrus Metazoa Pteropodidae
2 changes: 1 addition & 1 deletion tools/ete/test-data/lineage-wid.txt
@@ -29,7 +29,7 @@ Nomascus leucogenys 61853 Eukaryota Chordata Mammalia Euarchontoglires Hominoide
 Pongo abelii 9601 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pongo Pongo abelii
 Gorilla gorilla gorilla 9595 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Gorilla Gorilla gorilla
 Pan troglodytes 9598 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Pan Pan troglodytes
-Homo sapiens 9606 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
+9606 9606 Eukaryota Chordata Mammalia Euarchontoglires Hominoidea Homo Homo sapiens
 Sorex araneus 42254 Eukaryota Chordata Mammalia Laurasiatheria Soricidae Sorex Sorex araneus
 Erinaceus europaeus 9365 Eukaryota Chordata Mammalia Laurasiatheria Erinaceidae Erinaceus Erinaceus europaeus
 Pteropus vampyrus 132908 Eukaryota Chordata Mammalia Laurasiatheria Pteropodidae Pteropus Pteropus vampyrus
2 changes: 1 addition & 1 deletion tools/ete/test-data/lineage.txt
@@ -29,7 +29,7 @@ Nomascus leucogenys Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA
 Pongo abelii Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Euarchontoglires Primates Haplorrhini Simiiformes Catarrhini Hominoidea Hominidae Ponginae NA NA Pongo NA NA NA Pongo abelii NA NA NA
 Gorilla gorilla gorilla Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Euarchontoglires Primates Haplorrhini Simiiformes Catarrhini Hominoidea Hominidae Homininae NA NA Gorilla NA NA NA Gorilla gorilla Gorilla gorilla gorilla NA NA
 Pan troglodytes Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Euarchontoglires Primates Haplorrhini Simiiformes Catarrhini Hominoidea Hominidae Homininae NA NA Pan NA NA NA Pan troglodytes NA NA NA
-Homo sapiens Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Euarchontoglires Primates Haplorrhini Simiiformes Catarrhini Hominoidea Hominidae Homininae NA NA Homo NA NA NA Homo sapiens NA NA NA
+9606 Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Euarchontoglires Primates Haplorrhini Simiiformes Catarrhini Hominoidea Hominidae Homininae NA NA Homo NA NA NA Homo sapiens NA NA NA
 Sorex araneus Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Laurasiatheria Insectivora NA NA NA NA Soricidae Soricinae NA NA Sorex NA NA NA Sorex araneus NA NA NA
 Erinaceus europaeus Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Laurasiatheria Insectivora NA NA NA NA Erinaceidae Erinaceinae NA NA Erinaceus NA NA NA Erinaceus europaeus NA NA NA
 Pteropus vampyrus Eukaryota Metazoa NA NA Chordata Craniata NA Mammalia NA NA NA Laurasiatheria Chiroptera Megachiroptera NA NA NA Pteropodidae Pteropodinae NA NA Pteropus NA NA NA Pteropus vampyrus NA NA NA
2 changes: 1 addition & 1 deletion tools/ete/test-data/species.txt
@@ -28,7 +28,7 @@ Nomascus leucogenys
 Pongo abelii
 Gorilla gorilla gorilla
 Pan troglodytes
-Homo sapiens
+9606
 Sorex araneus
 Erinaceus europaeus
 Pteropus vampyrus