Compleasm: use symlink instead of copying busco data #6679

Merged: 2 commits, Jan 20, 2025
40 changes: 25 additions & 15 deletions tools/compleasm/compleasm.xml
@@ -11,17 +11,18 @@
</requirements>
<version_command>compleasm --version</version_command>
<command><![CDATA[

mkdir -p galaxy_db &&
cp -r '${busco_database.fields.path}/lineages/${lineage_dataset}/' 'galaxy_db/' &&
mkdir -p 'galaxy_db/' &&
ln -s '${busco_database.fields.path}/lineages/${lineage_dataset}/' 'galaxy_db/${lineage_dataset}' &&
## Create a compleasm-specific empty file to avoid redownloading the lineage data (https://github.com/huangnengCSU/compleasm/blob/0.2.6/compleasm.py#L165)
touch 'galaxy_db/${lineage_dataset}.done' &&
Contributor:

Does this empty file also exist in the DB folder? If so, I would prefer a symlink.

Either way, a short explanatory comment would be great.

Contributor Author:

It's just an empty file specific to compleasm that tells it not to redownload the lineage data. I'm going to add a comment.
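
The symlink-plus-marker approach discussed in this thread can be tried outside Galaxy. The following is a minimal sketch, not the tool's actual command: `db_path` stands in for `${busco_database.fields.path}`, and the `dataset.cfg` placeholder file is an assumption made only so the link has something to point at.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the data table path (assumption, not a real install).
workdir="$(mktemp -d)"
db_path="$workdir/busco_downloads"
lineage="entomoplasmatales_odb10"

# Fake lineage directory with one placeholder file.
mkdir -p "$db_path/lineages/$lineage"
echo "placeholder" > "$db_path/lineages/$lineage/dataset.cfg"

cd "$workdir"
mkdir -p galaxy_db

# Link the lineage into the job directory instead of copying it.
ln -s "$db_path/lineages/$lineage" "galaxy_db/$lineage"

# Empty marker file; per the thread, compleasm treats it as "already downloaded".
touch "galaxy_db/$lineage.done"

ls -l galaxy_db
```

The link costs no extra disk space in the job directory, while the `.done` marker is created fresh because, according to the reply above, it is specific to compleasm rather than part of the database folder.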


compleasm run
-a '$input'
-o galaxy_output
--mode $mode
-L 'galaxy_db'
-l '$lineage_dataset'
-t "\${GALAXY_SLOTS:-1}"
-t "\${GALAXY_SLOTS:-1}"

#if str($specified_contigs) != '':
--specified_contigs '$specified_contigs'
@@ -43,7 +44,7 @@
</sanitizer>
<validator type="regex">[0-9a-zA-Z_ ]+</validator>
</param>

<param name="outputs" type="select" multiple="true" label="Which outputs should be generated">
<option value="full_table_busco" selected="true">full busco table</option>
<option value="full_table">full table</option>
@@ -72,26 +73,32 @@
</data>
<data name='translated_protein' format='fasta' label="${tool.name} on ${on_string}: Translated protein" from_work_dir="galaxy_output/*_odb10/translated_protein.fasta">
<filter>outputs and 'translated_protein' in outputs</filter>
</data>
</data>
</outputs>

<tests>
<test expect_num_outputs="4">
<param name="input" value="small_genome.fasta"/>
<param name="mode" value="busco"/>
<param name="outputs" value="full_table_busco,full_table,miniprot,translated_protein"/>
<param name="busco_database" value="eukaryota_odb10"/>
<param name="lineage_dataset" value="eukaryota_odb10"/>
<param name="busco_database" value="entomoplasmatales_odb10"/>
<param name="lineage_dataset" value="entomoplasmatales_odb10"/>
<output name="full_table_busco">
<assert_contents>
<has_text text="Busco id"/>
<has_text text="Missing"/>
<has_text text="496at186328&#009;Missing"/>
<has_text text="165at186328&#009;Complete"/>
<has_text text="421at186328&#009;Complete"/>
<has_text text="90at186328&#009;Complete"/>
</assert_contents>
</output>
<output name="full_table">
<assert_contents>
<has_text text="Gene"/>
<has_text text="Missing"/>
<has_text text="Gene&#009;Status"/>
<has_text text="496at186328&#009;Missing"/>
<has_text text="165at186328&#009;Single"/>
<has_text text="421at186328&#009;Single"/>
<has_text text="90at186328&#009;Single"/>
</assert_contents>
</output>
<output name="miniprot">
@@ -101,19 +108,22 @@
</output>
<output name="translated_protein">
<assert_contents>
<has_text text="GGWLIGNGGAGGSGAAGVNGGAGGNGGAGGNGGAGG"/>
<has_text text="AAVFADRGAHVVLAVRNLEKGNAARARIMAARPGAHVTLQQLDLCSLDSVRAAADALRTAYPRIDVLINNAGVMW"/>
<has_text text="EKDFYQELGVSSDASPEEIKRAYRKLARDLHPDANPGNPAA"/>
<has_text text="AASITILEGLEAVRKRPGMYIGSTGERGLHHLIWEVVD"/>
</assert_contents>
</output>
<assert_stdout>
<has_text text="S:1.20%, 4"/>
</assert_stdout>
</test>
</tests>
<help><![CDATA[

compleasm_ assesses genome completeness based on genome assembly.
compleasm_ assesses genome completeness based on genome assembly.

.. _compleasm: https://github.com/huangnengCSU/compleasm

]]>
]]>
</help>
<expand macro="citation"/>
</tool>
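
The updated test assertions in this file match a BUSCO id and its status joined by a tab (`&#009;`) rather than matching each word separately. A rough shell equivalent of that kind of check is sketched below; the `has_text` helper is a hypothetical stand-in for Galaxy's assertion, and the table contents are a toy excerpt built from the ids in the test, not real compleasm output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy tab-separated table (assumption: only the first two columns matter here).
table="$(mktemp)"
printf 'Gene\tStatus\n496at186328\tMissing\n165at186328\tSingle\n' > "$table"

tab="$(printf '\t')"

# Hypothetical helper: succeeds if the fixed string $1 occurs in file $2.
has_text() {
    grep -qF -- "$1" "$2"
}

# Id and status must sit in adjacent tab-separated columns to match.
has_text "165at186328${tab}Single" "$table" && echo "found: 165at186328 Single"

# The same text with a space instead of a tab does not match.
has_text "165at186328 Single" "$table" || echo "not found with a space instead of a tab"
```

Tying the id to its status in one string makes the assertion stricter than checking for `Missing` or `Gene` alone, which would pass on almost any table.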
4 changes: 2 additions & 2 deletions tools/compleasm/macros.xml
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
<macros>
<token name="@TOOL_VERSION@">0.2.6</token>
<token name="@VERSION_SUFFIX@">1</token>
<token name="@VERSION_SUFFIX@">2</token>

<xml name="citation">
<citations>
<citation type="doi">10.1101/2023.06.03.543588</citation>
<citation type="doi">10.1101/2023.06.03.543588</citation>
</citations>
</xml>

4 changes: 2 additions & 2 deletions tools/compleasm/test-data/busco_database.loc
@@ -2,5 +2,5 @@
# - value
# - name
# - version
# - /path/to/data
eukaryota_odb10 eukaryota 5.4.6 ${__HERE__}/test-db/busco_downloads
Contributor:

May I ask why you updated the test data?

Contributor Author:

To keep the test-data directory as small as possible: entomoplasmatales_odb10 is a much smaller lineage than eukaryota_odb10.

Contributor:

True, but changing it will increase the size of the repo. The "problem" with git repos is that everything that has ever been committed stays in the history forever.

Contributor:

But changing it might still be a good idea if the runtime is reduced significantly.

Contributor Author:

The previous dataset was non-functional and not really used, so I see no other way than adding this new minimal one.

# - /path/to/data
entomoplasmatales_odb10 entomoplasmatales 5.4.6 ${__HERE__}/test-db/busco_downloads

This file was deleted.
