add busco tool #754

Merged · 3 commits · Aug 27, 2024
31 changes: 31 additions & 0 deletions .github/workflows/build-push-busco-image.yml
@@ -0,0 +1,31 @@
name: Build & Push BUSCO Image to GHCR

on:
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - ready_for_review
    paths:
      - 'src/loaders/compute_tools/busco/versions.yaml'
      - '.github/workflows/build-push-busco-image.yml'
      - '.github/workflows/build-push-tool-images.yml'

  push:
    branches:
      - main
      - master
      - develop
    paths:
      - 'src/loaders/compute_tools/busco/versions.yaml'
      - '.github/workflows/build-push-busco-image.yml'
      - '.github/workflows/build-push-tool-images.yml'

jobs:
  trigger-build-push:
    uses: ./.github/workflows/build-push-tool-images.yml
    with:
      tool_name: busco
      version_file: 'src/loaders/compute_tools/busco/versions.yaml'
    secrets: inherit
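For context, the `with:` block above implies that the reusable workflow it calls declares a matching `workflow_call` interface. A minimal sketch of what build-push-tool-images.yml presumably exposes — the input names are taken from the caller above, but the `required`/`type` details are assumptions; the authoritative definition lives in that file and may declare more inputs:

# Assumed workflow_call interface of build-push-tool-images.yml
on:
  workflow_call:
    inputs:
      tool_name:
        required: true
        type: string
      version_file:
        required: true
        type: string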
2 changes: 1 addition & 1 deletion RELEASE_NOTES.md
@@ -2,7 +2,7 @@

## 0.1.3

* Added BBMap tool to the CDM pipeline.
* Added BBMap and BUSCO tools to the CDM pipeline.
* Included metadata file generation after each tool's execution.
* Updated Python library dependencies to the latest versions.
* Standardized thread management logic across all tools.
33 changes: 33 additions & 0 deletions src/loaders/compute_tools/busco/Dockerfile
@@ -0,0 +1,33 @@
FROM continuumio/miniconda3:24.5.0-0

# NOTE: If the tool version changes, update the metadata saved by the _run_busco_single function in busco.py to match
ARG BUSCO_VER=5.7.1
ENV CONDA_ENV busco-$BUSCO_VER

# Add Bioconda and Conda-Forge channels
RUN conda config --add channels bioconda
RUN conda config --add channels conda-forge

# Install BUSCO
# Certain dependencies (e.g., dendropy, sepp) are only compatible with Python versions up to 3.9.
ARG PYTHON_VER=3.9
RUN conda create -n $CONDA_ENV python=$PYTHON_VER
RUN conda install -n $CONDA_ENV pandas=2.2.2 jsonlines=4.0.0 mamba=1.5.8 pyyaml=6.0.1
RUN conda run -n $CONDA_ENV mamba install -c bioconda -c conda-forge -y busco=$BUSCO_VER
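# NOTE: busco is installed with mamba (set up in the previous step), presumably
# because mamba resolves BUSCO's large bioconda dependency tree much faster than
# the default conda solver.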

# Activate the environment
RUN echo "source activate $CONDA_ENV" >> ~/.bashrc

# Set up directories
RUN mkdir -p /app
COPY ./ /app/collections
RUN rm -r /app/collections/.git

ENV PYTHONPATH /app/collections
WORKDIR /app

ENV PY_SCRIPT=/app/collections/src/loaders/compute_tools/busco/busco.py

RUN chmod -R 777 /app/collections

ENTRYPOINT ["/app/collections/src/loaders/compute_tools/entrypoint.sh"]
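Since the Dockerfile copies the entire repository (COPY ./ /app/collections), a local build must run from the repository root. A hypothetical build command — the image tag is illustrative; CI pushes to ghcr.io/kbase/collections instead:

# Build from the repo root so the full repo is available as build context
docker build \
    -f src/loaders/compute_tools/busco/Dockerfile \
    --build-arg BUSCO_VER=5.7.1 \
    -t collections-busco:5.7.1 \
    .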
81 changes: 81 additions & 0 deletions src/loaders/compute_tools/busco/busco.py
@@ -0,0 +1,81 @@
"""
Run BUSCO tool on a set of fna files.

Unlike the collection tools, this tool is intended for CDM work, so the parser
program is not compatible with the data it generates.

"""
import os
import time
from pathlib import Path

from src.loaders.common.loader_common_names import TOOL_METADATA
from src.loaders.compute_tools.tool_common import ToolRunner, run_command, create_tool_metadata
from src.loaders.compute_tools.tool_version import extract_latest_reference_db_version


def _run_busco_single(
        tool_safe_data_id: str,
        data_id: str,
        source_file: Path,
        output_dir: Path,
        threads_per_tool_run: int,
        debug: bool) -> None:
    start = time.time()
    print(f'Start executing BUSCO for {data_id}')

    metadata_file = output_dir / TOOL_METADATA
    if metadata_file.exists():
        print(f"Skipping {source_file} as it has already been processed.")
        return

    current_dir = os.path.dirname(os.path.abspath(__file__))
    version_file = os.path.join(current_dir, 'versions.yaml')
    ref_db_version = extract_latest_reference_db_version(version_file)

    command = [
        'busco',
        '-i', str(source_file),
        '-o', data_id,
        '--out_path', str(output_dir),
        '--datasets_version', ref_db_version,
        '--download_path', '/reference_data',
        '-c', str(threads_per_tool_run),
        '--auto-lineage-prok',
        '-m', 'genome',
        '-f',
        '--augustus',
    ]

    run_command(command, output_dir if debug else None)

    end_time = time.time()
    run_time = end_time - start
    print(
        f'Used {round(run_time / 60, 2)} minutes to execute BUSCO for {data_id}')

    # Save run info to a metadata file in the output directory for parsing later
    additional_metadata = {
        'source_file': str(source_file),
        'data_id': data_id,
        "reference_db": {
            "version": "odb10",
        },
    }
    create_tool_metadata(
        output_dir,
        tool_name="busco",
        version="5.7.1",
        command=command,
        run_time=round(run_time, 2),
        batch_size=1,
        additional_metadata=additional_metadata)


def main():
    runner = ToolRunner("busco")
    runner.parallel_single_execution(_run_busco_single, unzip=True)


if __name__ == "__main__":
    main()
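For downstream parsing, the metadata file written by create_tool_metadata presumably looks roughly like the following. The field names mirror the arguments passed above, but the exact schema is defined in tool_common.py, so treat this as an assumed shape with illustrative values, not the authoritative format:

# Assumed shape of the TOOL_METADATA file for one genome; the data id and
# run time are illustrative examples only.
expected_metadata = {
    "tool_name": "busco",
    "version": "5.7.1",
    "command": ["busco", "-i", "GB_GCA_000008085.1.fna", "-o", "GB_GCA_000008085.1"],  # truncated
    "run_time": 93.42,  # seconds, rounded to 2 decimals
    "batch_size": 1,
    "additional_metadata": {
        "source_file": "GB_GCA_000008085.1.fna",
        "data_id": "GB_GCA_000008085.1",
        "reference_db": {"version": "odb10"},
    },
}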
12 changes: 12 additions & 0 deletions src/loaders/compute_tools/busco/versions.yaml
@@ -0,0 +1,12 @@
# Unlike the collection tools, this tool is intended for CDM work, so the parser
# program is not compatible with the data it generates.

versions:
  - version: 0.1.0
    date: 2024-08-22
    notes: |
      - initial BUSCO implementation
    reference_db_version: odb10

# Please keep this reminder at the end of this file
# NOTE: If the reference db version changes, update the metadata saved by the
# _run_busco_single function in busco.py to match
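busco.py obtains the reference db version from this file via extract_latest_reference_db_version. A minimal sketch of the behavior that function presumably implements — the real implementation lives in src/loaders/compute_tools/tool_version.py and may select the latest entry differently:

import yaml

def extract_latest_reference_db_version(version_file: str) -> str:
    # Load versions.yaml and return the reference_db_version of the most
    # recent entry; assumes entries are appended in release order.
    with open(version_file) as f:
        versions = yaml.safe_load(f)["versions"]
    return versions[-1]["reference_db_version"]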
11 changes: 8 additions & 3 deletions src/loaders/jobs/taskfarmer/task_generator.py
@@ -38,10 +38,15 @@
  --force               Force overwrite of existing job directory
  --source_file_ext SOURCE_FILE_EXT
                        Select files from source data directory that match the given extension.


TODO: NERSC's recommended approach for running I/O-intensive tools (most of ours) is to use the scratch
directory. Before executing a task, copy source data and reference libraries to scratch, creating soft
links (e.g., for collection sources) as needed. Once the task completes, copy the results back to the
user's directory. For more information, see the NERSC documentation:
https://docs.nersc.gov/filesystems/perlmutter-scratch/
'''
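A hypothetical shell sketch of the staging pattern described in the TODO above. $SCRATCH is provided by NERSC and $SLURM_JOB_ID by Slurm; SOURCE_DATA_DIR, REF_DATA_DIR, COLLECTION_SOURCE_DIR, and USER_OUTPUT_DIR are illustrative placeholders, not variables defined in this repo:

STAGE_DIR="$SCRATCH/busco_run_$SLURM_JOB_ID"
mkdir -p "$STAGE_DIR"
cp -r "$SOURCE_DATA_DIR" "$STAGE_DIR/source"        # stage inputs on scratch
cp -r "$REF_DATA_DIR" "$STAGE_DIR/reference_data"   # stage reference libraries
ln -s "$COLLECTION_SOURCE_DIR" "$STAGE_DIR/collections_source"  # soft links as needed
# ... run the tool against the staged copies ...
cp -r "$STAGE_DIR/results" "$USER_OUTPUT_DIR"       # copy results back to the user's directory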

TOOLS_AVAILABLE = ['gtdb_tk', 'checkm2', 'microtrait', 'mash', 'eggnog', 'bbmap']
TOOLS_AVAILABLE = ['gtdb_tk', 'checkm2', 'microtrait', 'mash', 'eggnog', 'bbmap', 'busco']

NODE_TIME_LIMIT_DEFAULT = 5 # hours
# Used as THREADS variable in the batch script which controls the number of parallel tasks per node
@@ -62,10 +67,10 @@
# if no specific metadata is provided for a tool, the default values are used.
TASK_META = {'gtdb_tk': {'chunk_size': 1000, 'exe_time': 65, 'tasks_per_node': 4, 'threads_per_tool_run': 32},
             'eggnog': {'chunk_size': 100, 'exe_time': 15, 'node_time_limit': 0.5},  # Memory intensive tool - reserve more nodes with less node reservation time
             'busco': {'chunk_size': 50, 'exe_time': 90, 'node_time_limit': 1.5},  # 1.5 minutes per genome with a single task per node on the user's drive. TODO: Aim to test multi-threading per node along with scratch execution, and adjust `tasks_per_node` accordingly.
Review comment (Member): Wow this is expensive. For 10k genomes that's 250 hours, so scaling this to GTDB sized genome counts seems infeasible. Hopefully your TODOs pan out.

             'default': {'chunk_size': 5000, 'exe_time': 60}}
MAX_NODE_NUM = 100 # maximum number of nodes to use


REGISTRY = 'ghcr.io/kbase/collections'
VERSION_FILE = 'versions.yaml'
COMPUTE_TOOLS_DIR = '../../compute_tools' # relative to task_generator.py