feat: update to v0.7.7 - BiG-FAM fixes & notebooks (#291)
* fix: correct bigslice annotation rule
* chore: correct log file
* fix: remove gggenomes from main r_notebook env
* feat: experimental env for extracting mmseqs2 with gggenomes
* fix: handle numpy and matplotlib via conda
* notebook: add bigfam query visualization
* fix: correct html display for bigfam query
* fix: auto detect antismash major version and its env
* feat: use custom bigslice with no normalize option
* feat: change bigslice threshold
* fix: handle empty known resistance table
* notebook: add instruction to run bigslice clustering result
* notebook: improve display for mash and seqfu
* notebook: better instruction to start bigslice server
* notebook: correct fastani distances
* notebook: add barplot
* notebook: add sunburst
* notebook: tidy sections
* notebook: switch table and figure location
* fix: move deeptf roary output to processed folder
* notebooks: change conda env for notebooks
* fix: deeptf roary input
* notebook: adjust label size and rotation
* test: add dry run for other subworkflows
* test: omit build database for now
* docs: add example report demo
* docs: add wrapper icon
* chore: bump version to v0.7.7
* fix: pin python version for duckdb
* fix: pin python 3.11 for duckdb
* fix: add unzip for dbt
* docs: simplify install guide
* fix: pin duckdb to 0.8.1 from conda
matinnuhamunada committed Oct 19, 2023
1 parent ee2e265 commit 5c4941d
Showing 20 changed files with 794 additions and 72 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/push.yml
@@ -46,6 +46,42 @@ jobs:
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

dry-run-report:
runs-on: ubuntu-latest
needs:
- linting
- formatting
steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive
- name: Dry-run workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .tests
snakefile: workflow/Report
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

dry-run-metabase:
runs-on: ubuntu-latest
needs:
- linting
- formatting
steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive
- name: Dry-run workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .tests
snakefile: workflow/Metabase
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

unit-test:
runs-on: ubuntu-latest
needs:
19 changes: 11 additions & 8 deletions README.md
@@ -2,6 +2,8 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥7.14.0-brightgreen.svg)](https://snakemake.bitbucket.io)
[![PEP compatible](https://pepkit.github.io/img/PEP-compatible-green.svg)](https://pep.databio.org)
[![wiki](https://img.shields.io/badge/wiki-documentation-forestgreen)](https://github.com/NBChub/bgcflow/wiki)
[![bgcflow-wrapper](https://img.shields.io/badge/CLI-BGCFlow_wrapper-orange)](https://github.com/NBChub/bgcflow_wrapper)
[![example report](https://img.shields.io/badge/demo-report-blue)](https://nbchub.github.io/bgcflow_report_demo/)

`BGCFlow` is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

@@ -15,13 +17,14 @@ At present, `BGCFlow` is only tested and confirmed to work on **Linux** systems

Please use the latest version of `BGCFlow` available.
## Quick Start
A quick and easy way to use `BGCFlow` using [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper).
A quick and easy way to use `BGCFlow` using the command line interface wrapper:
[![bgcflow-wrapper](https://img.shields.io/badge/CLI-BGCFlow_wrapper-orange)](https://github.com/NBChub/bgcflow_wrapper)

1. Create a conda environment and install the [`BGCFlow` python wrapper](https://github.com/NBChub/bgcflow_wrapper) :

```bash
# create and activate a new conda environment
conda create -n bgcflow -c conda-forge python=3.11 pip openjdk -y
conda create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
conda activate bgcflow

# install `BGCFlow` wrapper
@@ -37,10 +40,6 @@ With the environment activated, install or setup this configurations:
```bash
conda config --set channel_priority disabled
conda config --describe channel_priority
```
- Java (required to run `metabase`)
```bash
conda install openjdk
```

3. Deploy and run BGCFlow, change `your_bgcflow_directory` variable accordingly:
@@ -52,7 +51,10 @@ bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
```

4. Build and serve interactive report (after `bgcflow run` finished). The report will be served in [http://localhost:8001/](http://localhost:8001/):
4. Build and serve interactive report (after `bgcflow run` finished). The report will be served in [http://localhost:8001/](http://localhost:8001/). A demo of the report is available here:

[![example report](https://img.shields.io/badge/demo-report-blue)](https://nbchub.github.io/bgcflow_report_demo/)

```bash
# build a report
bgcflow build report
@@ -65,7 +67,8 @@ bgcflow serve --project Lactobacillus_delbrueckii
```


- For detailed usage and configurations, have a look at the [`BGCFlow` WIKI](https://github.com/NBChub/bgcflow/wiki/) (`under development`) :warning:
- For detailed usage and configurations, have a look at the WIKI:
[![wiki](https://img.shields.io/badge/wiki-documentation-forestgreen)](https://github.com/NBChub/bgcflow/wiki)
- Read more about [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper) for a detailed overview of the command line interface.

[![asciicast](https://asciinema.org/a/595149.svg)](https://asciinema.org/a/595149)
6 changes: 5 additions & 1 deletion workflow/bgcflow/bgcflow/data/arts_extract_all.py
@@ -430,6 +430,7 @@ def extract_arts_knownhits(arts_knownhits_tsv, outfile_hits, genome_id=None):
pandas.DataFrame
A DataFrame representation of the extracted Known Resistance Hits information,
with each row representing a hit and columns representing hit attributes.
If the input DataFrame is empty, an empty DataFrame is returned.
Example
-------
@@ -465,7 +466,6 @@ def extract_arts_knownhits(arts_knownhits_tsv, outfile_hits, genome_id=None):
] = f"{str(genome_id)}__{str(model)}__{str(scaffold)}__{str(sequence_id)}"
arts_knownhits = (
arts_knownhits.drop(columns=["Sequence description"])
.set_index("index")
.rename(
columns={
"#Model": "model",
@@ -474,6 +474,10 @@
}
)
)
if arts_knownhits.empty:
logging.warning("ARTS Known Resistance Hits table is empty.")
else:
arts_knownhits = arts_knownhits.set_index("index")
arts_knownhits_dict = arts_knownhits.T.to_dict()
outfile_hits.parent.mkdir(parents=True, exist_ok=True)
logging.info(f"Writing Known Resistance Hits JSON to: {outfile_hits}")
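The hunk above corresponds to the commit message's "fix: handle empty known resistance table": `set_index("index")` is deferred until after an emptiness check, so an ARTS run with no Known Resistance Hits produces an empty JSON instead of failing. A minimal sketch of that guard pattern (the helper name `index_knownhits` is hypothetical, for illustration only):

```python
import logging

import pandas as pd


def index_knownhits(arts_knownhits: pd.DataFrame) -> dict:
    """Sketch of the empty-table guard added in this commit:
    only call set_index when the table has rows, so an empty
    ARTS result yields an empty dict rather than an error."""
    if arts_knownhits.empty:
        logging.warning("ARTS Known Resistance Hits table is empty.")
    else:
        arts_knownhits = arts_knownhits.set_index("index")
    # Transpose so each hit becomes one record keyed by its index label
    return arts_knownhits.T.to_dict()
```

With this guard, downstream JSON serialization works uniformly for both empty and populated tables.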
24 changes: 18 additions & 6 deletions workflow/bgcflow/bgcflow/data/get_dependencies.py
@@ -1,24 +1,36 @@
import json
import logging
import sys

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)

import yaml

# list of the main dependecies used in the workflow
dependencies = {
"antismash": r"workflow/envs/antismash.yaml",
"bigslice": r"workflow/envs/bigslice.yaml",
"cblaster": r"workflow/envs/cblaster.yaml",
"prokka": r"workflow/envs/prokka.yaml",
"mlst": r"workflow/envs/mlst.yaml",
"eggnog-mapper": r"workflow/envs/eggnog.yaml",
"roary": r"workflow/envs/roary.yaml",
"refseq_masher": r"workflow/envs/refseq_masher.yaml",
"seqfu": r"workflow/envs/seqfu.yaml",
"checkm": r"workflow/envs/checkm.yaml",
"gtdbtk": r"workflow/envs/gtdbtk.yaml",
}


def get_dependency_version(dep, dep_key):
def get_dependency_version(dep, dep_key, antismash_version="7"):
"""
return dependency version tags given a dictionary (dep) and its key (dep_key)
"""
if dep_key == "antismash":
logging.info(f"AntiSMASH version is: {antismash_version}")
if antismash_version == "6":
dep[dep_key] = "workflow/envs/antismash_v6.yaml"
logging.info(f"Getting software version for: {dep_key}")
with open(dep[dep_key]) as file:
result = []
documents = yaml.full_load(file)
@@ -38,14 +50,14 @@ def get_dependency_version(dep, dep_key):
return str(result)


def write_dependecies_to_json(outfile, dep=dependencies):
def write_dependecies_to_json(outfile, antismash_version, dep=dependencies):
"""
write dependency version to a json file
"""
with open(outfile, "w") as file:
dv = {}
for ky in dep.keys():
vr = get_dependency_version(dep, ky)
vr = get_dependency_version(dep, ky, antismash_version=antismash_version)
dv[ky] = vr
json.dump(
dv,
@@ -57,4 +69,4 @@ def write_dependecies_to_json(outfile, dep=dependencies):


if __name__ == "__main__":
write_dependecies_to_json(sys.argv[1])
write_dependecies_to_json(sys.argv[1], sys.argv[2])
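This change implements "fix: auto detect antismash major version and its env" from the commit message: the antiSMASH major version is now threaded through as a second CLI argument and switches which conda env file is read for the version pin. The selection logic can be sketched as follows (the function name `select_antismash_env` is hypothetical; the env paths mirror the diff above):

```python
def select_antismash_env(antismash_version: str = "7") -> str:
    """Sketch of the env auto-selection: antiSMASH major version 6
    gets its own pinned env file, everything else uses the default."""
    envs = {
        "6": "workflow/envs/antismash_v6.yaml",
        "7": "workflow/envs/antismash.yaml",
    }
    # Fall back to the default env for unrecognised majors,
    # matching the diff's behaviour of only special-casing "6"
    return envs.get(antismash_version, envs["7"])
```

The selected YAML is then parsed (as in `get_dependency_version` above) to report the pinned software version in the dependency metadata.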
11 changes: 8 additions & 3 deletions workflow/envs/bgcflow_notes.yaml
@@ -4,6 +4,7 @@ channels:
- bioconda
- defaults
dependencies:
- python>=3.9,<3.12
- jupyterlab>3
- ipywidgets>=7.6
- jupyter-dash
@@ -13,13 +14,17 @@ dependencies:
- python-kaleido
- bokeh
- openpyxl
- nbconvert
- nbconvert==6.4.2
- ipykernel
- ete3
- xlrd >= 1.0.0
- altair
- itables
- scikit-learn
- pyarrow
- pygraphviz
- pip
- pip:
- biopython
- jupyterlab-dash
- networkx
- alive_progress
- ../bgcflow
2 changes: 1 addition & 1 deletion workflow/envs/bigslice.yaml
@@ -41,4 +41,4 @@ dependencies:
- scipy==1.9.3
- six==1.16.0
- threadpoolctl==3.1.0
- git+https://github.com/medema-group/bigslice.git@103d8f2
- git+https://github.com/NBChub/bigslice.git@c0085de
6 changes: 5 additions & 1 deletion workflow/envs/dbt-duckdb.yaml
@@ -1,9 +1,13 @@
name: dbt-duckdb
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python==3.11
- python-duckdb==0.8.1
- unzip
- pip
- pip:
- dbt-duckdb==1.6.0
- dbt-metabase==0.9.15
- dbt-metabase==0.9.15
56 changes: 54 additions & 2 deletions workflow/notebook/bigslice.py.ipynb
@@ -21,6 +21,8 @@
"source": [
"import pandas as pd\n",
"from pathlib import Path\n",
"import shutil, json\n",
"from IPython.display import display, Markdown\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
@@ -33,7 +35,57 @@
"metadata": {},
"outputs": [],
"source": [
"report_dir = Path(\"../\")"
"report_dir = Path(\"../\")\n",
"bgcflow_dir = report_dir / (\"../../../\")\n",
"envs = bgcflow_dir / \"workflow/envs/bigslice.yaml\"\n",
"\n",
"metadata = report_dir / \"metadata/dependency_versions.json\"\n",
"with open(metadata, \"r\") as f:\n",
" dependency_version = json.load(f)\n",
"\n",
"# Define the destination path\n",
"destination_path = Path(\"assets/envs/bigslice.yaml\")\n",
"\n",
"# Ensure the destination directory exists\n",
"destination_path.parent.mkdir(parents=True, exist_ok=True)\n",
"\n",
"# Copy the file\n",
"shutil.copy(envs, destination_path);"
]
},
{
"cell_type": "markdown",
"id": "43ca282f",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"You can start the BiG-SLICE flask app to view the clustering result.\n",
"\n",
"Steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed378d7c-7d1a-4715-a06d-8850c1344387",
"metadata": {},
"outputs": [],
"source": [
"run_bigslice=f\"\"\"- Install the conda environment:\n",
"\n",
"```bash\n",
" conda install -f {report_dir.resolve()}/docs/assets/envs/bigslice.yaml\n",
"```\n",
"\n",
"- Run the app\n",
"\n",
"```bash\n",
" port='5001'\n",
" conda run -n bigslice bash {report_dir.resolve()}/cluster_as_{dependency_version[\"antismash\"]}/start_server.sh $port\n",
"```\n",
"\"\"\"\n",
"display(Markdown(run_bigslice))"
]
},
{
@@ -69,7 +121,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
"version": "3.9.18"
}
},
"nbformat": 4,
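The notebook cells above ("notebook: better instruction to start bigslice server") read the recorded antiSMASH version from `metadata/dependency_versions.json` and render shell instructions for launching the BiG-SLICE server. The core of that cell can be sketched as a plain function (the name `bigslice_server_command` and its exact return format are illustrative, not part of the notebook):

```python
import json
from pathlib import Path


def bigslice_server_command(metadata_file: Path, report_dir: Path,
                            port: str = "5001") -> str:
    """Sketch of the notebook cell: look up the antiSMASH version
    recorded in dependency_versions.json and format the command
    that starts the BiG-SLICE server for that clustering result."""
    with open(metadata_file) as f:
        dependency_version = json.load(f)
    # The clustering output directory is named after the antiSMASH version
    cluster_dir = report_dir / f"cluster_as_{dependency_version['antismash']}"
    return f"conda run -n bigslice bash {cluster_dir}/start_server.sh {port}"
```

In the notebook this string is wrapped in a Markdown code fence and rendered with `IPython.display.Markdown`, so users can copy-paste the command directly from the report.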