feat: update to v0.7.7 - BiG-FAM fixes & notebooks (#291)
* fix: correct bigslice annotation rule
* chore: correct log file
* fix: remove gggenomes from main r_notebook env
* feat: experimental env for extracting mmseqs2 with gggenomes
* fix: handle numpy and matplotlib via conda
* notebook: add bigfam query visualization
* fix: correct html display for bigfam query
* fix: auto detect antismash major version and its env
* feat: use custom bigslice with no normalize option
* feat: change bigslice threshold
* fix: handle empty known resistance table
* notebook: add instruction to run bigslice clustering result
* notebook: improve display for mash and seqfu
* notebook: better instruction to start bigslice server
* notebook: correct fastani distances
* notebook: add barplot
* notebook: add sunburst
* notebook: tidy sections
* notebook: switch table and figure location
* fix: move deeptf roary output to processed folder
* notebooks: change conda env for notebooks
* fix: deeptf roary input
* notebook: adjust label size and rotation
* test: add dry run for other subworkflows
* test: omit build database for now
* docs: add example report demo
* docs: add wrapper icon
* chore: bump version to v0.7.7
* fix: pin python version for duckdb
* fix: pin python 3.11 for duckdb
* fix: add unzip for dbt
* docs: simplify install guide
* fix: pin duckdb to 0.8.1 from conda
matinnuhamunada committed Oct 19, 2023
1 parent ee2e265 commit 5c4941d
Showing 20 changed files with 794 additions and 72 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/push.yml
@@ -46,6 +46,42 @@ jobs:
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

dry-run-report:
runs-on: ubuntu-latest
needs:
- linting
- formatting
steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive
- name: Dry-run workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .tests
snakefile: workflow/Report
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

dry-run-metabase:
runs-on: ubuntu-latest
needs:
- linting
- formatting
steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive
- name: Dry-run workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .tests
snakefile: workflow/Metabase
stagein: "conda config --set channel_priority strict"
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache -n"

unit-test:
runs-on: ubuntu-latest
needs:
19 changes: 11 additions & 8 deletions README.md
@@ -2,6 +2,8 @@
[![Snakemake](https://img.shields.io/badge/snakemake-≥7.14.0-brightgreen.svg)](https://snakemake.bitbucket.io)
[![PEP compatible](https://pepkit.github.io/img/PEP-compatible-green.svg)](https://pep.databio.org)
[![wiki](https://img.shields.io/badge/wiki-documentation-forestgreen)](https://github.com/NBChub/bgcflow/wiki)
[![bgcflow-wrapper](https://img.shields.io/badge/CLI-BGCFlow_wrapper-orange)](https://github.com/NBChub/bgcflow_wrapper)
[![example report](https://img.shields.io/badge/demo-report-blue)](https://nbchub.github.io/bgcflow_report_demo/)

`BGCFlow` is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

@@ -15,13 +17,14 @@ At present, `BGCFlow` is only tested and confirmed to work on **Linux** systems

Please use the latest version of `BGCFlow` available.
## Quick Start
A quick and easy way to use `BGCFlow` using [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper).
A quick and easy way to use `BGCFlow` using the command line interface wrapper:
[![bgcflow-wrapper](https://img.shields.io/badge/CLI-BGCFlow_wrapper-orange)](https://github.com/NBChub/bgcflow_wrapper)

1. Create a conda environment and install the [`BGCFlow` python wrapper](https://github.com/NBChub/bgcflow_wrapper) :

```bash
# create and activate a new conda environment
conda create -n bgcflow -c conda-forge python=3.11 pip openjdk -y
conda create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
conda activate bgcflow

# install `BGCFlow` wrapper
@@ -37,10 +40,6 @@ With the environment activated, install or setup this configurations:
```bash
conda config --set channel_priority disabled
conda config --describe channel_priority
```
- Java (required to run `metabase`)
```bash
conda install openjdk
```

3. Deploy and run BGCFlow, change `your_bgcflow_directory` variable accordingly:
@@ -52,7 +51,10 @@ bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
```

4. Build and serve interactive report (after `bgcflow run` finished). The report will be served in [http://localhost:8001/](http://localhost:8001/):
4. Build and serve interactive report (after `bgcflow run` finished). The report will be served in [http://localhost:8001/](http://localhost:8001/). A demo of the report is available here:

[![example report](https://img.shields.io/badge/demo-report-blue)](https://nbchub.github.io/bgcflow_report_demo/)

```bash
# build a report
bgcflow build report
@@ -65,7 +67,8 @@ bgcflow serve --project Lactobacillus_delbrueckii
```


- For detailed usage and configurations, have a look at the [`BGCFlow` WIKI](https://github.com/NBChub/bgcflow/wiki/) (`under development`) :warning:
- For detailed usage and configurations, have a look at the WIKI:
[![wiki](https://img.shields.io/badge/wiki-documentation-forestgreen)](https://github.com/NBChub/bgcflow/wiki)
- Read more about [`bgcflow_wrapper`](https://github.com/NBChub/bgcflow_wrapper) for a detailed overview of the command line interface.

[![asciicast](https://asciinema.org/a/595149.svg)](https://asciinema.org/a/595149)
6 changes: 5 additions & 1 deletion workflow/bgcflow/bgcflow/data/arts_extract_all.py
@@ -430,6 +430,7 @@ def extract_arts_knownhits(arts_knownhits_tsv, outfile_hits, genome_id=None):
pandas.DataFrame
A DataFrame representation of the extracted Known Resistance Hits information,
with each row representing a hit and columns representing hit attributes.
If the input DataFrame is empty, an empty DataFrame is returned.
Example
-------
@@ -465,7 +466,6 @@ def extract_arts_knownhits(arts_knownhits_tsv, outfile_hits, genome_id=None):
] = f"{str(genome_id)}__{str(model)}__{str(scaffold)}__{str(sequence_id)}"
arts_knownhits = (
arts_knownhits.drop(columns=["Sequence description"])
.set_index("index")
.rename(
columns={
"#Model": "model",
@@ -474,6 +474,10 @@
}
)
)
if arts_knownhits.empty:
logging.warning("ARTS Known Resistance Hits table is empty.")
else:
arts_knownhits = arts_knownhits.set_index("index")
arts_knownhits_dict = arts_knownhits.T.to_dict()
outfile_hits.parent.mkdir(parents=True, exist_ok=True)
logging.info(f"Writing Known Resistance Hits JSON to: {outfile_hits}")
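The hunk above corresponds to the commit message's "fix: handle empty known resistance table": `set_index("index")` is deferred until after an emptiness check, so an ARTS run with no Known Resistance Hits produces an empty JSON instead of failing. A minimal sketch of that guard pattern (the helper name `index_knownhits` is hypothetical, for illustration only):

```python
import logging

import pandas as pd


def index_knownhits(arts_knownhits: pd.DataFrame) -> dict:
    """Sketch of the empty-table guard added in this commit:
    only call set_index when the table has rows, so an empty
    ARTS result yields an empty dict rather than an error."""
    if arts_knownhits.empty:
        logging.warning("ARTS Known Resistance Hits table is empty.")
    else:
        arts_knownhits = arts_knownhits.set_index("index")
    # Transpose so each hit becomes one record keyed by its index label
    return arts_knownhits.T.to_dict()
```

With this guard, downstream JSON serialization works uniformly for both empty and populated tables.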
24 changes: 18 additions & 6 deletions workflow/bgcflow/bgcflow/data/get_dependencies.py
@@ -1,24 +1,36 @@
import json
import logging
import sys

log_format = "%(levelname)-8s %(asctime)s %(message)s"
date_format = "%d/%m %H:%M:%S"
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)

import yaml

# list of the main dependecies used in the workflow
dependencies = {
"antismash": r"workflow/envs/antismash.yaml",
"bigslice": r"workflow/envs/bigslice.yaml",
"cblaster": r"workflow/envs/cblaster.yaml",
"prokka": r"workflow/envs/prokka.yaml",
"mlst": r"workflow/envs/mlst.yaml",
"eggnog-mapper": r"workflow/envs/eggnog.yaml",
"roary": r"workflow/envs/roary.yaml",
"refseq_masher": r"workflow/envs/refseq_masher.yaml",
"seqfu": r"workflow/envs/seqfu.yaml",
"checkm": r"workflow/envs/checkm.yaml",
"gtdbtk": r"workflow/envs/gtdbtk.yaml",
}


def get_dependency_version(dep, dep_key):
def get_dependency_version(dep, dep_key, antismash_version="7"):
"""
return dependency version tags given a dictionary (dep) and its key (dep_key)
"""
if dep_key == "antismash":
logging.info(f"AntiSMASH version is: {antismash_version}")
if antismash_version == "6":
dep[dep_key] = "workflow/envs/antismash_v6.yaml"
logging.info(f"Getting software version for: {dep_key}")
with open(dep[dep_key]) as file:
result = []
documents = yaml.full_load(file)
@@ -38,14 +50,14 @@ def get_dependency_version(dep, dep_key):
return str(result)


def write_dependecies_to_json(outfile, dep=dependencies):
def write_dependecies_to_json(outfile, antismash_version, dep=dependencies):
"""
write dependency version to a json file
"""
with open(outfile, "w") as file:
dv = {}
for ky in dep.keys():
vr = get_dependency_version(dep, ky)
vr = get_dependency_version(dep, ky, antismash_version=antismash_version)
dv[ky] = vr
json.dump(
dv,
@@ -57,4 +69,4 @@ def write_dependecies_to_json(outfile, dep=dependencies):


if __name__ == "__main__":
write_dependecies_to_json(sys.argv[1])
write_dependecies_to_json(sys.argv[1], sys.argv[2])
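This change implements "fix: auto detect antismash major version and its env" from the commit message: the antiSMASH major version is now threaded through as a second CLI argument and switches which conda env file is read for the version pin. The selection logic can be sketched as follows (the function name `select_antismash_env` is hypothetical; the env paths mirror the diff above):

```python
def select_antismash_env(antismash_version: str = "7") -> str:
    """Sketch of the env auto-selection: antiSMASH major version 6
    gets its own pinned env file, everything else uses the default."""
    envs = {
        "6": "workflow/envs/antismash_v6.yaml",
        "7": "workflow/envs/antismash.yaml",
    }
    # Fall back to the default env for unrecognised majors,
    # matching the diff's behaviour of only special-casing "6"
    return envs.get(antismash_version, envs["7"])
```

The selected YAML is then parsed (as in `get_dependency_version` above) to report the pinned software version in the dependency metadata.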
11 changes: 8 additions & 3 deletions workflow/envs/bgcflow_notes.yaml
@@ -4,6 +4,7 @@ channels:
- bioconda
- defaults
dependencies:
- python>=3.9,<3.12
- jupyterlab>3
- ipywidgets>=7.6
- jupyter-dash
@@ -13,13 +14,17 @@ dependencies:
- python-kaleido
- bokeh
- openpyxl
- nbconvert
- nbconvert==6.4.2
- ipykernel
- ete3
- xlrd >= 1.0.0
- altair
- itables
- scikit-learn
- pyarrow
- pygraphviz
- pip
- pip:
- biopython
- jupyterlab-dash
- networkx
- alive_progress
- ../bgcflow
2 changes: 1 addition & 1 deletion workflow/envs/bigslice.yaml
@@ -41,4 +41,4 @@ dependencies:
- scipy==1.9.3
- six==1.16.0
- threadpoolctl==3.1.0
- git+https://github.com/medema-group/bigslice.git@103d8f2
- git+https://github.com/NBChub/bigslice.git@c0085de
6 changes: 5 additions & 1 deletion workflow/envs/dbt-duckdb.yaml
@@ -1,9 +1,13 @@
name: dbt-duckdb
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python==3.11
- python-duckdb==0.8.1
- unzip
- pip
- pip:
- dbt-duckdb==1.6.0
- dbt-metabase==0.9.15
- dbt-metabase==0.9.15
56 changes: 54 additions & 2 deletions workflow/notebook/bigslice.py.ipynb
@@ -21,6 +21,8 @@
"source": [
"import pandas as pd\n",
"from pathlib import Path\n",
"import shutil, json\n",
"from IPython.display import display, Markdown\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
@@ -33,7 +35,57 @@
"metadata": {},
"outputs": [],
"source": [
"report_dir = Path(\"../\")"
"report_dir = Path(\"../\")\n",
"bgcflow_dir = report_dir / (\"../../../\")\n",
"envs = bgcflow_dir / \"workflow/envs/bigslice.yaml\"\n",
"\n",
"metadata = report_dir / \"metadata/dependency_versions.json\"\n",
"with open(metadata, \"r\") as f:\n",
" dependency_version = json.load(f)\n",
"\n",
"# Define the destination path\n",
"destination_path = Path(\"assets/envs/bigslice.yaml\")\n",
"\n",
"# Ensure the destination directory exists\n",
"destination_path.parent.mkdir(parents=True, exist_ok=True)\n",
"\n",
"# Copy the file\n",
"shutil.copy(envs, destination_path);"
]
},
{
"cell_type": "markdown",
"id": "43ca282f",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"You can start the BiG-SLICE flask app to view the clustering result.\n",
"\n",
"Steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed378d7c-7d1a-4715-a06d-8850c1344387",
"metadata": {},
"outputs": [],
"source": [
"run_bigslice=f\"\"\"- Install the conda environment:\n",
"\n",
"```bash\n",
" conda install -f {report_dir.resolve()}/docs/assets/envs/bigslice.yaml\n",
"```\n",
"\n",
"- Run the app\n",
"\n",
"```bash\n",
" port='5001'\n",
" conda run -n bigslice bash {report_dir.resolve()}/cluster_as_{dependency_version[\"antismash\"]}/start_server.sh $port\n",
"```\n",
"\"\"\"\n",
"display(Markdown(run_bigslice))"
]
},
{
@@ -69,7 +121,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
"version": "3.9.18"
}
},
"nbformat": 4,
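The notebook cells above ("notebook: better instruction to start bigslice server") read the recorded antiSMASH version from `metadata/dependency_versions.json` and render shell instructions for launching the BiG-SLICE server. The core of that cell can be sketched as a plain function (the name `bigslice_server_command` and its exact return format are illustrative, not part of the notebook):

```python
import json
from pathlib import Path


def bigslice_server_command(metadata_file: Path, report_dir: Path,
                            port: str = "5001") -> str:
    """Sketch of the notebook cell: look up the antiSMASH version
    recorded in dependency_versions.json and format the command
    that starts the BiG-SLICE server for that clustering result."""
    with open(metadata_file) as f:
        dependency_version = json.load(f)
    # The clustering output directory is named after the antiSMASH version
    cluster_dir = report_dir / f"cluster_as_{dependency_version['antismash']}"
    return f"conda run -n bigslice bash {cluster_dir}/start_server.sh {port}"
```

In the notebook this string is wrapped in a Markdown code fence and rendered with `IPython.display.Markdown`, so users can copy-paste the command directly from the report.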