-
Notifications
You must be signed in to change notification settings - Fork 9
04 Building and Serving OLAP Database
BGCFlow integrates various bionformatic tools for the analysis of genomes and BGCs. Many of these tools generate table, or table-like outputs. Using bgcflow build database
we can extract data from all this tables, build a parquet file, and load them into DuckDB (SQL OLAP Database) using data build tools (DBT).
This step will generate:
- Parquet tables, located in
data/processed/<PROJECT NAME>/data_warehouse/
- DuckDB SQL database, located in
data/processed/<PROJECT NAME>/dbt/antiSMASH_<ANTISMASH_VERSION>/duckdb.db
For interactive exploration and visualization of the tables, BGCFlow utilizes Metabase, a popular business analytics platform.
To run metabase, run:
bgcflow serve --metabase
If you are running Metabase for the first time, you will be given a default admin id and password. Please change this once you access the admin panel of Metabase. Note that this password will only shown once!
$ bgcflow serve --metabase
(cd resources/metabase && java -Xms2g -Xmx8g -DMB_SETUP_TOKEN=ad0fb086-351b-4fa5-a17e-76282d2c9753 -jar metabase_v0.46.2.jar)
Metabase is running - pid: 20201
Connecting to metabase...
Retrying to connect: 1x
Retrying to connect: 2x
Retrying to connect: 3x
Metabase setup successful.
Login account: admin@bgcflow.com
Password: Q&@Gt853Q7Ae
Please change the password once you logged in.
Metabase is running at: http://localhost:3000
To add database connections, see: https://www.metabase.com/docs/latest/databases/connecting
Press q to quit Metabase server:
This will start a Metabase server with DuckDB plugin that is served in http://localhost:3000. Follow the Metabase documentation to load and explore the data.
In BGCFlow_wrapper 0.3.1, we introduce a script to synchronize the dbt schema with the DuckDB data in metabase.
$ bgcflow sync -h
Usage: bgcflow sync [OPTIONS] PROJECT_NAME
Upload and sync DuckDB database to Metabase.
Options:
--dbt-excludes TEXT A list of dbt models to exclude from the
synchronization.
--metabase-http Use HTTP instead of HTTPS to connect to Metabase.
--dbt-database TEXT The name of the dbt database to use.
--metabase-database TEXT The name of the Metabase database to use.
--dbt-schema TEXT The name of the dbt schema to use.
--mb-password TEXT The Metabase password. If None, the user will be
prompted to enter their password.
--mb-username TEXT The Metabase username. If None, the user will be
prompted to enter their username.
--metabase-host TEXT The URL of the Metabase server.
--dbt-dir DIRECTORY The directory containing the dbt project to
upload. If None, the directory is inferred from
the BGCFlow project directory.
--bgcflow-dir DIRECTORY The root directory of the BGCFlow project.
-h, --help Show this message and exit.
- project-name: The name of the project to upload to Metabase.
- --bgcflow-dir: The root directory of the BGCFlow project. Default is the current directory.
- --dbt-dir: The directory containing the dbt project to upload. If None, the directory is inferred from the BGCFlow project directory.
- --metabase-host: The URL of the Metabase server. Default is http://localhost:3000.
- --mb-username: The Metabase username. If None, the user will be prompted to enter their username.
- --mb-password: The Metabase password. If None, the user will be prompted to enter their password.
- --dbt-schema: The name of the dbt schema to use. Default is main.
- --metabase-database: The name of the Metabase database to use. If None, the project name is used.
- --dbt-database: The name of the dbt database to use. Default is dbt_bgcflow.
- --metabase-http: Whether to use HTTP instead of HTTPS to connect to Metabase.
bgcflow sync Lactobacillus_delbrueckii
You can also upload the generated database to Motherduck.
python workflow/scripts/motherduck.py -h
usage: motherduck.py [-h] [--database_name DATABASE_NAME] database_path
Upload a local DuckDB database to a remote DuckDB database.
positional arguments:
database_path The path to the local DuckDB database.
options:
-h, --help show this help message and exit
--database_name DATABASE_NAME
The name of the remote database to create or replace. Default is "bgcflow".