FAQ (Frequently Asked Questions)
How do I use BGCFlow to run a large number of genomes?
The upper limit on the number of genomes that BGCFlow can handle depends primarily on the computational resources available. In our experience, a Linux machine with 16 CPUs and 64 GB RAM is sufficient to handle ~100-1000 bacterial genomes.
Nevertheless, there are several factors to consider when scaling up:
- Computational Resources: The upper limit of genome analysis capacity is largely determined by the specifications of the machine or computing environment used. More powerful hardware can handle larger datasets more efficiently.
- High-Performance Computing (HPC) Support: For projects involving thousands of genomes, BGCFlow is designed to be compatible with HPC environments. It can be run on HPC systems using SLURM by leveraging Snakemake profiles, which allows for efficient distribution of computational tasks across multiple nodes.
- Clustering Approach: The choice of clustering approach can significantly impact the scalability of genome analysis. BiG-SCAPE, for example, performs pairwise comparisons, leading to a roughly quadratic increase in computational demand as the number of genomes increases. A more scalable approach is to use BiG-SLICE to identify clusters of interest initially and then apply BiG-SCAPE for detailed analysis on a selected subset of genomes.
- Technical Considerations:
  - Snakemake Limitations: One limitation is the creation of the Directed Acyclic Graph (DAG), which can become slow with very large numbers of tasks. This issue can be mitigated by dividing the workflow into smaller batches.
  - Database Handling: As the number of genomes increases, the choice of database technology becomes crucial, and the provided Jupyter notebook templates might not be the right tool for this type of analysis. This is one of the reasons why we incorporate Metabase, which can be connected to most modern database systems for efficient analytics and visualization. While the native database chosen for BGCFlow is DuckDB (which is still capable of handling 1 million rows of data), larger projects may benefit from a production-level database such as PostgreSQL.
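The scaling argument for all-vs-all clustering can be checked with a little arithmetic. The sketch below is illustrative only: `pairwise_comparisons` is a hypothetical helper, not part of BGCFlow or BiG-SCAPE, and it assumes that every item (for example, every BGC) is compared exactly once with every other item.

```python
def pairwise_comparisons(n_items: int) -> int:
    """Number of unordered pairs in an all-vs-all comparison of n_items."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    return n_items * (n_items - 1) // 2


if __name__ == "__main__":
    # Hypothetical item counts, chosen only to illustrate the growth rate.
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} items -> {pairwise_comparisons(n):>12,} comparisons")
```

Under this assumption, a tenfold increase in items costs roughly a hundredfold more comparisons, which is why narrowing the dataset to a subset of interest before the detailed pairwise step pays off.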
I did not ask BGCFlow to run `automlst_wrapper`, but it still runs `automlst_wrapper` when I just run `roary`?
While pangenome analysis with Roary does not require the output of autoMLST, the BGCFlow `roary` pipeline includes an additional visualization step that needs the phylogenetic tree built by `automlst_wrapper`. Hence, `automlst_wrapper` is triggered by Snakemake.
Can I use GenBank files instead of FASTA as input?
Yes. To use GenBank files as input, add the variable `input_type` to the project configuration file. You can find the project configuration file under the `config` folder:
config/
├── Lactobacillus_delbrueckii
│ ├── project_config.yaml # --> open this file (project configuration file) using a text editor
│ └── samples.csv
└── config.yaml
Using a text editor, add the `input_type` variable in `project_config.yaml`:
name: Lactobacillus_delbrueckii_gbk
pep_version: 2.1.0
description: "Lactobacillus delbrueckii 27 01 2023"
sample_table: samples.csv
input_type: "gbk"
#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
...
Can I specify a custom location for my input files?
Yes, there are several ways to specify a custom location or directory for your input files:
- Add the variable `input_folder` to the project configuration file.
You can find the project configuration file under the config folder:
config/
├── Lactobacillus_delbrueckii
│ ├── project_config.yaml # --> open this file (project configuration file) using a text editor
│ └── samples.csv
└── config.yaml
Using a text editor, add the `input_folder` variable in `project_config.yaml`:
name: Lactobacillus_delbrueckii_gbk
pep_version: 2.1.0
description: "Lactobacillus delbrueckii 27 01 2023"
sample_table: samples.csv
input_folder: "my_custom_input_folder"
#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
...
I cannot resume BGCFlow runs because the workflow is locked after I manually ended the run. How do I resume the workflow?
If you're having trouble resuming interrupted runs, it's likely due to Snakemake's directory locking mechanism. This feature is designed to prevent concurrent access that could compromise the data integrity of ongoing analyses. To unlock the workflow and allow it to resume, you can use the `--unlock` parameter with Snakemake or use the BGCFlow CLI command `bgcflow run --unlock` (available for `bgcflow wrapper` version `v0.2.6` and above). We recommend always using the latest release (currently `v0.3.5`), which is available from PyPI.
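The choice between the two unlock commands can be sketched as a small version check. This is an illustration under stated assumptions, not part of BGCFlow: `parse_version` and `unlock_command` are hypothetical helpers, and the parser assumes plain numeric versions such as `v0.2.6` (no pre-release suffixes).

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v0.2.6' or '0.2.6' into a comparable tuple such as (0, 2, 6)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def unlock_command(wrapper_version: str) -> list[str]:
    """Pick the unlock command based on the installed wrapper version."""
    # v0.2.6 is the first wrapper release that the FAQ lists as supporting --unlock.
    if parse_version(wrapper_version) >= parse_version("v0.2.6"):
        return ["bgcflow", "run", "--unlock"]
    # Older wrappers: fall back to calling Snakemake directly.
    return ["snakemake", "--unlock"]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(unlock_command("v0.3.5")))
```

The function only builds the command; running it (for example with `subprocess.run`) is left to the caller so that nothing is unlocked by accident.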
I cannot run BGCFlow on `tmux`. I got this error: `subprocess.CalledProcessError: Command 'conda info --json' returned non-zero exit status 127`. How do I fix it?
This error typically occurs when `tmux` doesn't load all the path variables from your `.bashrc` file correctly. Exit status 127 is the shell's code for "command not found": the `conda info --json` command is failing because it can't find the `conda` command in the system's `PATH`.
The `.bashrc` file is a script that runs every time you open a new terminal session. It's used to configure your shell, including setting up path variables. When you run `tmux`, it starts a new shell, but it doesn't necessarily run the `.bashrc` script, depending on your configuration.
To resolve this issue, you can instruct `tmux` to source the `.bashrc` file every time it starts a new shell. This can be done by adding the following line to your `.tmux.conf` file:
set -g default-command "source ~/.bashrc; bash"
This line sets the default command for new shells in `tmux` to first source the `.bashrc` file (which loads your path variables), and then start a new bash shell. After adding this line, `tmux` should be able to find the `conda` command, and you should be able to run BGCFlow without encountering the error.
Remember to replace `~/.bashrc` and `.tmux.conf` with the actual paths to these files on your system if they are located elsewhere.
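To confirm whether `conda` is actually visible inside a given shell (for example, a fresh `tmux` pane), a quick check along these lines may help. It is a minimal sketch using only the Python standard library; `find_on_path` is a hypothetical helper name, not a BGCFlow function.

```python
import shutil


def find_on_path(command: str):
    """Return the full path of `command` if it is on PATH, else None."""
    return shutil.which(command)


if __name__ == "__main__":
    location = find_on_path("conda")
    if location is None:
        print("conda is not on PATH in this shell; BGCFlow would likely fail here")
    else:
        print(f"conda found at {location}")
```

Running the script once in a regular terminal and once inside `tmux` shows whether the two environments differ.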