ChromOptimise is a pipeline that identifies the optimum number of states that should be used with ChromHMM's LearnModel command for a particular genomic dataset.

For more specific information, please head over to the wiki.

Motivation

When using ChromHMM to learn hidden Markov models for genomic data, it is often difficult to determine how many states to include:

  • Including too many states overfits your data and introduces redundant states
  • Including too few states underfits your data and lowers model accuracy

This pipeline identifies the optimal number of states by finding a model that is neither overfitted (no redundant states) nor underfitted.

After using this pipeline, the user will have a better understanding of their dataset in the context of ChromHMM, which will allow them to make more informed decisions in downstream analysis.

Getting started

  1. Clone this repository
  2. Ensure all required software is installed
  3. If using LDSC, download 1000 genomes files (or similar) from this repository
  4. Copy the configuration files to a memorable location (recommended: next to your data) and then fill them in using the templates provided. DO NOT CHANGE THE NAMES OF THESE FILES.
    • If you prefer, you can edit the files where they already are. The suggestion to move them is to accommodate having multiple configs for different projects.
  5. Run the setup executable, providing the path to the directory containing the config files as the first argument:
./setup path/to/configuration/directory
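
Before running setup, it can help to confirm that the directory you pass actually exists and contains files. The following is a minimal sketch only: the helper `check_config_dir` is hypothetical and not part of ChromOptimise, and it makes no assumption about the names of the config files.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ChromOptimise: verify that a
# configuration directory exists and is not empty before running setup.
check_config_dir() {
    local config_dir="$1"
    if [ ! -d "$config_dir" ]; then
        echo "Not a directory: $config_dir" >&2
        return 1
    fi
    # ls -A lists every entry except . and .., so empty output means no files
    if [ -z "$(ls -A "$config_dir")" ]; then
        echo "No files found in: $config_dir" >&2
        return 1
    fi
    return 0
}
```

With the function defined in your shell, `check_config_dir path/to/configuration/directory && ./setup path/to/configuration/directory` only runs setup when the check passes.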

Usage

After completing 'Getting started', run the master script (ChromOptimise.sh) from the command line with:

bash ChromOptimise.sh path/to/your/configuration/directory

Alternatively, you can run each of the shell scripts in JobSubmission sequentially, for example:

sbatch 1_BinarizeFiles.sh path/to/your/configuration/directory
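
If you submit the scripts yourself, each step should generally only start once the previous one has finished successfully. The sketch below chains them with SLURM job dependencies. It assumes that the scripts in JobSubmission carry numeric prefixes reflecting their order (as 1_BinarizeFiles.sh does), that each one only needs the configuration directory as its argument, and that GNU `sort -V` is available. The function name `submit_chain` and the `SBATCH` override variable are hypothetical and not part of ChromOptimise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: submit every numbered script in a directory in order,
# making each job wait for the previous one to finish successfully.
# SBATCH can be overridden (e.g. for a dry run); it defaults to sbatch.
submit_chain() {
    local script_dir="$1" config_dir="$2"
    local sbatch_cmd="${SBATCH:-sbatch}"
    local previous_job="" job_id script

    # sort -V orders 2_... before 10_... (version-style numeric sort)
    while IFS= read -r script; do
        if [ -z "$previous_job" ]; then
            job_id=$("$sbatch_cmd" --parsable "$script" "$config_dir") || return 1
        else
            job_id=$("$sbatch_cmd" --parsable \
                --dependency="afterok:${previous_job}" \
                "$script" "$config_dir") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the id
        previous_job="${job_id%%;*}"
        echo "Submitted $(basename "$script") as job $previous_job"
    done < <(find "$script_dir" -maxdepth 1 -name '[0-9]*_*.sh' | sort -V)
}
```

With the function defined, `submit_chain JobSubmission path/to/your/configuration/directory` would submit the whole chain. `afterok` means a step is only released if the previous job exited with code 0, so a failed step stops the chain rather than running later steps on incomplete output.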

For further information please see the pipeline explanation.

There are also supplementary scripts that provide further information on your chosen dataset. Most importantly, the thresholds used in redundancy analysis can be inferred from the results of Redundancy_Threshold_Optimisation. Further details on these scripts can be found in the wiki.

Software requirements

This pipeline requires a unix-flavoured OS with the following software installed:

Additionally, conda environments are created for you to obtain:

  • R v4.4.1
  • java-jdk v8.0.112
  • bedtools v2.27.1

Further information

This study makes use of data generated by the Blueprint Consortium. A full list of the investigators who contributed to the generation of the data is available from www.blueprint-epigenome.eu. Funding for the project was provided by the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 – BLUEPRINT.

For any further enquiries, please open an issue or contact Sam Fletcher:
s.o.fletcher@exeter.ac.uk