Development and application of DNA fragility prediction engine based only on sequence-context.

SahakyanLab/DNAFragility_ML

DNAFragility

Development and application of the generalised DNA fragility prediction engine based only on sequence-context.

Setup

Clone the project:

git clone https://github.com/SahakyanLab/DNAFragility_ML.git

Please follow the instructions below on how to acquire the public datasets, set up the directory structure, and install the software necessary to run all the studies from the publication. At the end of this README file, you can find two separate bash script commands that run the majority of the setup and run the calculations sequentially.

1. Software requirements

The resource-demanding computations were performed on a single NVIDIA RTX A6000 GPU with 40GB RAM. The developed workflows and analyses employed the R programming language 4.3.2 and Python 3.9.12.

Please run the below script to install the latest versions of the R and Python packages necessary to perform the calculations and analyses.

bash ./setup/install_packages.sh

Please also download and install the below software.

Edlib

Secondary structure folding parameter file

ggpattern

Please note, if you are using Ubuntu, you may have trouble installing the ggpattern R package. However, the below steps have worked for us. Steps 1 to 4 are run in the terminal and steps 5 to 8 in an R session.

  1. sudo apt-get update
  2. sudo apt-get install libmagick++-dev
  3. sudo apt install libgdal-dev
  4. sudo apt-get install -y libudunits2-dev
  5. install.packages("units")
  6. install.packages("sf")
  7. install.packages("gridpattern")
  8. install.packages("ggpattern")

2. Public files to download

Cancer-associated DNA strand breaks

We retrieved all the somatic mutation data of both the non-coding and coding regions associated with cancer from the Catalogue of Somatic Mutations in Cancer (COSMIC) database, including Non-Coding Variants, Cancer Gene Census, and Breakpoints (structural variants) datasets obtained from release v98, May 2023.

These versions can be downloaded from the following links:

Unpack and extract the relevant files. Place the contents into the COSMIC/data/COSMIC/ folder. Please note, the first two files above were very large due to the SNPs, so we removed the SNPs and saved the results with the "PROCESSED" suffix. We suggest you do this too, unless you have sufficient memory to load and process them all.

ClinVar SVs and SNPs

We obtained SVs and SNPs that had a clinically associated pathogenic or benign label from the variant_summary.txt.gz file, downloaded from the ClinVar database on 6 December 2023. Please note, this file gets updated weekly.

Unpack and extract the relevant files. Place the contents into 04_ClinVar/ folder.
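The label filter described above can be sketched in the shell as follows. This is only an illustration and not the pipeline used in the study: the function name filter_clinvar is hypothetical, and both the column name ClinicalSignificance and the exact label values "Pathogenic" and "Benign" are assumptions about the variant_summary.txt layout that you should check against your download.

```bash
# Sketch: keep rows whose clinical significance is exactly "Pathogenic" or
# "Benign". The column name and label values are assumptions, not taken
# from the study's own code.
filter_clinvar() {
  # Reads tab-separated text on stdin; writes the header plus matching rows.
  awk -F'\t' -v col="ClinicalSignificance" '
    NR == 1 {
      # Locate the column by name rather than by a fixed position.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    $idx == "Pathogenic" || $idx == "Benign"
  '
}

# Example (paths are placeholders):
#   gunzip -c 04_ClinVar/variant_summary.txt.gz | filter_clinvar > filtered.tsv
```

Looking the column up by its header name keeps the sketch working if the column order changes between the weekly releases.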

Annotation of genomic and genic features on the human genome

To download CpG islands and Isochores from the UCSC Table Browser, please select the following:

  • CpG Islands. clade: Mammal, genome: Human, assembly: Jan 2022 (T2T CHM13v2.0/hs1), group: All Tracks, track: CpG Islands, table: hub_3671779_cpgIslandExtUnmasked, output format: BED - browser extensible data, output filename: output_CpG_Islands.csv.
  • Isochores. clade: Mammal, genome: Human, assembly: May 2004 (NCBI35/hg17), group: All Tracks, track: Isochores, table: ct_Isochores_9145, output format: BED - browser extensible data, output filename: iso_hg17.bb.

Unpack and extract the relevant files from above. Place the contents into COSMIC/data/annotations/ folder.

Chromothripsis breakpoint events

We obtained the chromothripsis breakpoint cases from ChromothripsisDB. Please download the dataset from Download -> Full Dataset -> Chromothripsis case data.

Unpack and extract the relevant files from above. Place the contents into 03_Chromothripsis/data folder.

Transcription Factor data

We retrieved 247 core-validated vertebrate transcription factor binding sites (TFBS) from the JASPAR 2024 database.

Unpack and extract the relevant files from above. Place the contents into data/TFBS/ folder.

3. Liftover files

We processed all datasets in the reference genome version used as per the deposition. When doing comparative analysis, we lifted the genomic coordinates over to the latest T2T genome assembly.

Unpack and extract the relevant files. Place the contents into COSMIC/data/liftover/ folder.
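The destination folders named in sections 2 and 3 can be created in one go before downloading anything. The sketch below assumes that all of these paths are relative to the repository root; the function name make_data_dirs is hypothetical and is not part of the repository's setup scripts.

```bash
# Sketch: create the data folders named in sections 2 and 3.
# Assumption: every path is relative to the repository root.
make_data_dirs() {
  local root=${1:-.}
  local dir
  for dir in \
    COSMIC/data/COSMIC \
    COSMIC/data/annotations \
    COSMIC/data/liftover \
    03_Chromothripsis/data \
    04_ClinVar \
    data/TFBS; do
    # -p creates parent folders and is a no-op if the folder already exists.
    mkdir -p "$root/$dir"
  done
}

# Example: run from the repository root.
#   make_data_dirs .
```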

4. Reference sequences

We processed all datasets in the reference genome version used as per the deposition. For Kmertone, the individual fasta files were needed. This GitHub repo is dependent on the results of DNAFragility_dev, where the reference genomes are downloaded already.

5. DNAfrAIlib feature library

The genomic sequence-based octameric features can be downloaded from the DNAfrAIlib repo. The quantum mechanical hexameric parameters can be downloaded from DNAkmerQM.

This is set up automatically if you run the below bash script.

bash get_feature_lib.sh

6. Notes on 00_ML_proof_of_concept folder

To run the 00_ML_proof_of_concept work, you need to have two datasets downloaded and processed following the method from DNAFragility_dev. The demonstration used in the paper is based on this study with data deposited on the GEO database. We specifically used DMSO-treated, endogenous DNA fragility in K562 cells. You can also run it on the etoposide-treated DNA fragility in K562 cells enriched at topoisomerase II sites.

For any ML task, you require the genomic sequence range of influence for each of the short-, medium-, and long-range effects. Depending on the dataset used, some datasets had to be pre-processed to handle 5'-3' DNA strand breaks. Hence, running the full DNAFragility_dev study beforehand is strongly advised.

Alternatively, if you wish to skip the DNAFragility_dev process, and just want to process these DNA strand breaks for the present study, please run the below bash script.

bash get_MLdemo_datasets.sh

Optional additional studies

De novo motif discovery with Homer

We used the Homer software for motif discoveries, including de novo ones. We use the R package marge to interface with Homer. Below are two suggestions of downloading and installing the relevant source codes, as Option 1 may fail, depending on your operating system.

Option 1. To install marge, please follow the instructions from the GitHub page here.

Marge relies on a local installation of Homer. To install for your operating system, please follow the instructions from their website here.

Inside the 09_HOMER/lib/ environment, we used the below commands.

# In the terminal, we used the following commands.
mkdir lib
wget -P lib/ http://homer.ucsd.edu/homer/configureHomer.pl
# Install Homer itself, then the hg19 genome package.
perl /path/to/homer/configureHomer.pl -install homer
perl /path/to/homer/configureHomer.pl -install hg19
# Open ~/.bashrc, add the PATH line below to it, then reload it.
vi ~/.bashrc
PATH=$PATH:/path/to/homer/lib
source ~/.bashrc

In the 09_HOMER/Process.R file, we used the below commands.

# In the R environment, we used the following commands.
devtools::install_github('robertamezquita/marge', ref = "master", force = TRUE)
homer_path = "/path/to/homer/lib"

options('homer_path' = homer_path)
library(marge)
check_homer()

Option 2. Depending on your operating system, the above installation may not work. Below is a workaround that has worked in our case. Download the ZIP master file from the GitHub page here. Then, inside marge-master/R/check_homer.R, change the following line from loc <- system('type -a findMotifsGenome.pl', intern = TRUE) to loc <- system('type findMotifsGenome.pl', intern = TRUE). Then, run the below.

# path to local master R package from GitHub
path_to_file = "path/to/marge-master"

devtools::install(
  pkg = path_to_file,
  quiet = FALSE,
  force = TRUE
)

library(marge)
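The one-line change to check_homer.R described in Option 2 can also be applied from the terminal instead of by hand. The following is a minimal sketch; the function name patch_check_homer is hypothetical, and it assumes the line in your copy of marge matches the text quoted above exactly.

```bash
# Sketch: replace 'type -a findMotifsGenome.pl' with 'type findMotifsGenome.pl'
# in the given file. Assumes the original line matches the quoted text exactly.
patch_check_homer() {
  local file=$1 tmp status
  tmp=$(mktemp) || return 1
  # Write to a temporary file first, since in-place sed flags differ between
  # GNU and BSD sed; copying back with cat keeps the file's permissions.
  sed "s/system('type -a findMotifsGenome.pl'/system('type findMotifsGenome.pl'/" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (path is a placeholder):
#   patch_check_homer path/to/marge-master/R/check_homer.R
```

Run it before the devtools::install() call so that the patched source is the one that gets installed.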

If the above steps have been successfully implemented, you can run this optional study by going into the 09_HOMER/ folder, then running the below bash script. Please edit the homer_path inside this file to the path of your saved location.

bash submit.sh

Nullomer sequence fragility

Here, we use 13 statistically significant nullomer sequences from this paper, downloaded from here on 1st Feb 2024. Under Download, select Genomic MAWs. Download the Genomic_MAWs.tsv file and place it into the 06_Nullomers/data/ folder.

The workflow was the following. First, we wanted to evaluate whether nullomer sequences can bring fragility to a genomic region. Second, we wanted to pinpoint this to the nullomer sequence specifically, by searching for sequences in the human genome that mismatch by 1 base to the nullomer, introduce a SNP to generate the nullomer, and evaluate the change in sequence fragility.
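To make the second step concrete, the sequences that differ from a nullomer by exactly one base can be enumerated as below. This is an illustrative sketch rather than the code used in 06_Nullomers/: the function name one_mismatch_neighbours is hypothetical, and the input is assumed to be an uppercase sequence over A, C, G and T. Each printed sequence is a candidate to search for in the genome, where a single SNP would turn it into the nullomer.

```bash
# Sketch: print every sequence that differs from the input by exactly one base.
# Assumes an uppercase sequence over the alphabet A, C, G, T.
one_mismatch_neighbours() {
  printf '%s\n' "$1" | awk '{
    n = split("A C G T", bases, " ")
    for (i = 1; i <= length($0); i++)
      for (b = 1; b <= n; b++)
        if (substr($0, i, 1) != bases[b])
          # Prefix, substituted base, suffix.
          print substr($0, 1, i - 1) bases[b] substr($0, i + 1)
  }'
}

# Example: a sequence of length k yields 3 * k neighbours.
#   one_mismatch_neighbours ACGT
```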

If the above steps have been successfully implemented, you can run this optional study by going into the 06_Nullomers/ folder, then running the below bash script.

bash submit.sh

Other notes

  • All cpp files are interfaced via the Rcpp library in R, with omp.h when possible. Please ensure you have these installed.
  • RcppArmadillo.h and RcppEigen.h are necessary for the fast feature extraction process. Please ensure you have them installed. By default, they are not used, in case you have not installed them.
  • Various model predictions have been deposited if the compressed file size was within the GitHub file size limit. If you wish to view and/or use them, please gunzip the files.
  • While this repo can run as a standalone study, the results are dependent on DNAFragility_dev. When possible, we have deposited the necessary dependent files.
  • When you run the run_dnafragility.sh bash script, you will need to include the path to the ViennaRNA RNAfold programme as the first argument. Some operating systems allow you to interface it directly via RNAfold, others require the literal path to the programme.
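For the deposited predictions mentioned above, a small helper can decompress every .gz file under a folder while keeping the compressed originals. This is a sketch under assumptions: the function name decompress_predictions is hypothetical, and the -k (keep) option needs a reasonably recent gzip.

```bash
# Sketch: decompress all .gz files under a folder, keeping the originals.
# Assumption: your gzip supports -k (keep input files).
decompress_predictions() {
  local dir=${1:-.}
  find "$dir" -type f -name '*.gz' -exec gzip -dk {} +
}

# Example (folder is a placeholder): limit the call to the folder you need,
# so that unrelated downloads are left compressed.
#   decompress_predictions path/to/predictions
```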

Run all setup files

If you wish to run all setups, including all the aforementioned bash scripts, please run the below bash script.

bash run_all_setup_files.sh

Run the full DNAFragility_ML study

Please note that many of the calculations were computationally intensive, particularly the 01_LGBM_FullGenome and 05_DeltaFragility folders. Most things were run in parallel in smaller batches. However, if you submit the below bash script, it runs all scripts sequentially. This can take several months to complete. Most tasks take up several tens to hundreds of GBs worth of RAM. The entire study requires between 2-4 TB of hard drive space.

You may need to monitor your memory usage, memory cache, and swap to ensure calculations run smoothly.
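Given the 2-4 TB disk requirement, it can help to check the free space before starting. The sketch below is not part of the repository; the function name has_free_space is hypothetical, and it relies on the POSIX output format of df -Pk, where the fourth column of the second line is the available space in kilobytes.

```bash
# Sketch: succeed if the filesystem holding <dir> has at least <min_kb>
# kilobytes available. Assumes a mount name without spaces in df output.
has_free_space() {
  local dir=$1 min_kb=$2 avail_kb
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$min_kb" ]
}

# Example: 2 TB is 2 * 1024^3 = 2147483648 KB.
#   has_free_space . 2147483648 || echo "Less than 2 TB free" >&2
```

For memory, cache, and swap, standard tools such as free -h or top can be left running alongside the calculations.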

Arguments

  • RNAfold_path Path to the RNAfold programme for secondary structure prediction.
  • fast_matrix If TRUE, will use fast RcppArmadillo matrix calculations. Default FALSE.

bash run_dnafragility.sh $RNAfold_path $fast_matrix
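Because some systems expose RNAfold on the PATH while others need the literal path, the first argument can be resolved up front. This is a minimal sketch; the function name resolve_rnafold is hypothetical and is not part of run_dnafragility.sh.

```bash
# Sketch: print a usable RNAfold location, or fail with a message.
# With no argument it looks for RNAfold on the PATH; with an argument it
# checks that the given name or path is executable.
resolve_rnafold() {
  local candidate=${1:-RNAfold}
  command -v "$candidate" || {
    echo "RNAfold not found: $candidate" >&2
    return 1
  }
}

# Example: pass the resolved path as the first argument.
#   RNAfold_path=$(resolve_rnafold) || exit 1
#   bash run_dnafragility.sh "$RNAfold_path" FALSE
```

Failing early here avoids starting a long sequential run that would stop at the first secondary structure prediction.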
