Usage: python holistic.py <in.amplicons.bed> <in.gothic.csv> <float: q-value threshold> <autosomes only Y|N>
in.amplicons.bed: BED file containing coordinates of identified eccDNA segments (amplicons). Column 4 should uniquely identify the contig to which the amplicon belongs.
in.gothic.csv: CSV file containing GOTHiChicup output dumped from R data frame.
q-value threshold: Count interaction as significant if q-value is less than or equal this value. (Recommended: 0.05).
autosomes?: If Y, HolistIC will look for interactions on the whole genome including X and Y chromosomes.
Tested with Python 2.7 but Python 3+ should work too
Python Networkx: https://networkx.github.io/documentation/stable/install.html
matplotlib: https://matplotlib.org/users/installing.html#installation-guide
rpy2: https://pypi.org/project/rpy2/
The sample BED file is provided in this repository. The sample CSV file can be downloaded here: https://s3.amazonaws.com/meh52-B1/Simulated.hic.gothic.results.csv.gz
python holistic.py dmfinder.HG38.wgs.results.SORTED.bed Simulated.hic.gothic.results.csv 0.05 Y
The BED file contains WGS-based coordinates of contiguous regions of copy number amplification, organized by contig. The CSV file is the dumped output from the GOTHiChicup function of the GOTHiC package. The input to HolistIC is data assumed to come from the same individual.
The BED file must be a sequence of records of the following format:
chr start end ID
Assume that we have the following NGS-based predicted double minutes (i.e. where we know the breakpoints of the amplicons and their adjacencies):
The BED file will contain the following records:
chr5 3000000 3300000 DM1
chr8 28000000 28500000 DM1
chr19 40000000 40200000 DM1
chr4 2000000 2300000 DM2
chr11 6000000 6800000 DM2
chr18 20000000 20200000 DM2
The BED file can be created using the output from programs that reconstruct circular eccDNA architectures (like AmpliconArchitect or CouGaR). HolistIC does not assume that all connected amplicons are within a circular contig, so incomplete predicted architectures can be considered too.
WARNING: Within each unique ID in the 4th column, the records must first be sorted in lexicographic order by chromosome, and then in ascending numeric order by the interval start position. This will not preserve the adjacencies provided in the BED file, but this does not adversely affect HolistIC's ability to segregate amplicons by contig. The input BED file can be sorted with the following command:
sort -k4,4 -k1,1 -k2,2 in.bed > in.sorted.bed
We assume that significant chromatin interactions are determined with GOTHiC. The CSV file is the dumped output from an R dataframe generated by the GOTHiChicup function:
https://master.bioconductor.org/packages/release/bioc/manuals/GOTHiC/man/GOTHiC.pdf
Assume that we have the following double minutes:
Due to ambiguities inherent in WGS-only predictions, the above double minutes could have resulted in the following input BED file, where all amplicons are erroneously assumed to be part of the same double minute (remember that the records are lexicographically sorted by chromosome and then sorted ascending by start position of interval):
chr1 8000000 8900000 DM1
chr3 7000000 7400000 DM1
chr5 3000000 3300000 DM1
chr6 17000000 17800000 DM1
chr18 12000000 13000000 DM1
This is due to ambiguity induced by the red amplicon since the WGS data shows a red-blue-yellow-red-orange-green-red cycle.
If the CSV file of Hi-C interactions is provided, HolistIC will attempt to segregate each amplicon (i.e. row in BED file) into its true double minute. In the ideal case, HolistIC would have the following output for the above input BED file (written to [input file name].disambig.bed):
chr1 8000000 8900000 DM1
chr3 7000000 7400000 DM1
chr5 3000000 3300000 DM1
chr6 17000000 17800000 DM2
chr18 12000000 13000000 DM2
chr5 3000000 3300000 DM2
Furthermore, HolistIC will create PNG image files for each identified maximal clique, which ostensibly refer to separate double minutes.