Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MACS2 stalls pre-computing pvalue-qvalue table.. #101

Closed
jma1991 opened this issue Nov 6, 2015 · 10 comments
Closed

MACS2 stalls pre-computing pvalue-qvalue table.. #101

jma1991 opened this issue Nov 6, 2015 · 10 comments

Comments

@jma1991
Copy link

jma1991 commented Nov 6, 2015

Hello,

MACS2 stalls at the following line:

#3 Pre-compute pvalue-qvalue table...

I have two BAM files, one converted to BAM from Bowtie2 output and the other which has underwent some processing. Here is an example alignment from the raw BAM file:

HISEQ:127:C7HJ7ANXX:1:1312:20025:49343  163 chr1    10069   30  50M =   10075   56  ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:0  YS:i:0  YT:Z:CP

and here is an example from the processed BAM file:

HISEQ:127:C7HJ7ANXX:1:1312:20025:49343/1    16  chr1    10070   255 50M *   0   0   *   *

MACS2 stalls when I input the processed BAM file but not the raw file. The processed BAM file was created by converting the raw BAM to a BED file, processing the data with AWK, and then converting back to a BAM file (both conversions done using BEDtools).

I call peaks using the following command:

macs2 callpeak -t sample.bam -f BAMPE -g hs -s 50 -n sample -q 0.01 --nomodel --nolambda --keep-dup all --call-summits

Thank you

@taoliu
Copy link
Contributor

taoliu commented Dec 22, 2015

Is it related to #86?

@pveber
Copy link

pveber commented Mar 1, 2017

I confirm there is still a problem in current master (9ddacb0). Using the following SAM input file:

FCC5W8BACXX:1:1101:16503:36846#ATCCAAGC/2       16      chr2L   22700622        44      100M    *       0       0       CGAGATCAACGCACAACTCGAGGTCGAAATCGAATTCGGCAAAAGACGACTTCGCATAAACATCCTAATACTACCAGGGGTGGAATTTCCTCACCCCGAA    ^b`_YJTG[ab^YSGS]a^Ya^^[X]_]]ca_\V^cdcdeedddef_\Zeabgggcfbgece_aaehgcaaaYdffaghgbfde_^cgeebWO`ccc__^    AS:i:200        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU

and the following command line:

macs2 callpeak --outdir delme --name macs2 --format SAM --nomodel --treatment input.sam

macs2 get stalled at step "Pre-compute pvalue-qvalue table".

@jcohn42
Copy link

jcohn42 commented Aug 9, 2022

I found this thread when troubleshooting run times for jobs that I had running, where MACS2 was part of an analysis pipeline to process ChIPseq data. I looked at the error files as the jobs had been running for several days, which was odd. Tail of the error files showed the jobs were hung at the same step in MACS2 reported here - issue #101 and #86.

INFO @ Fri, 29 Jul 2022 10:03:32: #3 Call peaks...
INFO @ Fri, 29 Jul 2022 10:03:32: #3 Pre-compute pvalue-qvalue table...

I've tried using macs2 --version 2.2.6 and 2.2.7.1
using a HPC grid, and tried using 1,4, and 8 slots, expanding buffer, requested nodes with > 10GB RAM

The odd part is that when I ran MACS2 with all default parameters the jobs finished in a few hours.
macs2 callpeak -f BAM -t /pathtomyfile/hisat2.sorted.bam -c /pathtomyfile/control_hisat2.sorted.bam -g 1e-9 -n sample1

However, when I add the the following parameters the program is hung at the indicated steps:
--nomodel --extsize 147

or

--nomodel --extsize 147 --broad --broad-cutoff 0.1

Single threads being used seem to be computing efficiently at 100%, but the jobs simply spin for up to 7 days. These run times seem way too long, as file read in for macs2 is super fast - 8.5G BAM files load and are read in (steps 1-3) ~ 10 minutes.

So, this seems like a bug rather than a compute environment issues. Do you have suggestions on potential fixes?

One other note - I recently used the same versions of MACS2 on the same compute system on ATACseq data and used similar --no-model parameters, but only had treatment files, no control input files. These jobs used even larger files and finished within 12-24 hours.

@jcohn42
Copy link

jcohn42 commented Aug 17, 2022

A follow up regarding this issue - I found that I get different results with different files. I tested pre-processed BED files from SRA - also ChIPseq data, and macs2 or macs3 completed peak calling in about 30 minutes, no issues. However, I downloaded the raw data from SRA and processed with my pipeline and I get the same issue - the process hangs at Pre-compute pvalue-quvalue table.. My pipeline uses trimmomatic to trim illimina adaptors, hisat2 alignment (-p 8 -k 1 --no-spliced-alignment), bamsormadup to remove duplicates and sort bam output. macs2 doesn't throw and error, but just hangs. Any ideas would be much appreciated.

@taoliu
Copy link
Contributor

taoliu commented Aug 17, 2022

@jcohn42 Is it possible that the raw data processed from SRA contains many random chromosomes? The process hangs at p-q table calculation is an old issue I dealt with years ago where when the hash table (we used khash) becomes too big. Please let me know how it goes with macs3 since we are working on it. Another fix/idea I got is that maybe we should provide an option to skip converting p-value to q-value entirely to save time because people right now have better ways to control the false positive rate such as IDR on ranked peaks based on p-value or foldchange.

@jcohn42
Copy link

jcohn42 commented Aug 17, 2022

@taoliu, yes that is right, the SRA data does have many chromosomes (10 maize chromosomes and a few dozed contigs). They are present in the pre-processed BED files downloaded from SRA, I used the same reference genome for aligning the raw data using hisat2, so the bam files resulting have the same chromosomes. The only major difference I noted is that the external BED files seem to have been pre-filtered for alignment quality; $cut -5 |sort -u gives single value of 255. hisat2 uses different quality measures, but I could try filtering on 60. I don't think macs2 uses the quality information, though, is that right? I've tried macs2 and macs3 - both get hung on the same step. I agree, q values are great to have, but one can filter on p values, or use IDR or combine replicates with MSPC, etc. Since my original data was PE I also tried -f BAMPE and -f BAM on the deduplicated BAM files, but both get stuck at the same place. I could try to filter input files for only 'real' chromosome data, but I've already tried filtering out a single chromosome of data for troubleshooting, and I have the same issue.

@taoliu
Copy link
Contributor

taoliu commented Aug 17, 2022

@jcohn42 Could you share me the SRA accession number so that I can give it a try?

@jcohn42
Copy link

jcohn42 commented Aug 17, 2022

@taoliu The files come from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA492464
GSM3397995 Chip B73 leaf H3K56ac rep1 SRR7889772
GSM3398001 Chip B73 leaf input rep1 SRR7889778

Here is my macs3 call with the processed BED files that worked beautifully, completing within 30 minutes
macs3 callpeak -f BED -t /scratch-large/5-biannual/COH3780_chip/TEST_ext/GSM3397995_mapped_chip_B73_leaf_H3K56ac_rep1.bed -c /scratch-large/5- biannual/COH3780_chip/TEST_ext/GSM3398001_mapped_chip_B73_leaf_input_rep1.bed -g 2100000000 -n TESText --outdir /scratch-large/5-biannual/COH3780_chip/TEST_ext --nomodel --extsize 147

FASTQ files were aligned to B73_v4 genome, an internal copy, but should be very similar to B73 RefGen_v4, AGPv4, available from here: https://www.maizegdb.org/assembly. Note, there is a newer genome, but this was the genome used to generate the processed files tested.

I used the same macs3 call, with the change "-f BAM" and -t <deduplicated, sorted bam files for test> -c <deduplicated, sorted bam file for input". No errors, but still paused on #3 Pre-compute pvalue-qvalue table..., running now ~ 20 hours.

Do you have recommendations for buffer size, # slots, node parameters, etc? I've requested > 10G RAM, which should be fine, I think?

@jcohn42
Copy link

jcohn42 commented Aug 23, 2022

@taoliu, can you clarify if the problem you described was 'fixed' "The process hangs at p-q table calculation is an old issue I dealt with years ago where when the hash table (we used khash) becomes too big." What was the fix, is there some work around, like adjusting input file sizes, processing one chromosome at a time, etc, that I could use to get MACS2 (or MACS3) working for my data? Are there particular filters of aligned reads that should be used?

Thanks for your recommendations and help.

@jcohn42
Copy link

jcohn42 commented Aug 26, 2022

@taoliu I discovered the problem - it was an issue with my genome size input - a typo! Thanks again for your replies; btw, MACS3 is also working quite well with my data and the external data. Thanks for providing a very useful tool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants