Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQSR can produce tables with missing ReadGroups #6242

Open
lbergelson opened this issue Nov 4, 2019 · 12 comments
Open

BQSR can produce tables with missing ReadGroups #6242

lbergelson opened this issue Nov 4, 2019 · 12 comments
Assignees

Comments

@lbergelson
Copy link
Member

BaseRecalibrator can create a recalibration table which doesn't contain all the readgroups in the input sample. Then when ApplyBQSR is run on the same file with that table it crashes with an exception.

ex:

java.lang.IllegalStateException: The covariates table is missing ReadGroup GROUP_1 in RecalTable0

This can happen because BaseRecalibrator used a stricter set of filters than ApplyBQSR. If all the reads in a read group are filtered in BQSR then that read group will be missing from the recalibration table and fail when running ApplyBQSR. A possible example is a readgroup that only has duplicates or secondary reads.

@lbergelson lbergelson added this to the GATK-Priority-Backlog milestone Nov 4, 2019
@lbergelson
Copy link
Member Author

@droazen droazen modified the milestones: GATK-Priority-Backlog, GATK-Triaged-Issues Nov 18, 2019
@shengzha
Copy link

Is the issue fixed? I ran into the same error and after checking out GATK from GitHub and with the latest build (commit 031c407, gatk-4.1.4.1-83-g031c407-SNAPSHOT), the problem persisted.

@cmnbroad
Copy link
Collaborator

@shengzha Not yet, though I am working on it (this ticket will transition to the closed state when a fix has been merged). In the meantime if you have a small test case that illustrates the problem, please add it here, as that would help ensure the fix addresses the specific issue you're seeing. Thanks.

1 similar comment
@cmnbroad
Copy link
Collaborator

@shengzha Not yet, though I am working on it (this ticket will transition to the closed state when a fix has been merged). In the meantime if you have a small test case that illustrates the problem, please add it here, as that would help ensure the fix addresses the specific issue you're seeing. Thanks.

@shengzha
Copy link

@cmnbroad Thank you so much for the reply. I don't have a small test case for you, but I can provide some other information.
It is RNA seq data and passes validation check (java -jar picard.jar ValidateSamFile I=S3_2.unmapped.split.bam MODE=SUMMARY).
BaseRecalibrator cmd:
gatk BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam --use-original-qualities -O S3_2.unmapped.recal_data.csv -known-sites Homo_sapiens_assembly38.dbsnp138.vcf -known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
ApplyBQSR cmd:
gatk ApplyBQSR -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam -O S3_2.unmapped.aligned.duplicates_marked.recalibrated.bam -bqsr S3_2.unmapped.recal_data.csv --add-output-sam-program-record --use-original-qualities
RecalTables in S3_2.unmapped.recal_data.csv are empty.

Here is the screen dump of BaseRecalibrator and ApplyBQSR.
BaseRecalibrator

Using GATK jar <XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar <XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I S3_2.unmapped.split.bam --use-original-qualities -O S3_2.unmapped.recal_data.csv -known-sites Homo_sapiens_assembly38.dbsnp138.vcf -known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
23:39:34.668 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:<XXX>/gatk-4.1.4.1-83-g031c407-SNAPSHOT/gatk-package-4.1.4.1-83-g031c407-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 26, 2020 11:39:34 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
23:39:34.915 INFO  BaseRecalibrator - ------------------------------------------------------------
23:39:34.915 INFO  BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.4.1-83-g031c407-SNAPSHOT
23:39:34.915 INFO  BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
23:39:34.915 INFO  BaseRecalibrator - Executing as <XXX@XXX> on Linux v3.10.0-957.12.1.el7.x86_64 amd64
23:39:34.915 INFO  BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
23:39:34.916 INFO  BaseRecalibrator - Start Date/Time: February 26, 2020 11:39:34 PM EST
23:39:34.916 INFO  BaseRecalibrator - ------------------------------------------------------------
23:39:34.916 INFO  BaseRecalibrator - ------------------------------------------------------------
23:39:34.916 INFO  BaseRecalibrator - HTSJDK Version: 2.21.2
23:39:34.916 INFO  BaseRecalibrator - Picard Version: 2.21.9
23:39:34.916 INFO  BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:39:34.916 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:39:34.916 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:39:34.916 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:39:34.916 INFO  BaseRecalibrator - Deflater: IntelDeflater
23:39:34.917 INFO  BaseRecalibrator - Inflater: IntelInflater
23:39:34.917 INFO  BaseRecalibrator - GCS max retries/reopens: 20
23:39:34.917 INFO  BaseRecalibrator - Requester pays: disabled
23:39:34.917 INFO  BaseRecalibrator - Initializing engine
23:39:35.461 INFO  FeatureManager - Using codec VCFCodec to read file file://<XXX>/Homo_sapiens_assembly38.dbsnp138.vcf
23:39:35.595 INFO  FeatureManager - Using codec VCFCodec to read file file://<XXX>/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
23:39:35.721 INFO  FeatureManager - Using codec VCFCodec to read file file://<XXX>/Homo_sapiens_assembly38.known_indels.vcf.gz
23:39:35.896 INFO  BaseRecalibrator - Done initializing engine
23:39:35.913 INFO  BaseRecalibrationEngine - The covariates being used here:
23:39:35.913 INFO  BaseRecalibrationEngine - 	ReadGroupCovariate
23:39:35.913 INFO  BaseRecalibrationEngine - 	QualityScoreCovariate
23:39:35.913 INFO  BaseRecalibrationEngine - 	ContextCovariate
23:39:35.914 INFO  BaseRecalibrationEngine - 	CycleCovariate
23:39:35.965 INFO  ProgressMeter - Starting traversal
23:39:35.965 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Reads Processed     Reads/Minute
23:40:38.767 INFO  BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: WellformedReadFilter
62393454 total reads filtered
23:40:38.769 INFO  ProgressMeter -             unmapped              1.0                     0              0.0
23:40:38.769 INFO  ProgressMeter - Traversal complete. Processed 0 total reads in 1.0 minutes.
23:40:38.770 INFO  BaseRecalibrator - Calculating quantized quality scores...
23:40:38.784 INFO  BaseRecalibrator - Writing recalibration report...
23:40:38.821 INFO  BaseRecalibrator - ...done!
23:40:38.822 INFO  BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads
23:40:38.822 INFO  BaseRecalibrator - Shutting down engine
[February 26, 2020 11:40:38 PM EST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 1.07 minutes.
Runtime.totalMemory()=3622305792
Tool returned:
SUCCESS

ApplyBQSR

23:41:16.951 INFO  ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.1.4.1-83-g031c407-SNAPSHOT
23:41:16.952 INFO  ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
23:41:16.952 INFO  ApplyBQSR - Executing as <XXX@XXX> on Linux v3.10.0-957.12.1.el7.x86_64 amd64
23:41:16.952 INFO  ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-b08
23:41:16.952 INFO  ApplyBQSR - Start Date/Time: February 26, 2020 11:41:16 PM EST
23:41:16.952 INFO  ApplyBQSR - ------------------------------------------------------------
23:41:16.952 INFO  ApplyBQSR - ------------------------------------------------------------
23:41:16.953 INFO  ApplyBQSR - HTSJDK Version: 2.21.2
23:41:16.953 INFO  ApplyBQSR - Picard Version: 2.21.9
23:41:16.953 INFO  ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:41:16.953 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:41:16.953 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:41:16.953 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:41:16.953 INFO  ApplyBQSR - Deflater: IntelDeflater
23:41:16.953 INFO  ApplyBQSR - Inflater: IntelInflater
23:41:16.953 INFO  ApplyBQSR - GCS max retries/reopens: 20
23:41:16.953 INFO  ApplyBQSR - Requester pays: disabled
23:41:16.953 INFO  ApplyBQSR - Initializing engine
23:41:17.460 INFO  ApplyBQSR - Done initializing engine
23:41:17.527 INFO  ProgressMeter - Starting traversal
23:41:17.527 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Reads Processed     Reads/Minute
23:41:17.849 INFO  ApplyBQSR - Shutting down engine
[February 26, 2020 11:41:17 PM EST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2411724800
java.lang.IllegalStateException: The covariates table is missing ReadGroup S3_2 in RecalTable0
	at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:752)
	at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.keyForReadGroup(ReadGroupCovariate.java:81)
	at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.recordValues(ReadGroupCovariate.java:53)
	at org.broadinstitute.hellbender.utils.recalibration.covariates.StandardCovariateList.recordAllValuesInStorage(StandardCovariateList.java:133)
	at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:546)
	at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:527)
	at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:145)
	at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:27)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
	at org.broadinstitute.hellbender.Main.main(Main.java:292)

@shengzha
Copy link

@cmnbroad And here is the output of BaseRecalibrator. S3_2.unmapped.recal_data.csv.zip

@cmnbroad
Copy link
Collaborator

@shengzha In your case it looks like ApplyBQSR is failing because ALL of your input reads are being filtered out before BQSR ever sees them, due to having mapping quality 0:

23:40:38.767 INFO BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter
...
23:40:38.822 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads

@shengzha
Copy link

@cmnbroad Thanks, I didn't notice that. I will check and get back to you later.

@shengzha
Copy link

The root cause might be irrelevant but I am still posting it in case someone is in a similar situation. I was running the GATK workflow https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels on a local machine and it turned out that STAR alignment

STAR \
		--genomeDir STAR2_5 \
		--runThreadN ${threads} \
		--readFilesIn ${fastq1} ${fastq2} \
		--readFilesCommand "gunzip -c" \
		${"--sjdbOverhang "+(read_length-1)} \
		--outSAMtype BAM SortedByCoordinate \
		--twopassMode Basic \
		--limitBAMsortRAM ${star_mem+"000000000"} \
		--limitOutSJcollapsed ${default=1000000 star_limitOutSJcollapsed} \
		--outFileNamePrefix ${base_name}.

was not working fine because --readFilesCommand "gunzip -c" is not supported locally.
https://groups.google.com/forum/#!topic/rna-star/EkTo556lOMk this post helps.

@droazen droazen removed this from the GATK-Triaged-Issues milestone Jun 22, 2020
@LvLH
Copy link

LvLH commented Jul 16, 2020

@shengzha In your case it looks like ApplyBQSR is failing because ALL of your input reads are being filtered out before BQSR ever sees them, due to having mapping quality 0:

23:40:38.767 INFO BaseRecalibrator - 62393454 read(s) filtered by: MappingQualityNotZeroReadFilter
...
23:40:38.822 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads

Hi, I also encounted the trable. And it looks like ApplyBQSR is failing is truelly caused by 0 reads after filtered in BQSR processing. I solved my problem in BQSR to generate recal_table.
So maybe my experience is helpful to you or others.
Q: "on the ApplyBQSR step with an error stating "The covariates table is missing ReadGroup xxx in RecalTable0"
A: check up reads remained in BQSR, such as filtering by regions: --intervals xxx.bed, and no reads in target regions.,and other filtering situations.

@DuyDN
Copy link

DuyDN commented Nov 2, 2021

I still have the same issues, what can we do to get the Number of read groups must be >= 1.
Thanks

@user-tq
Copy link

user-tq commented Oct 15, 2022

I have the same issues,but finaly I found that I made a mistake about
bwa -R "@RG\tID:${x} \tPL:Illumina\tLB:${x}\tSM:${x}\tPG:bwa"
There is a space after the ID.
in fact ,result of gatk BaseRecalibrator Contains the space.
but when gatk ApplyBQSR read the table,
it will trim trailing spaces.
different names cause errors(I guess)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants