Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SelectVariants rsID matching #5260

Closed
vdauwera opened this issue Oct 4, 2018 · 2 comments
Closed

Improve SelectVariants rsID matching #5260

vdauwera opened this issue Oct 4, 2018 · 2 comments
Assignees
Labels
enhancement learn GATK Suitable for GATK beginners

Comments

@vdauwera
Copy link
Contributor

vdauwera commented Oct 4, 2018

Feature request

Tool(s) or class(es) involved

SelectVariants (all versions -- tested most recently with 4.0.9.0)

Description

SelectVariants has an option for excluding specific sites by rsID, --exclude-ids. Currently the matching is done by exact string matching. The limitation is that if I have a site with two rsIDs (as appears in the 1000 Genomes dataset), I can't filter against just one of them. For example, if I want to exclude this site from 1000G participant HG00096 (because it makes my synthetic data generator choke, if you must know):

3523879:16	68401338	rs554836707;rs56090907	CGTGCGCTGCT	CTGCT,C	100	PASS <snip>

I have to specify --exclude-ids 'rs554836707;rs56090907' in my SelectVariants command. I would like to be able to specify just --exclude-ids rs554836707 or --exclude-ids rs56090907 with the same result (the site is excluded). Right now if I do use just one like that, the site is not excluded because the pattern fails to match.

The reason I want to be able to do that is because there are several other similar sites that I need to exclude, so I make an array of rsIDs to exclude, e.g. provide

Array[String] = [
    "rs575450111",
    "rs151266607",
    "rs557010312",
    "rs564559134",
    "rs560466806",
    "rs140802196",
    "rs554836707",
    "rs56090907"
]

as an input to my WDL, which handles it in the command block as

--exclude-ids ${sep=' --exclude-ids ' exclusionList}

Right now I have to use

[
"'rs575450111;rs151266607'",
"'rs557010312;rs564559134'",
"'rs560466806;rs140802196'",
"'rs554836707;rs56090907'"
]

which is stylistically ugly and would not cover cases where only one rsID is annotated for whatever reason.

@vdauwera vdauwera added enhancement learn GATK Suitable for GATK beginners labels Oct 4, 2018
@davidbenjamin
Copy link
Contributor

@MartonKN It looks like the relevant code is lines 713-5 of SelectVariants:

if (rsIDsToRemove != null && !rsIDsToRemove.isEmpty()) {
            compositeFilter = compositeFilter.and(new CountingVariantFilter(new VariantIDsVariantFilter(rsIDsToRemove).negate()));
}

and in VariantIDsVariantFilter

public boolean test(final VariantContext vc) {
        return includeIDs.contains(vc.getID());
    }

This seems like it should do the right thing, except that vc.getID() is a String, not a List<String>. @vdauwera can we assume that a VariantContext ID will not contain a semicolon, and therefore we can parse by splitting on ';'?

MartonKN added a commit that referenced this issue Nov 30, 2018
* Changed SelectVariants so that it van handle multiple rsIDs separated by ';' in a VCF file.

* The tests testKeepSelectionIDLiteral and testKeepSelectionIDFromFile broke due to the change in the test file complexExample1.vcf. I modified the file testSelectVariants_KeepSelectionID.vcf appropriately so that the tests pass as they should.

* Changes due to review by David Benjamin and Phil Shapiro.

* Made the code in VariantIDsVariantFilter's test function more concise.
@davidbenjamin
Copy link
Contributor

Closed by #5464.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement learn GATK Suitable for GATK beginners
Projects
None yet
Development

No branches or pull requests

3 participants