VCF: "Genotype fields" vs "FORMAT" and per-sample #738

Closed
jkbonfield opened this issue Aug 2, 2023 · 1 comment

@jkbonfield
Contributor

A minor technicality.

The VCF specification describes 8 fixed fields (CHROM to INFO) followed by Genotype data.

This feels off to me as it's describing what the FORMAT/sample data has traditionally encoded rather than describing the actual format of the VCF file. The file format is to have a bunch of keys ("FORMAT") and a set of values per sample. They don't have to encode genotype data at all, and generally most don't.
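To illustrate the "keys plus per-sample values" view, here is a minimal sketch in plain Python. The function name `split_record`, the sample names, and the example record are all made up for illustration, and it assumes tab-separated columns with colon-separated FORMAT keys and sample values. Note that the example record carries no GT key at all.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def split_record(line, sample_names):
    """Split one VCF data line into fixed fields and per-sample key/value maps.

    Hypothetical helper for illustration only; not taken from any library.
    """
    cols = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, cols[:8]))
    if len(cols) <= 8:
        # No FORMAT column, hence no per-sample data.
        return fixed, []
    keys = cols[8].split(":")  # the FORMAT column is just a list of keys
    samples = []
    for name, column in zip(sample_names, cols[9:]):
        values = column.split(":")
        # Assumption: trailing values omitted from a sample column are
        # treated as missing ('.').
        values += ["."] * (len(keys) - len(values))
        samples.append((name, dict(zip(keys, values))))
    return fixed, samples

# Example record whose per-sample data is depth information, not genotypes.
record = "chr1\t100\t.\tA\tT\t50\tPASS\tDP=21\tDP:AD\t12:7,5\t9"
fixed, samples = split_record(record, ["sampleA", "sampleB"])
for name, values in samples:
    print(name, values)
```

Nothing in this parsing step depends on the values being genotypes, which is the point being made above.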

Also related to the description of columns, it would be helpful if the fixed 8 fields documented whether they are mandatory or not. My initial assumption was that they obviously are, but HTSlib's VCF parser (currently) handles files where a record has e.g. 4 values only. The others get treated as the "missing" value, so it feels like a deliberate mechanism. Picard rejects such data, which is more logical to me. However the specification doesn't explicitly state that the fixed columns must all be present, even if it feels like the most obvious interpretation.
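The two behaviours can be contrasted with a small sketch. This is plain Python written for illustration: `parse_fixed_fields` and its `strict` flag are hypothetical and only mimic the behaviour described above. It is not how HTSlib or Picard are actually implemented.

```python
FIXED_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

def parse_fixed_fields(line, strict=True):
    """Return the 8 fixed fields of a data line as a dict.

    strict=True  rejects short records (the behaviour attributed to Picard).
    strict=False pads absent columns with '.', the "missing" value
                 (the lenient behaviour described for HTSlib).
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(FIXED_COLUMNS):
        if strict:
            raise ValueError(
                f"expected {len(FIXED_COLUMNS)} fixed columns, got {len(cols)}"
            )
        cols = cols + ["."] * (len(FIXED_COLUMNS) - len(cols))
    return dict(zip(FIXED_COLUMNS, cols[: len(FIXED_COLUMNS)]))

short_record = "chr1\t100\t.\tA"  # only 4 of the 8 fixed columns

print(parse_fixed_fields(short_record, strict=False))
try:
    parse_fixed_fields(short_record, strict=True)
except ValueError as err:
    print("rejected:", err)
```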

@jkbonfield jkbonfield added the vcf label Aug 2, 2023
@jkbonfield jkbonfield moved this to New items in GA4GH File Formats Aug 17, 2023
@jkbonfield jkbonfield moved this from New items to To do (backlog) in GA4GH File Formats Aug 22, 2023
@jkbonfield
Contributor Author

> Also related to the description of columns, it would be helpful if the fixed 8 fields documented whether they are mandatory or not.

Doh! Ignore me. While the "Data lines" section just describes fixed columns without being explicit about them being mandatory, the previous "Header line syntax" section does in fact state "8 fixed, mandatory columns". So I just didn't spot it. Apologies for that part of this issue.

Although, being ultra nit-picky, a related question is whether the whole header line itself is in fact mandatory! Given data doesn't have to have FORMAT and samples, it could be argued to be superfluous in that scenario. Coupled with the fact that we are quite clear in stating that the fileformat line is required (and first), it may imply that not stating that something is required means it is not a hard requirement. Although frankly you'd be heroic to assume this, and both htslib and htsjdk sensibly have it as a hard rule.

The only saving grace is the structure of the file is listed as meta-information lines, a header line, and data lines. However a file without data lines is valid, so that in and of itself doesn't dictate these fields are mandatory. Very minor though!
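For concreteness, the "hard rule" reading could be sketched as below. This is an illustrative check in plain Python: `structure_problems` is a made-up name and does not reflect the actual htslib or htsjdk code. It assumes the header line is the single line starting with one `#`, and it treats a file with no data lines as valid.

```python
MANDATORY_HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def structure_problems(lines):
    """Return a list of structural problems found in a VCF given as lines."""
    problems = []
    # The fileformat line is explicitly required and must come first.
    if not lines or not lines[0].startswith("##fileformat="):
        problems.append("first line must be ##fileformat")
    # Header line: starts with a single '#', unlike '##' meta-information lines.
    headers = [l for l in lines if l.startswith("#") and not l.startswith("##")]
    if len(headers) != 1:
        problems.append("expected exactly one header line")
    elif headers[0].rstrip("\n").split("\t")[:8] != MANDATORY_HEADER:
        problems.append("header line lacks the 8 fixed, mandatory columns")
    return problems

header_line = "\t".join(MANDATORY_HEADER)

# Meta-information plus header, no data lines: accepted by this check.
print(structure_problems(["##fileformat=VCFv4.3", header_line]))

# Data but no header line: flagged under the "hard rule" reading.
print(structure_problems(["##fileformat=VCFv4.3", "chr1\t100\t.\tA\tT\t.\t.\t."]))
```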

@github-project-automation github-project-automation bot moved this from To do (backlog) to Done in GA4GH File Formats Apr 20, 2024