-
Notifications
You must be signed in to change notification settings - Fork 457
Add support for VCFv4.4 / VCFv4.5 Number= fields #1874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
I dealt with the 2nd of your TODO action items. The first (tests) turns out to be irrelevant currently as there are no checks in the code either, so all test files pass regardless. It needs upgrading to add proper record validation, but that will come with a CPU cost so it needs more thought as to how to do this. Best done in a following PR I suspect. My header validation permits Number=. as I see this used in the wild in numerous places. Indeed Bcftools own tests also fail on this all over the place, commonly having AD defined as Bcftools still has some warnings with the extra checks: test/setGT.5.vcf has We probably need @pd3 to look at these and comment whether these are widely used alternatives that fly against the specification. Technically a tool could look at We can omit such things from validation if they are widely accepted spec breakages. Note this doesn't produce errrors and the function has void return. It won't spam either as each error is reported at maximum once (not even once per file, but once per use of htslib). |
Note I would like to be more aggressive on the newer 4.4 and 4.5 types, like LPL needing We could add a version header check perhaps so we only validate on specific versions. I'll look into seeing how easy that is to do. It'll make things a bit messier though. Part of me thinks maybe just removing the |
6361912
to
342611f
Compare
Adds BCF_VL_* types for Number= P, LA, LG, LR and M Upgrades bcf_hdr_register_hrec() so that it sets the appropriate value when it finds one of the new types in a FORMAT header line. Only set number type BCF_VL_FIXED if the value can actually be parsed as an integer. Unknown types will now cause a warning to be printed and will be treated as BCF_VL_VAR. Adds doxygen style comments to describe the meaning of the various BCF_VL_* values.
Note I haven't added the myraid of base modification tags as there are so many of them.
Previously we were gracious in accepting Number=. as it turns in in several places in the wild. However for VCF 4.4 and 4.5 we're more strict as these aren't common place yet and we can complain louder about errors. The small caveat here is using 4.4 and 4.5 defined tags in an earlier VCF version. This happens when for example people wish to use local alleles. They've been used for many years, but only finally landed in 4.5. Hence we see VCF 4.3 using these tags and Number=. as the correct type code didn't exist in that version. Also added phase-set tests which were accidentally missed on the previous commit.
Local alleles work similarly to global ones, except the set of alleles present can vary per sample. The mapping from local to global numbering is stored in the LAA field. Removing an allele can change both the global numbering and also local numbering if one of the entries referred to the delete allele. If LAA data is present, it is updated to reflect the new global numbering if that has changed due to an allele being removed. At the same time a map is constructed storing data about removed or renumbered local entries, in a similar way to the global map[] array; and per-sample ref+alt counts are recorded to use when calculating LR, LA and LG Number= fields. Both the map[] and laa_map[] arrays store -1 for removed alleles, allowing them to be used instead of kbs_exists() when checking which items to delete. It's then possible to use the same code to update global Number= A,R,G and local Number= LA,LR,LG data by either using the global map[] array or the local laa_map[] entries for the current sample.
Ensure that missing values are reported correctly.
The tags CICN, CIEND, CILEN, CIPOS, MEINFO, and METRANS are defined as Number=. in the VCF 4.4+ specification, but with the additional rule that they're really either 2 x Number=A for the CI* tags or 4 x Number=A for the ME* ones. Handle this by spotting them, and setting the multiple we expect to see so values are treated as groups of two or four.
342611f
to
af274ef
Compare
Rebased complete with updates to make |
af3afd6
to
fa988d1
Compare
vcfutils.c
Outdated
if (!new_info) | ||
return -1; | ||
memcpy(new_info, buf, new_len); | ||
memcpy(new_info + new_len, buf, info->vptr_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be buf or info->vptr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should be fixed now.
Tandem repeats are described using a mini-format of INFO tags, namely RN, RUS, RUL, RB, RUC, CIRB, CIRUC and RUB. The number of items in RUS, RUL, RB, CIRB, RUC and CIRUC depend on the values in RN. The number of items in RUB depend on the values in RUC (and therefore also RN). Removing an allele means we have to carefully update these, ensuring it's done in the right order so the information needed by a dependent tag is still present when it is processed. Removing tag values is done by shifting the data pointed to by info->vptr around where necessary. While more complicated than using bcf_get_info_values() followed by bcf_update_info(), it should be much faster as in most cases no memory allocations will be needed to make the desired changes.
fa988d1
to
0841858
Compare
Also fix LPP typo of type "LP" instead of "LG".
-1 is now a valid value in map[]
We need to check against the original allele count, not the updated one.
If all alleles referenced by LAA tags in a VCF record are removed, the MISSING value needs to be stored for each sample. This stops bcf_update_format_int32() from removing the LAA tag completely, which would invalidate other tags like LAD and LPL that depend on LAA being present.
If all the alleles referenced by CNV:TR related INFO tags are removed, the INFO tags will no longer have any values and so also need to be removed.
Added a few fixes:
|
Adds
BCF_VL_*
types for Number=P
,LA
,LG
,LR
andM
, and updatesbcf_hdr_register_hrec()
so that it sets them.Note that this changes handling of unrecognised
Number=
values. Previously htslib would attempt to treat these as numbers, and as it did not check forsscanf()
failure they were stored as typeBCF_VL_FIXED
with count0xfffff
. This commit makes it print a warning and store the type asBCF_VL_VAR
instead.TODO:
bcf_hdr_check_sanity()
bcf_remove_allele_set()
work with local alleles