Skip to content

Conversation

daviesrob
Copy link
Member

Adds BCF_VL_* types for Number= P, LA, LG, LR and M, and updates bcf_hdr_register_hrec() so that it sets them.

Note that this changes handling of unrecognised Number= values. Previously htslib would attempt to treat these as numbers, and as it did not check for sscanf() failure they were stored as type BCF_VL_FIXED with count 0xfffff. This commit makes it print a warning and store the type as BCF_VL_VAR instead.

TODO:

  • Add tests
  • Check more header types in bcf_hdr_check_sanity()
  • Make bcf_remove_allele_set() work with local alleles

@jkbonfield
Copy link
Contributor

I dealt with the 2nd of your TODO action items. The first (tests) turns out to be irrelevant currently as there are no checks in the code either, so all test files pass regardless. It needs upgrading to add proper record validation, but that will come with a CPU cost so it needs more thought as to how to do this. Best done in a following PR I suspect.

My header validation permits Number=. as I see this used in the wild in numerous places. Indeed Bcftools own tests also fail on this all over the place, commonly having AD defined as Number=. for example.

Bcftools still has some warnings with the extra checks:

test/setGT.5.vcf has ##FORMAT=<ID=AD,Number=A,Type=Integer,Description="Alleles"> (should be "R") and test/view64bit.5.vcf has ##INFO=<ID=END,Number=A,Type=Integer,Description="dummy"> (should be "1").

We probably need @pd3 to look at these and comment whether these are widely used alternatives that fly against the specification. Technically a tool could look at AD,Number=A and derive the REF depth hasn't been reported, but the spec doesn't formally permit such nuance. Equally so it's possible that with multiple alleles END may have multiple values. It's not permitted, but perhaps that's an oversight in the spec and people have been doing this for years?

We can omit such things from validation if they are widely accepted spec breakages.

Note this doesn't produce errrors and the function has void return. It won't spam either as each error is reported at maximum once (not even once per file, but once per use of htslib).

@jkbonfield
Copy link
Contributor

Note I would like to be more aggressive on the newer 4.4 and 4.5 types, like LPL needing Number=LG, but ironically in Gnomad (and probably others) these are the cases where we use Number=. to avoid problems with parsers not supporting 4.5.

We could add a version header check perhaps so we only validate on specific versions. I'll look into seeing how easy that is to do. It'll make things a bit messier though.

Part of me thinks maybe just removing the BCF_VL_VAR loop-hole and causing more warnings to appear may just be the right thing. The data is incorrect according to the spec and the consequence is nothing more than a one line warning. Is it not correct that people are told their data is wrongly formatted, as long as we still continue with best-efforts as we've always done? However for the time being and in the interests of keeping the peace it's being lenient as people have got away with this in the past.

@jkbonfield jkbonfield force-pushed the new_number_types branch 2 times, most recently from 6361912 to 342611f Compare April 17, 2025 14:40
daviesrob and others added 10 commits September 1, 2025 10:05
Adds BCF_VL_* types for Number= P, LA, LG, LR and M

Upgrades bcf_hdr_register_hrec() so that it sets the appropriate
value when it finds one of the new types in a FORMAT header line.

Only set number type BCF_VL_FIXED if the value can actually be
parsed as an integer.  Unknown types will now cause a warning
to be printed and will be treated as BCF_VL_VAR.

Adds doxygen style comments to describe the meaning of the various
BCF_VL_* values.
Note I haven't added the myraid of base modification tags as there are
so many of them.
Previously we were gracious in accepting Number=. as it turns in in
several places in the wild.  However for VCF 4.4 and 4.5 we're more
strict as these aren't common place yet and we can complain louder
about errors.

The small caveat here is using 4.4 and 4.5 defined tags in an earlier
VCF version.  This happens when for example people wish to use local
alleles.  They've been used for many years, but only finally landed in
4.5.  Hence we see VCF 4.3 using these tags and Number=. as the
correct type code didn't exist in that version.

Also added phase-set tests which were accidentally missed on the
previous commit.
Local alleles work similarly to global ones, except the set
of alleles present can vary per sample.  The mapping from local
to global numbering is stored in the LAA field.  Removing an
allele can change both the global numbering and also local
numbering if one of the entries referred to the delete allele.

If LAA data is present, it is updated to reflect the new global
numbering if that has changed due to an allele being removed.  At
the same time a map is constructed storing data about removed
or renumbered local entries, in a similar way to the global map[]
array; and per-sample ref+alt counts are recorded to use when
calculating LR, LA and LG Number= fields.

Both the map[] and laa_map[] arrays store -1 for removed alleles,
allowing them to be used instead of kbs_exists() when checking
which items to delete.  It's then possible to use the same code
to update global Number= A,R,G and local Number= LA,LR,LG
data by either using the global map[] array or the local laa_map[]
entries for the current sample.
Ensure that missing values are reported correctly.
The tags CICN, CIEND, CILEN, CIPOS, MEINFO, and METRANS are
defined as Number=. in the VCF 4.4+ specification, but with the
additional rule that they're really either 2 x Number=A for the
CI* tags or 4 x Number=A for the ME* ones.

Handle this by spotting them, and setting the multiple we expect
to see so values are treated as groups of two or four.
@daviesrob
Copy link
Member Author

Rebased complete with updates to make bcf_remove_allele_set() work with local alleles, including tests.

@daviesrob daviesrob marked this pull request as ready for review September 1, 2025 11:13
vcfutils.c Outdated
if (!new_info)
return -1;
memcpy(new_info, buf, new_len);
memcpy(new_info + new_len, buf, info->vptr_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be buf or info->vptr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should be fixed now.

Tandem repeats are described using a mini-format of INFO tags,
namely RN, RUS, RUL, RB, RUC, CIRB, CIRUC and RUB.  The number
of items in RUS, RUL, RB, CIRB, RUC and CIRUC depend on the values
in RN.  The number of items in RUB depend on the values in RUC
(and therefore also RN).  Removing an allele means we have to
carefully update these, ensuring it's done in the right order
so the information needed by a dependent tag is still present
when it is processed.

Removing tag values is done by shifting the data pointed
to by info->vptr around where necessary.  While more complicated
than using bcf_get_info_values() followed by bcf_update_info(),
it should be much faster as in most cases no memory allocations
will be needed to make the desired changes.
jkbonfield and others added 5 commits September 18, 2025 17:14
Also fix LPP typo of type "LP" instead of "LG".
-1 is now a valid value in map[]
We need to check against the original allele count, not the
updated one.
If all alleles referenced by LAA tags in a VCF record are removed,
the MISSING value needs to be stored for each sample.  This stops
bcf_update_format_int32() from removing the LAA tag completely,
which would invalidate other tags like LAD and LPL that depend
on LAA being present.
If all the alleles referenced by CNV:TR related INFO tags are
removed, the INFO tags will no longer have any values and
so also need to be removed.
@daviesrob
Copy link
Member Author

Added a few fixes:

  • Check for valid GTs needed to allow map[] values being -1 (indicating missing)
  • Use nR_ori when checking original LAA values for validity
  • Prevent the LAA tag from disappearing if all referenced alleles are removed. It should instead be set as MISSING.
  • Remove the CNV:TR related INFO tags if all CNV:TR alleles have been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants