Streaming gzipped BCF files: issue with CHROM field

Hello,

I first posted this as an issue on the GeneticVariation github:

However, I suspect this may be a better place for it, since the package may not be at fault here. I am copying the issue fuly below; any thoughts would be appreciated!


Hello,

I am unable to get BCF.chrom to output anything other than 1 (Julia 1.1.0; RHEL 7.6).

I tested this with multiple BCF files. When inspected with bcftools view, these same files show the correct chromosome name, in the expected field:

> bcftools view bcftest.bcf.gz
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype calls">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##contig=<ID=22>
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype call probabilities">
##INFO=<ID=QUAL,Number=1,Type=Float,Description="Imputation Quality">
##INFO=<ID=TAG,Number=1,Type=String,Description="rs_id from annotation files">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view bcftest.bcf.gz; Date=Wed Jun 12 02:41:11 2019
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  XXXX
22      16050075        22:16050075:A:G A       G       0.123   PASS    TAG=22:16050075:A:G;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0
22      16050115        22:16050115:G:A G       A       0.404   PASS    TAG=22:16050115:G:A;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0

However, BCF.chrom tells a different story:

julia> using GeneticVariation

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> record = read(reader)
GeneticVariation.BCF.Record:
   chromosome: 1
     position: 16050075
   identifier: 22:16050075:A:G
    reference: A
    alternate: G
      quality: 0.123
       filter: 1
  information: Tuple{Int64,Any}[(5, "22:16050075:A:G"), (6, 0), (7, 0)]

julia> BCF.chrom(record)
1

julia> BCF.pos(record)
16050075


The issue doesn’t seem connected to the actual implementation of BCF.chrom(). Reading straight from the BGZFStream causes the same issue:

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> misc = read(reader.stream, Int64) #skip first 8 bytes
115964117068

julia> chrom = read(reader.stream, Int32) %Int + 1 #next 4 bytes are CHR
1

julia> pos = read(reader.stream, Int32) %Int + 1 #next 4 bytes are POS
16050075



Reproducible example attached: bcftest.bcf.gz

1 Like

Looks like I should have persevered just a tad longer before posting! In case anyone wonders, the solution can be found in the VCFv4.3 specs.

Long story short, (I think) this is an issue with how GeneticVariation handles BCF files; instead of storing the explicit chromosome name like in plaintext VCF format, the CHROM field in BCF files stores indices to the list of ‘contig’ field header.

In the example BCF file, there is only one ##contig, with corresponding value 22. Instead of 1, BCF.Chrom should have returned the value specified by the first ##contig, i.e. 22.

See page 28 of the specs:

6.2.2 Dictionary of contigs
The CHROM field in BCF2 is encoded as an integer offset into the list of ##contig field headers in the VCF header. The offsets begin, like the dictionary of strings, at 0. So for example if in BCF2 the contig value is 10, this indicates that the actual chromosome is the 11th element in the ordered list of ##contig elements […]

1 Like