Using bgzipped VCF files with GeneticVariation.jl

I’m interested in using GeneticVariation.jl to work with VCF files. My VCF files are compressed with bgzip and are indexed using tabix.

I’m looking at the documentation for reading in VCF files here: https://github.com/BioJulia/GeneticVariation.jl/blob/master/docs/src/io/vcf-bcf.md . The document suggests to me that the VCF reader should be used with uncompressed VCFs.

It is my understanding that I could change
reader = VCF.Reader(open("example.vcf", "r"))
to
reader = VCF.Reader(open(`zless example.vcf.gz`, "r"))

to read in a compressed VCF. Is this the recommended way of working with compressed VCF files? It seems that there’s a reader for BCF files as well, though I’d rather not have to convert all of my vcf.gz files to BCFs.


You might want to look at the Libz.jl package. It can handle .gz files directly and is almost as fast as reading uncompressed files.

Hi, I’m a developer of both GeneticVariation.jl and Libz.jl.

I’d rather recommend using CodecZlib.jl to decompress gzip files. CodecZlib.jl is a member of the TranscodingStreams.jl family of packages, which offers consistent APIs for various compression formats. I’m going to support automatic gzip decompression using CodecZlib.jl in our bio packages, but for now you can write it like this:

using GeneticVariation
using CodecZlib

reader = VCF.Reader(GzipDecompressionStream(open("example.vcf.gz")))
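
As a rough sketch of how the decompressed stream can then be consumed, the snippet below wraps the reader in a small helper that collects the chromosome and position of every record. The helper name read_positions is hypothetical, and the snippet assumes a recent CodecZlib.jl release, where the decompression stream is exported as GzipDecompressorStream (older releases used the name GzipDecompressionStream, as in the line above). VCF.chrom and VCF.pos are the record accessors described in the GeneticVariation.jl docs.

```julia
using CodecZlib          # gzip codecs built on TranscodingStreams.jl
using GeneticVariation   # provides VCF.Reader and the VCF record accessors

# Hypothetical helper (not part of either package): collect the
# (chromosome, position) pair of every record in a gzip-compressed VCF.
# Assumption: recent CodecZlib.jl exports GzipDecompressorStream; on older
# releases, swap in GzipDecompressionStream.
function read_positions(path::AbstractString)
    reader = VCF.Reader(GzipDecompressorStream(open(path)))
    positions = Tuple{String,Int}[]
    try
        for record in reader
            push!(positions, (VCF.chrom(record), VCF.pos(record)))
        end
    finally
        close(reader)  # release the reader once iteration is done
    end
    return positions
end
```

Calling read_positions("example.vcf.gz") should then return one tuple per variant record. Since bgzip output is a series of concatenated gzip blocks, a plain gzip decompressor is expected to read it sequentially, but this does not make use of the tabix index.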

This seemed to do the trick. Thanks for the answer and for developing GeneticVariation.jl + Libz.jl!