Partially decompressing Bzip2 files

As the title of the question states, I have a few *.csv.bzip2 compressed files that are as big as 13GB when decompressed. The files have a structured format and are essentially CSV files.

I’d like to read the headers of the files to generate a SQL schema so that I can import them into an RDBMS for more efficient querying and exploration.

I’ve looked at the CodecBzip2 package, but it hangs whenever I attempt to decode a partial byte vector.

Some sample code (omitting proper handling for brevity):
Setup:

] add CodecBzip2, TranscodingStreams, CSV

Script:

using TranscodingStreams, CodecBzip2

f = open("/path/to/csv.bzip2", "r")
zipdata = read(f, 2048)

# The following line hangs
d = transcode(Bzip2Decompressor, zipdata)

println(String(d))

My understanding is that the algorithm’s compressed blocks should be independently decompressible, so I’m most likely not grabbing “valid” data blocks for decompression by just reading the first N bytes of the file.

Are there any packages that would offer partial decompression? Has anyone done something similar?

Many thanks in advance!

You can use the streaming version of the decompressor, since you only want to read some bytes.

shell> bzcat test.csv.bz2
a,b
1,"hello"

julia> using CodecBzip2

julia> x = open("test.csv.bz2") do compressed
           decompressed = Bzip2DecompressorStream(compressed)
           String(read(decompressed, 10))
       end
"a,b\n1,\"hel"
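
If the goal is only the header row, the same streaming approach can probably be pushed one step further by reading a single line instead of a fixed number of bytes. The following is a minimal sketch, not a tested recipe for 13GB files: `read_header` is a hypothetical helper name, and the naive `split` on `delim` assumes that no column name contains a quoted delimiter.

```julia
using CodecBzip2

# Hypothetical helper: return the column names found on the first line of a
# bzip2-compressed CSV file. Only the data needed to reach the first newline
# is decompressed; the rest of the file is never read.
function read_header(path::AbstractString; delim::AbstractString = ",")
    open(path) do compressed
        decompressed = Bzip2DecompressorStream(compressed)
        # readline stops at the first '\n' and drops it from the result
        header = readline(decompressed)
        # Assumption: column names do not contain quoted delimiters
        return split(header, delim)
    end
end
```

With the `test.csv.bz2` file above, `read_header("test.csv.bz2")` should give the two column names `a` and `b`. For headers with quoted fields, passing the decompressed stream to a real CSV parser would be more robust than `split`.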

Hi @fredrikekre! Thank you so much, this works great!