Partially decompressing Bzip2 files

tristian · March 22, 2023, 7:04am

As the title of the question states, I have a few *.csv.bzip2 compressed files that are as big as 13GB when decompressed. The files have a structured format and are essentially CSV files.

I’d like to read the headers of the files to generate a SQL schema so that I can import them into an RDBMS for more efficient querying and exploration.

I’ve looked at the CodecBzip2 package, but it hangs whenever I attempt to decode a partial byte vector.

Some sample code (omitting proper handling for brevity):
Setup:

] add CodecBzip2, TranscodingStreams, CSV

Script:

using TranscodingStreams, CodecBzip2

f = open("/path/to/csv.bzip2", "r")
zipdata = read(f, 2048)

# The following line hangs
d = transcode(Bzip2Decompressor, zipdata)

println(String(d))

My understanding is that the algorithm’s compressed blocks should be independently decompressible; I’m most likely not grabbing “valid” data blocks for decompression by just reading the first N bytes of the file.

Are there any packages that offer partial decompression? Has anyone done something similar?

Many thanks in advance!

fredrikekre · March 22, 2023, 7:48am

Since you only want to read some bytes, you can use the stream version of the decompressor.

shell> bzcat test.csv.bz2
a,b
1,"hello"

julia> using CodecBzip2

julia> x = open("test.csv.bz2") do compressed
           decompressed = Bzip2DecompressorStream(compressed)
           String(read(decompressed, 10))
       end
"a,b\n1,\"hel"
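
For the schema use case in the question, the same stream can be read one line at a time instead of by byte count. The following is only a minimal sketch: `read_header` and `schema_sql` are hypothetical helper names, every column is assumed to be `TEXT`, and the header is split naively on commas, so it assumes column names contain no quoted commas.

```julia
using CodecBzip2

# Read only the first line (the CSV header) of a bzip2-compressed file.
# The stream decompresses on demand, so the whole file is not decoded.
function read_header(path::AbstractString)
    open(path) do compressed
        decompressed = Bzip2DecompressorStream(compressed)
        readline(decompressed)
    end
end

# Hypothetical helper: build a CREATE TABLE statement from a header line.
# Assumption: no quoted commas in the header; every column is typed TEXT.
function schema_sql(table::AbstractString, header::AbstractString)
    cols = strip.(split(header, ','))
    defs = join(("\"$(c)\" TEXT" for c in cols), ", ")
    "CREATE TABLE $(table) ($(defs));"
end
```

With the `test.csv.bz2` file above, `schema_sql("test", read_header("test.csv.bz2"))` should give a statement with columns `a` and `b`. Real column types would still need to be inferred from the data rows, for example by reading a few more lines from the same stream.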

tristian · March 23, 2023, 1:00am

Hi @fredrikekre! Thank you so much, this works great!
