As the title of the question states, I have a few *.csv.bzip2
compressed files that are as large as 13 GB when decompressed. They are structured and are essentially plain CSV files.
I’d like to read the headers of the files to generate a SQL schema so that I can import them into an RDBMS for more efficient querying and exploration.
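To make the end goal concrete, the schema-generation step I have in mind looks roughly like the sketch below; csv_header_to_ddl is a hypothetical helper, and typing every column as TEXT is just a placeholder until I can sample real rows:

# Hypothetical helper: turn a CSV header row into a CREATE TABLE
# skeleton. Every column defaults to TEXT as a placeholder type.
function csv_header_to_ddl(table::AbstractString, header::AbstractString)
    cols = strip.(split(header, ","))
    body = join(string.(cols, " TEXT"), ",\n    ")
    return "CREATE TABLE $table (\n    $body\n);"
end

println(csv_header_to_ddl("mytable", "id,name,created_at"))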
I’ve looked at the CodecBzip2 package, and it hangs whenever I attempt to decode a partial byte vector.
Some sample code (omitting proper handling for brevity):
Setup:
] add CodecBzip2, TranscodingStreams, CSV
Script:
using TranscodingStreams, CodecBzip2
f = open("/path/to/csv.bzip2", "r")
zipdata = read(f, 2048)
# The following line hangs
d = transcode(Bzip2Decompressor, zipdata)
println(String(d))
My understanding is that bzip2’s compressed blocks should be independently decompressible; I’m most likely not handing the decompressor “valid” data blocks by just reading the first N bytes of the file.
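For completeness, here is the streaming variant I’ve been sketching instead. It assumes that a Bzip2DecompressorStream wrapped around the file handle only pulls and decompresses as much data as readline actually needs; I haven’t verified that against one of the 13 GB files yet:

using CodecBzip2

open("/path/to/csv.bzip2", "r") do io
    # Wrap the raw file handle in a lazily decompressing stream.
    stream = Bzip2DecompressorStream(io)
    # readline should only consume as much compressed input as is
    # needed to produce one decompressed line (the CSV header).
    header = readline(stream)
    println(split(header, ","))
end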
Are there any packages that offer partial decompression? Has anyone done something similar?
Many thanks in advance!