As the title of the question states, I have a few *.csv.bzip2 compressed files that are as big as 13 GB when decompressed. The files have a structured format and are essentially CSV files.
I’d like to read the headers of the files to generate a SQL schema so that I can import them into an RDBMS for more efficient querying and exploration.
I’ve looked at the CodecBzip2 package, but it hangs whenever I attempt to decode a partial byte vector.
Some sample code (omitting proper handling for brevity):
] add CodecBzip2, TranscodingStreams, CSV
using TranscodingStreams, CodecBzip2
f = open("/path/to/csv.bzip2", "r")
zipdata = read(f, 2048)
# The following line hangs
d = transcode(Bzip2Decompressor, zipdata)
println(String(d))
My understanding is that the algorithm’s compressed blocks should be independently decompressible, so I’m most likely not grabbing “valid” data blocks for decompression by just reading the first N bytes of the file.
Are there any packages that would offer partial decompression? Has anyone done something similar?
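One direction that might avoid slicing the compressed bytes by hand is to wrap the open file in a Bzip2DecompressorStream (exported by CodecBzip2) and read only the first line, so the stream pulls in just as much compressed input as it needs. A minimal sketch, assuming the header is the first line of the file; `read_bz2_header` is a hypothetical helper name, not part of any package:

```julia
using CodecBzip2

# Hypothetical helper: return the first line (assumed to be the CSV header)
# of a bzip2-compressed file without decompressing the rest of it.
function read_bz2_header(path::AbstractString)
    open(path, "r") do io
        # Wrap the raw file in a decompressing stream; data is decoded lazily
        # as it is read, rather than all at once as with `transcode`.
        stream = Bzip2DecompressorStream(io)
        try
            return readline(stream)
        finally
            close(stream)  # also closes the wrapped file handle
        end
    end
end
```

Calling `split(read_bz2_header("/path/to/csv.bzip2"), ',')` would then give candidate column names, although a naive split does not handle quoted fields that contain commas.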
Many thanks in advance!