How to unpack an .xz file with Julia

Is there a package that can decompress .xz files?

I need to do this in a cross-platform way.

https://github.com/JuliaIO/TranscodingStreams.jl + https://github.com/JuliaIO/CodecXz.jl

Thanks a lot!

The following code works for me:

using CodecXz

const FILENAME="data/log_8700W_8ms.csv.xz"

stream = open(FILENAME)
output = open(FILENAME[1:end-3],"w")
for line in eachline(XzDecompressorStream(stream))
    println(output, line)
end
close(stream)
close(output)

But it will work only for decompressing text files.

How can it be generalized for any files?

What do you mean by “any file”? The code you wrote is generic enough; there should be no difference between decompressing text and any other file (at the end of the day, all files are just bytes).

I mean, eachline will only work if the stream contains line delimiters, or am I wrong?

Ah, that’s the best part actually! Since TranscodingStream produces an IO object, everything in General IO applies. At this point it doesn’t matter whether the data is compressed or not.

What follows depends on the packages you use or the procedure you need to implement. You can materialize the data as a Vector{UInt8} or in any other format. Or maybe your package accepts this IO object directly and you can forget about handling compressed data completely.
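To make the "General IO applies" point concrete, here is a minimal sketch that builds a tiny xz payload in memory and reads it back with ordinary Base read calls. The payload bytes are made up for illustration; transcode and XzCompressor come from CodecXz / TranscodingStreams.

```julia
using CodecXz

# Made-up binary payload (includes \r, \n and 0xff on purpose).
payload = UInt8[0x00, 0x0d, 0x0a, 0xff, 0x41, 0x42]
compressed = transcode(XzCompressor, payload)

# Wrap any IO (here an in-memory buffer) in a decompressing stream.
stream = XzDecompressorStream(IOBuffer(compressed))

first_byte = read(stream, UInt8)  # a single byte
next_two   = read(stream, 2)      # up to 2 bytes as a Vector{UInt8}
rest       = read(stream)         # everything that is left

close(stream)
```

None of these calls care about line delimiters, so the same pattern should work for binary data.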

As an example, consider the following manipulations of an xz-compressed Arrow file:

using CodecXz
using Arrow
using Tables

x = [(; a = 1, b = 2)]
Arrow.write("x.arrow", x)

Here we switch to the shell and compress the data manually. It can be done with CodecXz of course, but let's pretend that this is an external file.

sh> xz x.arrow

and back to Julia

julia> stream = XzDecompressorStream(open("x.arrow.xz"))
TranscodingStreams.TranscodingStream{XzDecompressor, IOStream}(<mode=idle>)

# we can materialize uncompressed data as Vector{UInt8}
julia> read(stream)
610-element Vector{UInt8}:
 0x41
 0x52
 0x52
    ⋮

# we can read it as a String (which is weird of course, since it is a binary file)
julia> read(stream, String)
"ARROW1\0\0\xff\xff\xff\xff\xa8\0\0\0\x10\0\0\0\0\0\n\0\f\0\n\0\b\0\x04\0\n\0\0\0\x10\0\0\0\x01\
0\x04\0\b\0\b\0\0\0\x04\0\b\0\0\0\x04\0\0\0\x02\0\0\0D\0\0\0\x04\0\0\0\xd4\xff\xff\xff\x10\0\0\0..."

# and we can read it back as Julia data
julia> Arrow.Table(stream) |> Tables.rowtable
1-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)

One small note: in these manipulations I actually run stream = XzDecompressorStream(open("x.arrow.xz")) before each read, because you can’t read twice from the same stream. I just omitted it for simplicity.
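If reopening the stream every time gets tedious, one option is to decompress once into memory and hand out fresh IOBuffers over the same bytes. A minimal sketch, assuming the file fits in memory; the temporary file and its contents are made up so the snippet runs on its own:

```julia
using CodecXz

# Create a small compressed file that stands in for an external .xz file.
path = joinpath(mktempdir(), "demo.bin.xz")
write(path, transcode(XzCompressor, Vector{UInt8}("ARROW1 demo payload")))

# Decompress once; open(f, path) closes the file when f returns.
bytes = open(io -> read(XzDecompressorStream(io)), path)

# Each IOBuffer is an independent reader over the same decompressed bytes.
as_string = read(IOBuffer(bytes), String)
as_bytes  = read(IOBuffer(bytes))
```

For large files, reopening the stream is probably still the better choice, since this keeps the whole decompressed content in memory.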

The only functions in your snippet that care about line delimiters are eachline and println. For example, read(XzDecompressorStream(stream)) would read the whole decompressed content as a vector of raw bytes.
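Building on that, a byte-for-byte version of the original loop could look roughly like this. It is only a sketch: decompress_xz is a hypothetical helper name, and it relies on Base's write(to::IO, from::IO), which copies bytes until the source reaches end of file.

```julia
using CodecXz

# Hypothetical helper: decompress the xz file `src` into `dst`
# without interpreting the content as lines of text.
function decompress_xz(src::AbstractString, dst::AbstractString)
    open(src) do input
        open(dst, "w") do output
            # Copy all decompressed bytes from the stream into the output file.
            write(output, XzDecompressorStream(input))
        end
    end
    return dst
end

# Usage with the file name from the question:
# decompress_xz("data/log_8700W_8ms.csv.xz", "data/log_8700W_8ms.csv")
```

Because nothing here splits on newlines, carriage returns and a missing trailing newline are preserved, which the eachline/println loop does not guarantee.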