Is there a package that can decompress .xz files?
I need to do this in a cross-platform way.
Thanks a lot!
The following code works for me:
using CodecXz
const FILENAME="data/log_8700W_8ms.csv.xz"
stream = open(FILENAME)
output = open(FILENAME[1:end-3],"w")
for line in eachline(XzDecompressorStream(stream))
println(output, line)
end
close(stream)
close(output)
But it will work only for decompressing text files.
How can it be generalized for any files?
What do you mean by “any file”? The code you wrote is generic enough; there should be no difference between decompressing a text file and any other file (at the end of the day, all files are just bytes).
I mean, eachline will only work if the stream contains line delimiters, or am I wrong?
Ah, that’s the best part, actually! Since a TranscodingStream is an IO object, everything under General I/O applies. At this point it doesn’t matter whether the underlying data is compressed or not.
What follows depends on the packages you use or the procedure you need to implement. You can materialize the data as a Vector{UInt8} or in any other format. Or maybe your package accepts this IO object directly, and you can forget about handling compressed data entirely.
As an example, consider the following manipulations with an xz-compressed Arrow file:
using CodecXz
using Arrow
using Tables
x = [(; a = 1, b = 2)]
Arrow.write("x.arrow", x)
Here we switch to the shell and compress the data manually. It could be done with CodecXz, of course, but we pretend that this is an external file.
sh> xz x.arrow
and back to Julia
julia> stream = XzDecompressorStream(open("x.arrow.xz"))
TranscodingStreams.TranscodingStream{XzDecompressor, IOStream}(<mode=idle>)
# we can materialize uncompressed data as Vector{UInt8}
julia> read(stream)
610-element Vector{UInt8}:
0x41
0x52
0x52
⋮
# we can read it as a String (which is weird, of course, since it is a binary file)
julia> read(stream, String)
"ARROW1\0\0\xff\xff\xff\xff\xa8\0\0\0\x10\0\0\0\0\0\n\0\f\0\n\0\b\0\x04\0\n\0\0\0\x10\0\0\0\x01\
0\x04\0\b\0\b\0\0\0\x04\0\b\0\0\0\x04\0\0\0\x02\0\0\0D\0\0\0\x04\0\0\0\xd4\xff\xff\xff\x10\0\0\0..."
# and we can read it back as Julia data
julia> Arrow.Table(stream) |> Tables.rowtable
1-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 1, b = 2)
One small note: in these manipulations, before each read I actually rerun stream = XzDecompressorStream(open("x.arrow.xz")), because you can’t read twice from the same stream. I just omitted it for simplicity.
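To illustrate the point about not reading twice, here is a minimal sketch. The file name and payload are made up so that the example is self-contained; it shows one read exhausting the stream, and one possible workaround: read the decompressed bytes once and wrap them in a fresh IOBuffer for each consumer.

```julia
using CodecXz

# Build a small .xz file so the sketch is self-contained (made-up contents).
path = joinpath(mktempdir(), "sample.bin.xz")
payload = UInt8[0x41, 0x52, 0x52, 0x4f, 0x57, 0x31, 0x00, 0xff]
write(path, transcode(XzCompressor, payload))

# The first read consumes the stream; the second finds it already at EOF.
stream = XzDecompressorStream(open(path))
first_pass = read(stream)
second_pass = read(stream)
close(stream)

# Workaround: decompress once, then hand each consumer its own IOBuffer.
bytes = open(read, XzDecompressorStream, path)
as_bytes = read(IOBuffer(bytes))
as_string = read(IOBuffer(bytes), String)
```

This keeps the whole decompressed content in memory, so for large files reopening the stream before each read, as described above, may be the better choice.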
The only functions in your snippet that care about line delimiters are eachline and println. For example, read(XzDecompressorStream(stream)) would read the whole decompressed content as a vector of raw bytes.
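Building on that, the original snippet could be generalized to arbitrary files by copying raw bytes instead of lines. The following is only a sketch: decompress_xz and the chunk keyword are hypothetical names introduced here, not part of CodecXz, and the 64 KiB chunk size is an arbitrary choice.

```julia
using CodecXz

# Hypothetical helper: decompress `src` into `dst` byte by byte,
# without assuming that the content has any line structure.
function decompress_xz(src::AbstractString, dst::AbstractString; chunk::Integer = 64 * 1024)
    open(XzDecompressorStream, src) do stream
        open(dst, "w") do output
            # Copy in bounded chunks so large files need not fit in memory.
            while !eof(stream)
                write(output, read(stream, chunk))
            end
        end
    end
    return dst
end
```

With the file from the original question, this would be called as decompress_xz(FILENAME, FILENAME[1:end-3]). For small files, write(dst, read(XzDecompressorStream(open(src)))) does the same job in one step, at the cost of holding everything in memory.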