Arrow.jl: reading compressed (lz4 and zstd) Arrow/Feather files may leak memory

jesseventura · December 21, 2020, 2:41pm

See: memory leaking when reading compressed arrow files · Issue #80 · apache/arrow-julia · GitHub

I found that the memory leak occurs only when reading a compressed Arrow file (lz4 or zstd).
Uncompressed files are fine.
Could this be a bug in one of the third-party libraries?

Test environment: Julia 1.5.3 (stable) / Julia 1.7-dev
Arrow 1.0.3
Linux Mint 20

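using Arrow
using ProgressMeter
using DataFrames

# Write five 100000×1000 tables, each in an lz4 and a zstd variant.
for i in 1:5
    # The :auto argument (column names x1, x2, …) is needed on recent DataFrames versions.
    df = DataFrame(rand(Float64, (100000, 1000)), :auto)
    Arrow.write("data.$i.feather.lz4", df, compress=:lz4)
    Arrow.write("data.$i.feather.zstd", df, compress=:zstd)
end

# Read the compressed files back (the file names must match those written above).
@showprogress 1 "read repeatedly" for i in 1:5
    for ext in ("lz4", "zstd")
        df = DataFrame(Arrow.Table("data.$i.feather.$ext"))
    end
end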

Also, in CodecLz4, frame_compression.jl implements initialize and finalize…

"""
    TranscodingStreams.finalize(codec::LZ4FrameDecompressor)

Finalizes the LZ4F Decompression Codec.
"""
function TranscodingStreams.finalize(codec::LZ4FrameDecompressor)::Nothing
    LZ4F_freeDecompressionContext(codec.dctx[])
    nothing
end

"""
Releases the memory of a `LZ4F_cctx`.
"""
function LZ4F_freeCompressionContext(cctx::Ptr{LZ4F_cctx})
    ccall((:LZ4F_freeCompressionContext, liblz4), Csize_t, (Ptr{LZ4F_cctx},), cctx)
end

jesseventura · December 29, 2020, 7:16am

The leak shows up especially when reading Feather files exported from Python…

Here is the Python code that creates several Feather files…

import os

import numpy as np
import pandas as pd
from pyarrow.feather import write_feather

for i in range(5):
    for comp_type in ['lz4', 'zstd']:
        fout = 'pydata.{}.feather.{}'.format(i + 1, comp_type)
        # Skip files that already exist.
        if os.path.isfile(fout):
            continue
        write_feather(pd.DataFrame(np.random.rand(100000, 1000)), fout, compression=comp_type)

And here is the Julia code that reads them repeatedly:

using Arrow
using DataFrames
using ProgressMeter  # provides @showprogress

@showprogress 1 "read repeatedly" for ii in 1:10
    i = mod(ii, 5) + 1  # cycle through files 1 to 5
    @time df = copy(DataFrame(Arrow.Table("pydata.$i.feather.zstd"), copycols=true))
    finalize(df)
    sleep(1)
end

And here is the memory usage of the process, which climbs to roughly 16 GB:

PID	VSZ	RSS
55623	19133564	17207488
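
For anyone who wants to watch the same two numbers while the read loop runs, here is a minimal sketch in Python (standard library only). It assumes a Linux system, where /proc/<pid>/status reports VmSize and VmRSS in kB; the function names parse_status and read_memory are made up for this illustration and do not come from Arrow.jl or any other library.

```python
def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    wanted = {"VmSize": "vsz_kb", "VmRSS": "rss_kb"}
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "VmRSS:\t17207488 kB"; the first token is the number.
            result[wanted[key]] = int(rest.split()[0])
    return result


def read_memory(pid):
    """Read the current VSZ and RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())
```

Calling read_memory(55623) once per second while the Julia loop is running (with the PID of your own Julia process) should show whether RSS keeps growing or levels off after the first few reads.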

jesseventura · May 6, 2021, 3:41am

The issue has been fixed.

jesseventura · August 12, 2021, 1:47am

https://github.com/JuliaData/Arrow.jl/issues/93
