Arrow.jl: reading compressed (lz4 and zstd) Arrow/Feather files may leak memory

See: memory leaking when reading compressed arrow files · Issue #80 · apache/arrow-julia · GitHub

I found that the memory leak occurs only when the Arrow file is compressed (lz4 or zstd).
Uncompressed files are fine.
Could this be a bug in the third-party compression libraries?

Test environment: Julia 1.5.3 (stable) / Julia 1.7-dev
Arrow 1.0.3
Linux Mint 20

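using Arrow
using ProgressMeter
using DataFrames

# Write five files per codec, each holding a 100000 x 1000 table of random Float64.
for i in 1:5
    # :auto generates column names (needed on DataFrames 0.22 and later)
    df = DataFrame(rand(Float64, (100000, 1000)), :auto)
    Arrow.write("data.$i.feather.lz4", df, compress=:lz4)
    Arrow.write("data.$i.feather.zstd", df, compress=:zstd)
end

# Read the lz4 files back; change the extension to .zstd to test the other codec.
@showprogress 1 "read repeatedly" for i in 1:5
    df = DataFrame(Arrow.Table("data.$i.feather.lz4"))
end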

And in CodecLz4, frame_compression.jl implements initialize and finalize…

"""
    TranscodingStreams.finalize(codec::LZ4FrameDecompressor)

Finalizes the LZ4F Decompression Codec.
"""
function TranscodingStreams.finalize(codec::LZ4FrameDecompressor)::Nothing
    LZ4F_freeDecompressionContext(codec.dctx[])
    nothing
end
"""
Releases the memory of a `LZ4F_cctx`.
"""
function LZ4F_freeCompressionContext(cctx::Ptr{LZ4F_cctx})
    ccall((:LZ4F_freeCompressionContext, liblz4), Csize_t, (Ptr{LZ4F_cctx},), cctx)
end
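As a minimal sketch of why `initialize` and `finalize` matter: when `transcode` is given a codec instance rather than a codec type, TranscodingStreams leaves the codec's lifecycle to the caller, so the native context is only released if `finalize` is called. The helper name `decompress_lz4` below is hypothetical and not part of CodecLz4 or Arrow.jl, and this does not claim to show where Arrow.jl's leak actually came from.

```julia
using CodecLz4            # third-party package, assumed to be installed
using TranscodingStreams

# Hypothetical helper: decompress one LZ4 frame with an explicitly managed codec.
function decompress_lz4(compressed::Vector{UInt8})
    codec = LZ4FrameDecompressor()
    TranscodingStreams.initialize(codec)       # allocates the native decompression context
    try
        return transcode(codec, compressed)    # instance form: no implicit initialize/finalize
    finally
        TranscodingStreams.finalize(codec)     # frees the native context even if transcode throws
    end
end

data = rand(UInt8, 1_000)
compressed = transcode(LZ4FrameCompressor, data)   # type form manages the codec internally
restored = decompress_lz4(compressed)
```

If the `finalize` call were skipped, the memory behind the context would live outside Julia's garbage collector and would not be reclaimed by `GC.gc()`.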

The leak is especially noticeable when reading Feather files exported from Python…

Here is the Python code that creates several Feather files:

import os

import numpy as np
import pandas as pd
from pyarrow.feather import write_feather

# Write five files per codec, skipping any that already exist.
for i in range(5):
    for comp_type in ['lz4', 'zstd']:
        fout = 'pydata.{}.feather.{}'.format(i + 1, comp_type)
        if os.path.isfile(fout):
            continue
        write_feather(pd.DataFrame(np.random.rand(100000, 1000)), fout, compression=comp_type)

And here is the Julia code that reads them repeatedly:

using Arrow
using DataFrames
using ProgressMeter

@showprogress 1 "read repeatedly" for ii in 1:10
    i = mod(ii, 5) + 1   # cycle through the five files
    @time df = copy(DataFrame(Arrow.Table("pydata.$i.feather.zstd"), copycols=true))
    finalize(df)
    sleep(1)
end

And here is the memory usage of the process, with RSS at roughly 16 GB:

PID VSZ RSS
55623 19133564 17207488
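To watch the growth from inside Julia rather than with an external tool, a rough sketch like the following can log peak resident memory after each iteration. `track_peak_rss` is a hypothetical helper, and the workload here is a stand-in allocation; in practice one would pass a closure that calls `Arrow.Table` on a compressed file.

```julia
# Hypothetical helper: run `f` n times and record the process's peak RSS after each run.
function track_peak_rss(f, n::Integer)
    peaks = Int[]
    for _ in 1:n
        f()
        GC.gc()                           # collect garbage so ordinary Julia objects are not counted as a leak
        push!(peaks, Int(Sys.maxrss()))   # peak resident set size so far, in bytes
    end
    return peaks
end

# Stand-in workload; replace with e.g. () -> Arrow.Table("pydata.1.feather.zstd")
peaks = track_peak_rss(() -> sum(rand(1_000, 1_000)), 5)

for (k, p) in enumerate(peaks)
    println("iteration $k: peak RSS = ", round(p / 2^20; digits=1), " MiB")
end
```

Because `Sys.maxrss()` reports a peak, the values never decrease. A healthy loop should level off after the first few iterations, while a leak shows up as a peak that keeps climbing.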

The issue has been fixed.


https://github.com/JuliaData/Arrow.jl/issues/93