Releasing process memory from Arrow.jl

Hi,

I make heavy use of the (incredible) Arrow.jl package provided by @quinnj to do API-exposed analytics running in Docker containers. The files are large: c. 65GB, 35M rows, 250 columns, and the analytics queries execute across random index subsets with combinations of random columns to filter and calculate on.

This all works fantastically, and I can run millions of calculations over a long period of time, with memory managed either through the new heap-size-hint flag or via manual, periodic GC at safepoints, until it grinds to a halt (whether running in Docker or locally).
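For context, the memory management side looks roughly like this (a simplified sketch, not my actual service code; do_work is just a stand-in for the real analytics queries, and 6GB is the hint I use in my deployment): the container entrypoint starts Julia with --heap-size-hint, and I call the GC manually at safepoints between batches.

# Simplified sketch of the pattern – Julia is started with e.g.
#   julia --heap-size-hint=6G service.jl
# and a full collection is triggered manually at a safepoint between batches.
# do_work() is just a stand-in for the real analytics queries.
do_work() = sum(rand(10_000))

for batch in 1:100
    for _ in 1:1_000
        do_work()
    end
    GC.gc()   # manual, periodic GC between batches
end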

I’m aware that Arrow reads through mmapping: the Julia process shows a very small footprint (1-2GB) in Docker/task manager, yet the overall pod/machine memory view grows without bound, eventually causing the pod or process to crash.

Running something like:

using Arrow, DataFrames, Tables

arrowfile = raw"yourArrowFileLocation.arrow"

a = Arrow.Table(arrowfile)
tblFile = DataFrames.DataFrame(a)
tblSchema = Tables.schema(a)
tblColumnNames = Tables.columnnames(a)

@show tblSchema.types

col1 = a[1]
@show length(col1)

function evaluateAllocation(a)
    for c in Tables.columns(a)
        col = @view c[1:length(c)]
        for v in 1:length(c)
            col[v]
        end
    end
end

@allocated evaluateAllocation(a)

recreates the situation (if the file is large enough), in that my Julia process shows a small footprint but my machine/pod memory is maxed out or crashing:
[task manager / pod memory screenshot]

I can’t see in the process view what is holding the memory, although I assume it must be the OS swap, and restarting the Julia process/pod releases it. However, in cloud Docker deployments it essentially causes execution times to grind to a halt until the pod crashes or, worse, it just hangs.
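The closest I can get to observing it from inside the process is logging the Julia RSS against free system memory, something like this (just a diagnostic sketch, not part of the reproduction above):

function log_memory(label)
    # peak resident set size of this Julia process vs free physical memory on the machine/pod
    rss_gb  = Sys.maxrss() / 2^30
    free_gb = Sys.free_memory() / 2^30
    println("$label: maxrss = $(round(rss_gb; digits=2)) GB, free = $(round(free_gb; digits=2)) GB")
end

log_memory("before queries")
# ... run queries here ...
log_memory("after queries")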

Is there any programmatic way to release this memory without restarting the process/pod and is this expected behaviour?

Regards

Is your Docker setup a Windows system as well? I think it’s still the case that files can’t be closed on Windows; this looks like a related issue: Underlying file still referenced even with `copycols=true` · Issue #226 · apache/arrow-julia · GitHub

Do you see the same issue on Linux?
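If you can reproduce it on Linux, it might also be worth checking whether the growth is file-backed page cache (the mmapped Arrow file) or anonymous memory (Julia’s own heap). Something like this (Linux-only, rough sketch) prints the kernel’s per-process breakdown:

# Linux-only sketch: /proc/self/smaps_rollup summarises the process's mappings,
# separating file-backed pages (the mmapped Arrow file) from anonymous memory
# (Julia's heap).
for line in eachline("/proc/self/smaps_rollup")
    if any(startswith(line, p) for p in ("Rss:", "Anonymous:", "Shared_Clean:", "Private_Clean:"))
        println(line)
    end
end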

Thanks for the reply and link - unfortunately yes, I see the same issue on Linux.

I don’t believe the code snippet requires the DataFrames line to exhibit the behaviour; it’s an artifact of an earlier test that I’ll pull out and rerun.

I can understand why this would happen, in that accessing every element of the file could move it into virtual memory for ready access, but without seeing where it has ended up I don’t know how to control its size or lifetime, which causes pods to crash after a period of time without warning or predictability.
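One thing I’ve been considering (untested, and if I’ve read the Mmap stdlib docs correctly, madvise! is POSIX-only, so it may not help the Windows case) is mmapping the file myself so I keep a handle to the underlying buffer, and then hinting the kernel that the cached pages can be dropped once a batch of queries is done:

using Arrow, Mmap

# Sketch only: mmap the file manually, build the Arrow.Table on top of the
# buffer, and after a batch of work advise the kernel that the pages can be
# dropped (they are simply re-read from disk on the next access).
bytes = Mmap.mmap(raw"yourArrowFileLocation.arrow")   # same placeholder path as above
tbl = Arrow.Table(bytes)

# ... run queries against tbl ...

Mmap.madvise!(bytes, Mmap.MADV_DONTNEED)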

Note that I’m running on Julia 1.9.2, having updated all packages prior to running.

Regards,

If the number of distinct column names is unbounded, you could also be running into the fact that “Symbols don’t get GC’d” (IIUC). E.g.

while true
    Symbol(rand(UInt64))
end

will just allocate more and more memory until the process is killed by the OS, which could take a while, but will inevitably happen. I’ve run into this on some stress tests on mock-up tabular data with randomly generated column names.
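Just as an illustration of what avoids that particular failure mode (and not necessarily relevant to your case): build the Symbols once and reuse them, rather than creating new ones per iteration, e.g.

# Reusing a fixed pool of Symbols avoids unbounded growth of the symbol table,
# since Symbols are interned and never freed.
const COLNAMES = [Symbol("col_", i) for i in 1:250]

for _ in 1:1_000_000
    name = rand(COLNAMES)   # picks from already-interned Symbols; nothing new is created
end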

Thanks for the response - in all cases I have a fixed (<1000) set of columns.

Regards

@nilshg , @simsurace here is a visual of what happens over a period of time, with the pod, then process, views:

[pod metrics screenshot]

the process:

[process view screenshot]

and the process metrics:

[process metrics screenshot]

As you can see, the process RSS growth aligns with a dramatic drop in the net bytes sent on the pod (the execution slowdown). The heap size hint is set at 6GB, and the pod appears to honour this (RSS stays below it). These are all similar-style test queries, too.

As it stands, the only way to return to rapid execution is to restart the pods, which really isn’t an approach I want to take.

Regards,