Releasing process memory from Arrow.jl

Hi,

I make heavy use of the (incredible) Arrow.jl package provided by @quinnj to do API-exposed analytics running in Docker containers. The files are large: c. 65GB, 35M rows, 250 columns, and the analytics queries execute across random index subsets with combinations of random columns to filter and calculate on.

This all works fantastically, and I can run millions of calculations over a long period of time, with memory managed either through the new heap-size-hint flag or via manual, periodic GC at safepoints, until it grinds to a halt (whether running in Docker or locally).
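For context, the memory management side looks roughly like this (a simplified sketch, not my actual service code; do_work is just a stand-in for the real analytics queries, and 6GB is the hint I use in my deployment): the container entrypoint starts Julia with --heap-size-hint, and I call the GC manually at safepoints between batches.

# Simplified sketch of the pattern – Julia is started with e.g.
#   julia --heap-size-hint=6G service.jl
# and a full collection is triggered manually at a safepoint between batches.
# do_work() is just a stand-in for the real analytics queries.
do_work() = sum(rand(10_000))

for batch in 1:100
    for _ in 1:1_000
        do_work()
    end
    GC.gc()   # manual, periodic GC between batches
end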

I’m aware that Arrow reads through mmapping: the Julia process shows a very small footprint (1-2GB) in Docker/task manager, yet the overall pod/machine memory view grows without bound, eventually causing the pod or process to crash.

Running something like:

using Arrow, DataFrames, Tables

arrowfile = raw"yourArrowFileLocation.arrow"

a = Arrow.Table(arrowfile)
tblFile = DataFrames.DataFrame(a)
tblSchema = Tables.schema(a)
tblColumnNames = Tables.columnnames(a)

@show tblSchema.types

col1 = a[1]
@show length(col1)

function evaluateAllocation(a)
    for c in Tables.columns(a)
        col = @view c[1:length(c)]
        for v in 1:length(c)
            col[v]
        end
    end
end

@allocated evaluateAllocation(a)

recreates the situation (if the file is large enough), in that my Julia process shows a small footprint but my machine/pod memory is maxed out or crashing:
[task manager / pod memory screenshot]

I can’t see in the process view what is holding the memory, although I assume it must be the OS swap, and restarting the Julia process/pod releases it. However, in cloud Docker deployments it essentially causes execution times to grind to a halt until the pod crashes or, worse, it just hangs.
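The closest I can get to observing it from inside the process is logging the Julia RSS against free system memory, something like this (just a diagnostic sketch, not part of the reproduction above):

function log_memory(label)
    # peak resident set size of this Julia process vs free physical memory on the machine/pod
    rss_gb  = Sys.maxrss() / 2^30
    free_gb = Sys.free_memory() / 2^30
    println("$label: maxrss = $(round(rss_gb; digits=2)) GB, free = $(round(free_gb; digits=2)) GB")
end

log_memory("before queries")
# ... run queries here ...
log_memory("after queries")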

Is there any programmatic way to release this memory without restarting the process/pod and is this expected behaviour?

Regards

Is your Docker setup a Windows system as well? I think it’s still the case that files can’t be closed on Windows; this looks like a related issue: Underlying file still referenced even with `copycols=true` · Issue #226 · apache/arrow-julia · GitHub

Do you see the same issue on Linux?
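If you can reproduce it on Linux, it might also be worth checking whether the growth is file-backed page cache (the mmapped Arrow file) or anonymous memory (Julia’s own heap). Something like this (Linux-only, rough sketch) prints the kernel’s per-process breakdown:

# Linux-only sketch: /proc/self/smaps_rollup summarises the process's mappings,
# separating file-backed pages (the mmapped Arrow file) from anonymous memory
# (Julia's heap).
for line in eachline("/proc/self/smaps_rollup")
    if any(startswith(line, p) for p in ("Rss:", "Anonymous:", "Shared_Clean:", "Private_Clean:"))
        println(line)
    end
end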

Thanks for the reply and link - unfortunately yes, I see the same issue on Linux.

I don’t believe the code snippet requires the DataFrames line to exhibit the behaviour; it’s an artifact of an earlier test that I’ll pull out and rerun.

I can understand why this would happen, in that accessing every element of the file could move it into virtual memory for ready access, but without seeing where it has ended up I don’t know how to control its size or lifetime, which causes pods to crash after a period of time without warning or predictability.
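One thing I’ve been considering (untested, and if I’ve read the Mmap stdlib docs correctly, madvise! is POSIX-only, so it may not help the Windows case) is mmapping the file myself so I keep a handle to the underlying buffer, and then hinting the kernel that the cached pages can be dropped once a batch of queries is done:

using Arrow, Mmap

# Sketch only: mmap the file manually, build the Arrow.Table on top of the
# buffer, and after a batch of work advise the kernel that the pages can be
# dropped (they are simply re-read from disk on the next access).
bytes = Mmap.mmap(raw"yourArrowFileLocation.arrow")   # same placeholder path as above
tbl = Arrow.Table(bytes)

# ... run queries against tbl ...

Mmap.madvise!(bytes, Mmap.MADV_DONTNEED)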

Note that I’m running on Julia 1.9.2, having updated all packages prior to running.

Regards,

If the number of distinct column names is unbounded, you could also be running into the fact that “Symbols don’t get GC’d” (IIUC). E.g.

while true
    Symbol(rand(UInt64))
end

will just allocate more and more memory until the process is killed by the OS, which could take a while, but will inevitably happen. I’ve run into this on some stress tests on mock-up tabular data with randomly generated column names.
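Just as an illustration of what avoids that particular failure mode (and not necessarily relevant to your case): build the Symbols once and reuse them, rather than creating new ones per iteration, e.g.

# Reusing a fixed pool of Symbols avoids unbounded growth of the symbol table,
# since Symbols are interned and never freed.
const COLNAMES = [Symbol("col_", i) for i in 1:250]

for _ in 1:1_000_000
    name = rand(COLNAMES)   # picks from already-interned Symbols; nothing new is created
end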

Thanks for the response - in all cases I have a fixed (<1000) set of columns.

Regards

@nilshg , @simsurace here is a visual of what happens over a period of time, with the pod, then process, views:

[pod metrics screenshot]

the process:

[process view screenshot]

and the process metrics:

[process metrics screenshot]

As you can see, the process RSS growth aligns with a dramatic drop in the net bytes sent on the pod (the execution slowdown). The heap size hint is set at 6GB, and the pod appears to honour this (RSS stays below it). These are all similar-style test queries, too.

As it stands, the only way to return to rapid execution is to restart the pods, which really isn’t an approach I want to take.

Regards,