Memory build-up when loading DataFrames in a loop

isaacjeffersonlee · February 3, 2024, 5:37pm

When I sequentially load a very large DataFrame into memory and then sort it in place, memory usage builds up with each iteration until the process eventually runs out of memory.

Is there anything I can do to prevent this? Why is the garbage collector not freeing the memory on each loop iteration? My guess is that it has something to do with multi-threading.

My use case is that I want to sort some very large parquet files and then re-save them.

MWE:

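using Parquet, DataFrames

fpath = "./example.parquet"
for i in 1:100
    println(i)
    # Load the whole file into a fresh DataFrame on every iteration
    df = DataFrame(read_parquet(fpath))
    sort!(df, [:timestamp])
    # Attempt to release the data before the next iteration
    empty!(df)
    df = nothing
    GC.gc(true)  # force a full collection
end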

bkamins · February 3, 2024, 6:11pm

Is the memory freed after the loop finishes? (I mean with smaller data, so that the process does not crash.)

In general, the GC should be able to reclaim memory in the loop you presented (even without being called explicitly).

isaacjeffersonlee · February 3, 2024, 6:24pm

No, it’s not freed. If I run the script from a REPL for only 5 iterations, the process initially uses 5.2 GiB of memory; once the for loop ends and control returns to the REPL, the process is using 17.7 GiB of memory.

I’m using Julia 1.10.0, if that helps.
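
Readings like these can also be taken from inside Julia. Below is a minimal sketch, not the script used above: `report_memory` is a hypothetical helper name, the throwaway array merely stands in for the DataFrame, and `Base.gc_live_bytes` is an unexported function whose behaviour may differ between Julia versions. `Sys.maxrss` reports the peak resident set size, so it never decreases.

```julia
# Hypothetical helper (not from the original post): print and return what the
# GC considers live and the peak resident set size of the process, in bytes.
function report_memory(label)
    live = Base.gc_live_bytes()  # bytes currently tracked as live by the GC
    peak = Sys.maxrss()          # peak resident set size of this process
    println(label, ": live = ", round(live / 2^30; digits = 2), " GiB, ",
            "peak RSS = ", round(peak / 2^30; digits = 2), " GiB")
    return (live = live, peak = peak)
end

before = report_memory("before")

x = rand(10^7)   # stand-in allocation of roughly 80 MB (assumption, not the real data)
x = nothing      # drop the only reference
GC.gc(true)      # force a full collection

after = report_memory("after")
```

Calling the helper once per loop iteration would show whether the live bytes stay flat while the resident size of the process keeps growing, or whether both grow together.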
