DataFrame type instability - Unnecessary memory allocation

Eric_Chen · March 28, 2019, 3:52pm

I am running Julia 1.1. Here is my code:

using BenchmarkTools
using DataFrames

df = DataFrame(rand(10_000_000, 10));
Threads.nthreads()

function singlethread(df)
    N = nrow(df)
    M = ncol(df)

    cumsum = 0.0
    for i in 1:N
        for j in 1:M
            cumsum += df[j][i]
        end
    end
    return cumsum
end

function multithread(df)
    N = nrow(df)
    M = ncol(df)

    cumsum = Threads.Atomic{Float64}(0.0)
    Threads.@threads for j in 1:M
        for i in 1:N
            Threads.atomic_add!(cumsum, df[j][i])
        end
    end
    return cumsum[]
end

@btime singlethread(df)
@btime multithread(df)

My Result:

Threads.nthreads()
4

@btime singlethread(df)
3.898 s (299994891 allocations: 4.47 GiB)

@btime multithread(df)
5.845 s (143212953 allocations: 2.10 GiB)

There are unnecessary memory allocations. Is that because I used “df[j][i]”?
If my function has to read the df as df[j][i] or df[i, j], e.g. it has to read the data element by element (row by row, column by column) to perform some operations, how can I avoid the type instability?

Below, I convert everything to a 2D array, and then I get almost zero memory allocations, which is good!

using BenchmarkTools
arr = rand(10_000_000, 10);
Threads.nthreads()

function singlethread(arr)
    N, M = size(arr)

    cumsum = 0.0
    for i in 1:N
        for j in 1:M
            cumsum += arr[i, j]
        end
    end
    return cumsum
end

function multithread(arr)
    N, M = size(arr)

    cumsum = Threads.Atomic{Float64}(0.0)
    Threads.@threads for j in 1:M
        for i in 1:N
            Threads.atomic_add!(cumsum, arr[i, j])
        end
    end
    return cumsum[]
end

@btime singlethread(arr)
@btime multithread(arr)

nalimilan · March 28, 2019, 5:27pm

The solution is to pass the column vectors to the function as a tuple or named tuple (e.g. singlethread(tuple(eachcol(df, false)...))), so that the function is specialized on the particular column types. If all columns have the same type, you can instead pass a vector of columns, which avoids specializing on the number of columns (but be careful that the vector is concretely typed).
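
A minimal sketch of both options, assuming every column is a Vector{Float64} as in the question. The names sum_columns, untyped_cols, col_tuple and typed_cols are hypothetical, and the AbstractVector[...] container is only a stand-in for a DataFrame (whose column types are not visible to the compiler), so the sketch runs without DataFrames:

```julia
# Stand-in for a DataFrame's columns: the container's element type is abstract,
# so the compiler cannot infer the type of the values read from a column.
# (Assumption: all columns are Vector{Float64}, as in the question.)
untyped_cols = AbstractVector[rand(1_000) for _ in 1:10]

# Hypothetical kernel. Julia compiles a specialized method for each concrete
# type of `cols` it is called with, so the caller decides how much is inferable.
function sum_columns(cols)
    total = 0.0
    for col in cols
        for x in col
            total += x
        end
    end
    return total
end

# Option 1: a tuple of columns. Each column's type is part of the tuple type.
col_tuple = Tuple(untyped_cols)

# Option 2: a concretely typed vector of columns.
# Only possible when all columns share one type.
typed_cols = Vector{Float64}[c for c in untyped_cols]

sum_columns(untyped_cols)  # column type hidden: dynamic dispatch inside the loop
sum_columns(col_tuple)     # column types known from the tuple type
sum_columns(typed_cols)    # column type known from the vector's element type
```

Timing the three calls with @btime (interpolating the argument, e.g. @btime sum_columns($col_tuple)) should show whether the allocations go away for the last two; with an actual DataFrame, the tuple would be built from its columns as in the reply above.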
