DataFrame type instability - Unnecessary memory allocation

#1

I am running Julia 1.1. Here is my code:

using BenchmarkTools
using DataFrames

df = DataFrame(rand(10_000_000, 10));
Threads.nthreads()

function singlethread(df)
    N = nrow(df)
    M = ncol(df)

    cumsum = 0.0
    for i in 1:N
        for j in 1:M
            cumsum += df[j][i]
        end
    end
    return cumsum
end

function multithread(df)
    N = nrow(df)
    M = ncol(df)

    cumsum = Threads.Atomic{Float64}(0.0)
    Threads.@threads for j in 1:M
        for i in 1:N
            Threads.atomic_add!(cumsum, df[j][i])
        end
    end
    return cumsum[]
end

@btime singlethread(df)
@btime multithread(df)

My Result:

Threads.nthreads()
4

@btime singlethread(df)
3.898 s (299994891 allocations: 4.47 GiB)

@btime multithread(df)
5.845 s (143212953 allocations: 2.10 GiB)

There are unnecessary memory allocations. Is that because I index with “df[j][i]”?
Suppose my function has to read the df as df[j][i] or df[i, j],
i.e. it has to read the data element by element, row by row or column by column, to perform some operation. How can I avoid the type instability in that case?

Below, I convert everything to a 2D array. With that version I get almost zero memory allocations, which is good!

using BenchmarkTools
arr = rand(10_000_000, 10);
Threads.nthreads()

function singlethread(arr)
    N, M = size(arr)


    cumsum = 0.0
    for i in 1:N
        for j in 1:M
            cumsum += arr[i, j]
        end
    end
    return cumsum
end

function multithread(arr)
    N, M = size(arr)

    cumsum = Threads.Atomic{Float64}(0.0)
    Threads.@threads for j in 1:M
        for i in 1:N
            Threads.atomic_add!(cumsum, arr[i, j])
        end
    end
    return cumsum[]
end

@btime singlethread(arr)
@btime multithread(arr)

#2

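The solution is to pass the column vectors to the function as a tuple or named tuple, e.g. singlethread(tuple(eachcol(df, false)...)), so that the function is specialized on the concrete column types. A DataFrame does not encode its column types in its own type, so inside the loop the compiler cannot infer the type of df[j]; each element access is then dispatched dynamically and its result is boxed, which is where the allocations come from. If all columns have the same type, you can also pass a vector of columns to avoid specializing on the number of columns (but make sure the vector is concretely typed, e.g. Vector{Vector{Float64}}).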
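
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the thread: the names sum_column and sum_columns are hypothetical, and plain vectors stand in for the DataFrame columns so that the snippet runs without DataFrames. With a real DataFrame, the tuple would come from tuple(eachcol(df, false)...) as written above; on newer DataFrames versions the call is probably Tuple(eachcol(df)), but that is an assumption to check against your installed version.

```julia
# Inner kernel (hypothetical name): Julia compiles one specialized method
# instance per concrete column type, so the loop body is type-stable.
function sum_column(col::AbstractVector)
    total = 0.0
    @inbounds for i in eachindex(col)
        total += col[i]
    end
    return total
end

# The columns arrive as a tuple, so each column's type is part of the
# argument type and the compiler knows it when specializing.
sum_columns(cols::Tuple) = sum(sum_column, cols)

# Alternative when all columns share one type: a concretely typed vector
# of columns, which avoids specializing on the number of columns.
sum_columns(cols::Vector{Vector{Float64}}) = sum(sum_column, cols)

# Stand-in for the DataFrame columns (assumption: three Float64 columns).
cols = (rand(1_000), rand(1_000), rand(1_000))
sum_columns(cols)
```

Splitting the work into an outer function and an inner kernel is the usual function-barrier pattern: even if the caller does not know the column types, the kernel is dispatched once per column rather than once per element.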