How to permute the rows of a DataFrame in-place efficiently?

I have a vector of row numbers and I want to use it to permute a DataFrame’s rows. Here is a minimal example:

using DataFrames, StatsBase
df = DataFrame(a = rand(1_000_000))
r = sample(1:size(df, 1), size(df, 1), replace=false)
@time df = df[r, :]

I think the above creates a new DataFrame and then assigns it to df. Is there a way to re-assign the rows in place so that minimal extra memory is allocated?

What exactly are you trying to accomplish? Usually for group operations a groupby à la split and apply works pretty well and handles memory quite well.

As I have shown, I want to randomize all the rows as efficiently as I can.

using StatsBase, DataFrames, Random  # shuffle lives in Random on Julia 0.7 and later
df = DataFrame(a = rand(1_000_000))
@time df = df[sample(1:size(df, 1), size(df, 1), replace=false), :];
@time df = df[shuffle(1:size(df, 1)), :];

shuffle is about three times faster and uses about two-thirds less memory (it also drops the StatsBase dependency).

Is this line the best possible? Does it not create a new DataFrame?

Anyway, I have an idea now.

You can simply iterate over columns and call permute! on them.

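using DataFrames, Random
Random.seed!(0)  # srand(0) on Julia versions before 0.7
df = DataFrame(a = rand(Int(1e6)));
# Permute each column in place, using the same random permutation for all columns.
function permute_df_bycol!(df::AbstractDataFrame)
    p = shuffle(1:size(df, 1))
    for col in eachcol(df)
        permute!(col, p)
    end
    return
end
# Index with a random permutation, then copy the result back into df's existing columns.
function permute_df!(df::AbstractDataFrame)
    df[:, :] = df[shuffle(1:size(df, 1)), :]
    return
end
@time permute_df_bycol!(df) # 0.064590 seconds (10 allocations: 15.259 MiB)
@time permute_df!(df) # 0.021940 seconds (36 allocations: 15.261 MiB)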

That was my idea too. But it’s slower? There is still potential to apply it to all columns using threads, though.
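A minimal sketch of the threaded idea, assuming a recent DataFrames.jl and a Julia session started with more than one thread. The name permute_df_threaded! is made up for illustration; it is not part of any library, and I have not benchmarked it against the versions above.

```julia
using DataFrames, Random

# Hypothetical helper: permute every column of df in place, one column per loop iteration.
# Iterations are distributed over the available threads; p is only read, so it can be shared.
function permute_df_threaded!(df::DataFrame, p::AbstractVector{<:Integer})
    Threads.@threads for i in 1:ncol(df)
        # df[!, i] returns the stored column itself (no copy), so permute! mutates df
        permute!(df[!, i], p)
    end
    return df
end

df = DataFrame(a = rand(1_000_000), b = rand(1_000_000))
p = shuffle(1:nrow(df))
permute_df_threaded!(df, p)
```

The number of threads is fixed at startup (for example `julia -t 4` or the JULIA_NUM_THREADS environment variable). With a single column there is nothing to parallelize, so any gain would only show up on wider tables.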

Maybe it’s faster for multiple columns

You could try it, but the more columns you have, the more work it would have to do. Rearranging every row of the whole DataFrame should be faster, as it only has to apply the permutation once and it writes in place. Column-wise permutation might be a last resort if memory is not enough.

I don’t think “apply the permutation only once” is correct here. Under the hood, indexing a data frame implies indexing repeatedly each column.
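To make that point concrete, here is a small sketch showing that row indexing gives the same result as indexing each column with the same permutation. It is only a conceptual equivalent, not the actual DataFrames.jl implementation, and the column names are made up.

```julia
using DataFrames, Random

df = DataFrame(a = rand(5), b = rand(5))
p = shuffle(1:nrow(df))

# Index each column separately with the same permutation ...
manual = DataFrame(a = df.a[p], b = df.b[p])

# ... and compare with indexing the rows of the data frame directly.
rowindexed = df[p, :]
@assert manual == rowindexed
```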

I suspect the permute!-based approach is slower simply because permuting a vector in place is slower than indexing it (i.e. allocating a new copy). The overhead related to the DataFrame itself should be negligible here.
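One way to check this suspicion without any DataFrame involved is to time both operations on a plain vector. This is only a sketch: the timings depend on the machine and the Julia version, so no numbers are shown, and each operation is run once beforehand so that compilation is not included.

```julia
using Random

v = rand(1_000_000)
p = shuffle(1:length(v))

# Warm-up calls so that @time below does not measure compilation.
w = v[p]                 # indexing: allocates a new vector with the permuted values
u = permute!(copy(v), p) # in-place permutation of a copy of v

@time v[p];
u2 = copy(v)
@time permute!(u2, p);

# Both approaches produce the same ordering.
@assert u == w
@assert u2 == w
```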

Why not use a view?

Say the computation I want to perform on the permuted DataFrame would be faster if all the columns are actually permuted as well. This is for “cache efficiency”, as the next part of the program requires me to go through the vector several times in order.
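For reference, a small sketch of the difference between the two options, assuming a recent DataFrames.jl. A view copies no data but looks each row up through the permutation, whereas indexing materializes new columns that are stored in the permuted order.

```julia
using DataFrames, Random

df = DataFrame(a = rand(1_000_000))
p = shuffle(1:nrow(df))

dfv = view(df, p, :)  # SubDataFrame: no column data copied, rows are accessed through p
dfc = df[p, :]        # new DataFrame: columns are copied into permuted order

@assert dfv.a == dfc.a            # same values in the same order
@assert dfv.a isa SubArray        # the view's column still points into df.a
@assert dfc.a isa Vector{Float64} # the copy owns a contiguous vector
```

Iterating over dfv.a in order jumps around df.a in memory, while iterating over dfc.a reads memory sequentially, which is the cache behaviour described above.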