How to permute the rows of a DataFrame in-place efficiently?

xiaodai · February 5, 2018, 2:16am

I have a vector of row numbers and I want to use it to permute a DataFrame’s rows. Here is an MWE:

using DataFrames, StatsBase
df = DataFrame(a = rand(1_000_000))
r = sample(1:size(df, 1), size(df, 1), replace=false)
@time df = df[r, :]

I think the above creates a new DataFrame and then assigns it to df. Is there a way to reorder the rows in place so that minimal extra memory is allocated?

Nosferican · February 5, 2018, 4:37am

What exactly are you trying to accomplish? Usually for group operations a groupby à la split and apply works pretty well and handles memory quite well.

xiaodai · February 5, 2018, 5:36am

As I have shown, I want to randomize all the rows as efficiently as I can.

Nosferican · February 5, 2018, 5:48am

using StatsBase, DataFrames
using Random  # shuffle lives in the Random stdlib on Julia 0.7 and later
df = DataFrame(a = rand(1_000_000))
@time df = df[sample(1:size(df,1), size(df,1), replace=false),:];
@time df = df[shuffle(1:size(df, 1)),:];

shuffle is about three times faster and uses about 2/3 less memory (it also drops the StatsBase dependency).

xiaodai · February 5, 2018, 5:57am

Is this line the best possible? Doesn't it create a new DataFrame?

Anyway, I've got an idea now.

nalimilan · February 5, 2018, 9:47am

You can simply iterate over columns and call permute! on them.

Nosferican · February 5, 2018, 7:31pm

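using DataFrames
using Random  # shuffle and seed! live in the Random stdlib on Julia 0.7 and later
Random.seed!(0)  # was srand(0) before Julia 0.7
df = DataFrame(a = rand(Int(1e6)));
# Permute every column in place with the same random permutation.
function permute_df_bycol!(df::AbstractDataFrame)
    p = shuffle(1:size(df, 1))
    for col ∈ eachcol(df)  # current DataFrames.jl yields the column vectors directly
        permute!(col, p)
    end
    return
end
# Build a permuted copy, then write it back into the existing columns.
function permute_df!(df::AbstractDataFrame)
    df[:,:] = df[shuffle(1:size(df, 1)),:]
    return
end
@time permute_df_bycol!(df) # 0.064590 seconds (10 allocations: 15.259 MiB)
@time permute_df!(df) # 0.021940 seconds (36 allocations: 15.261 MiB)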

xiaodai · February 5, 2018, 7:35pm

That was my idea too. But it’s slower? There is potential to apply it to all columns using threads, though.

Maybe it’s faster for multiple columns
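A rough sketch of what the threaded version could look like, assuming Julia 1.x started with several threads (e.g. `julia -t 4`) and a current DataFrames.jl. The name permute_df_threaded! is made up for illustration, and whether it actually beats the serial loop is not something measured in this thread:

```julia
using DataFrames, Random

# Hypothetical helper (not from the thread): permute each column in place,
# handing one column to each loop iteration of a threaded loop.
function permute_df_threaded!(df::DataFrame, p::AbstractVector{<:Integer})
    cols = collect(eachcol(df))      # the column vectors themselves, not copies
    Threads.@threads for i in eachindex(cols)
        permute!(cols[i], p)         # p is only read, so sharing it is safe
    end
    return df
end

df = DataFrame(a = rand(1_000), b = rand(1_000), c = collect(1:1_000))
p = shuffle(1:nrow(df))
permute_df_threaded!(df, p)
```

Each column is an independent vector, so the iterations do not write to shared data. With a single column, as in the MWE, there is nothing to parallelize.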

Nosferican · February 6, 2018, 11:45am

You could try it, but the more columns you have, the more work it would have to do. Rearranging every row for the whole DataFrame should be faster, as it only has to apply the permutation once and it writes in place. Column-wise might be a last resort if there is not enough memory.

nalimilan · February 6, 2018, 12:23pm

I don’t think “apply the permutation only once” is correct here. Under the hood, indexing a data frame means indexing each column in turn.

I suspect the permute!-based reordering of the DataFrame is slower just because permuting a vector in place is slower than indexing it (i.e. allocating a new copy). The overhead related to DataFrame should be negligible here.
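One way to check this suspicion without any DataFrame involved is to time both operations on a plain vector. This is only a sketch, assuming Julia 1.x; the variable names are arbitrary and the timings will vary with machine and Julia version:

```julia
using Random

v = rand(1_000_000)
p = shuffle(1:length(v))

# Warm up both code paths so compilation is not included in the timings.
permute!(copy(v), p); v[p];

w = copy(v)
t_permute = @elapsed permute!(w, p)   # reorders w in place
t_index   = @elapsed (u = v[p])       # allocates a new vector

@assert w == u                        # both give the same ordering
println("permute!: ", t_permute, " s, indexing: ", t_index, " s")
```

For more reliable numbers, a benchmarking package that repeats the measurement would be a better choice than a single @elapsed.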

mkborregaard · February 6, 2018, 1:13pm

Why not use a view?
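A minimal sketch of the view approach, assuming a current DataFrames.jl on Julia 1.x (dv and p are just illustrative names):

```julia
using DataFrames, Random

df = DataFrame(a = rand(1_000_000))
p = shuffle(1:nrow(df))

# A SubDataFrame that presents the rows in permuted order without copying
# the column data; only the index vector p is kept alongside the parent.
dv = view(df, p, :)

dv.a[1] == df.a[p[1]]   # row 1 of the view is row p[1] of the parent
```

The view avoids allocating new columns, but reading it in order still jumps around the parent's memory, so it does not give contiguous access to the permuted data.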

xiaodai · February 20, 2018, 11:29pm

Say the computation I want to perform with the permuted DataFrame would be faster if all the columns are physically permuted as well. This is for “cache-efficiency”, as the next part of the program requires me to go through the vector several times in order.
