Performance of select! on a DataFrame

CameronBieganek · September 17, 2021, 4:02pm

Naively I would expect that dropping columns (in place) from a DataFrame with select! would be a very cheap operation, on the order of nanoseconds. But I’m getting times in the microsecond range:

julia> using DataFrames, BenchmarkTools

julia> @benchmark select!(df, :a) evals=1 setup=(df = DataFrame(a=rand(10_000), b = rand(10_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  1.431 μs …   8.846 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.965 μs ± 272.973 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▂▄▆▆▆█▆▆▆▅▄▃▂▂▁                               
  ▂▁▂▂▂▂▂▂▂▂▃▄▅▆██████████████████▇▇▆▆▆▅▅▄▄▃▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▄
  1.43 μs         Histogram: frequency by time        2.65 μs <

 Memory estimate: 1.78 KiB, allocs estimate: 24.

Am I doing something wrong here?

bkamins · September 17, 2021, 4:08pm

You are not doing anything wrong. As currently implemented, select! creates a new DataFrame and then overwrites the old one with it.
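To make the "create a new object, then overwrite the old one" pattern concrete, here is a minimal sketch using a toy type. ToyFrame and toy_select! are hypothetical names made up for illustration; this is not the DataFrames.jl source, only the general shape of the approach, and it shows why allocations appear even when columns are merely dropped.

```julia
# Toy stand-in for a table. NOT the DataFrames.jl implementation.
mutable struct ToyFrame
    names::Vector{Symbol}
    columns::Vector{Vector{Float64}}
end

# "In-place" selection done by building a fresh ToyFrame first and then
# overwriting the fields of the original object with the new ones.
function toy_select!(tf::ToyFrame, keep::Vector{Symbol})
    idx = [findfirst(==(name), tf.names) for name in keep]
    any(isnothing, idx) && throw(ArgumentError("unknown column requested"))
    # New container objects are allocated here; the column data itself is reused.
    selected = ToyFrame(copy(keep), Vector{Float64}[tf.columns[i] for i in idx])
    tf.names = selected.names
    tf.columns = selected.columns
    return tf
end

tf = ToyFrame([:a, :b], [rand(3), rand(3)])
toy_select!(tf, [:a])
@assert tf.names == [:a]
```

The column vectors are not copied in this sketch, so the cost is the bookkeeping for the new container, which is small but not free.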

What you ask for is doable as an optimization (i.e. we would need to check that the user is only doing simple column selection), but microseconds were considered fast enough. Do you have a specific use case where you would need this?
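A rough sketch of what such a fast path could look like, on plain vectors rather than a real DataFrame. Both is_simple_selection and keep_columns! are hypothetical helpers invented for this example, and the "simple" test used here (every argument is a plain column name) is an assumption, not the check DataFrames.jl would necessarily use.

```julia
# Hypothetical check: a selection is "simple" when every argument is a plain column name.
is_simple_selection(args...) = all(a -> a isa Union{Symbol,AbstractString}, args)

# Toy fast path: drop everything not listed in `keep`, mutating both vectors in place.
# The surviving columns keep their original order; they are not reordered to match `keep`.
function keep_columns!(colnames::Vector{Symbol}, columns::Vector, keep::Symbol...)
    drop = findall(n -> n ∉ keep, colnames)  # sorted indices of the columns to remove
    deleteat!(colnames, drop)
    deleteat!(columns, drop)
    return colnames, columns
end

colnames = [:a, :b, :c]
columns = [rand(3), rand(3), rand(3)]
if is_simple_selection(:a, :c)
    keep_columns!(colnames, columns, :a, :c)
end
@assert colnames == [:a, :c]
```

A regex or any other multi-column selector fails the check, so a real implementation would fall back to the general code path in that case.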

Note, though, that even with such an optimization, non-trivial selections should not be expected to be fast, e.g. select!(df, r"x", Not(:y), Between(:y, :z)) is quite complex to process and will take longer than nanoseconds.
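To see what each of those selectors has to resolve, the two-argument form of names can be used on a small example. The data frame below and its column names (x1, w, y, z) are made up for illustration.

```julia
using DataFrames

# Hypothetical example data; only the column names matter here.
df = DataFrame(x1 = 1:3, w = 4:6, y = 7:9, z = 10:12)

by_regex   = names(df, r"x")             # columns whose name matches the regex
by_not     = names(df, Not(:y))          # every column except :y
by_between = names(df, Between(:y, :z))  # the contiguous block from :y through :z

println(by_regex)
println(by_not)
println(by_between)
```

Each selector has to be matched against the full list of column names before anything can be selected, which is part of why such calls cost more than a plain column lookup.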

CameronBieganek · September 17, 2021, 4:12pm

That’s true, microseconds are probably fast enough. I was just experimenting in the REPL. I don’t have a real use case where selecting columns is a performance-critical operation.
