Performance of select! on a DataFrame

Naively, I would expect dropping columns in place from a DataFrame with select! to be a very cheap operation, on the order of nanoseconds. But I’m getting times in the microsecond range:

julia> using DataFrames, BenchmarkTools

julia> @benchmark select!(df, :a) evals=1 setup=(df = DataFrame(a=rand(10_000), b = rand(10_000)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  1.431 μs …   8.846 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.916 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.965 μs ± 272.973 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▂▄▆▆▆█▆▆▆▅▄▃▂▂▁                               
  ▂▁▂▂▂▂▂▂▂▂▃▄▅▆██████████████████▇▇▆▆▆▅▅▄▄▃▄▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▄
  1.43 μs         Histogram: frequency by time        2.65 μs <

 Memory estimate: 1.78 KiB, allocs estimate: 24.

Am I doing something wrong here?

You are not doing anything wrong. As currently implemented, select! creates a new DataFrame and then overwrites the old one with it.
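To make the two steps more concrete, here is a minimal sketch using only the public API. It assumes the intermediate DataFrame is built without copying the column vectors (as `select` does with `copycols=false`); the internal step that swaps the new columns into `df` is not shown, and the variable names `col_a` and `tmp` are just for illustration.

```julia
using DataFrames

df = DataFrame(a = rand(10_000), b = rand(10_000))
col_a = df.a  # keep a reference to the original column vector

# Step 1 (conceptually): build a new DataFrame with only the requested
# columns, reusing the existing vectors instead of copying them.
tmp = select(df, :a; copycols=false)
same_in_tmp = tmp.a === col_a

# Step 2: select! then replaces the columns of df with the selected ones.
select!(df, :a)
same_after = df.a === col_a

println((same_in_tmp, same_after, names(df)))
```

If this assumption holds, no column data is copied, so the cost comes from the bookkeeping of constructing the intermediate DataFrame and its index, not from the size of the columns.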

What you ask for is doable as an optimization (we would need to check that the user requests only simple column selection), but microseconds were considered fast enough. Do you have a specific use case where you would need this?

Note, though, that non-trivial selections should not be expected to be that fast in any case: e.g. select!(df, r"x", Not(:y), Between(:y, :z)) is quite complex to process and will take longer than nanoseconds.
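As a rough illustration of why such a call needs more work, each selector has to be resolved against the column names before the result can be assembled. The sketch below uses a small made-up DataFrame (the column names `x1`, `x2`, `y`, `w`, `z` are assumptions chosen for the example) and `names` to show what each selector matches on its own.

```julia
using DataFrames

# Hypothetical example table; only the column names matter here.
df = DataFrame(x1 = 1:3, x2 = 4:6, y = 7:9, w = 10:12, z = 13:15)

# Each selector is resolved separately against the column names.
by_regex   = names(df, r"x")               # columns whose name contains "x"
by_not     = names(df, Not(:y))            # every column except :y
by_between = names(df, Between(:y, :z))    # columns from :y through :z

# The combined selection has to merge these overlapping matches.
out = select(df, r"x", Not(:y), Between(:y, :z))

println(by_regex)
println(by_not)
println(by_between)
println(names(out))
```

Here the three selectors overlap, so the overlapping matches have to be reconciled on top of resolving the regex and the ranges, which is more work than a plain `select!(df, :a)`.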


That’s true, microseconds are probably fast enough. I was just experimenting in the REPL; I don’t have a real use case where selecting columns is performance-critical.