DataFrame transformation is so slow, what am I doing wrong?

I want to do a transformation which may depend on a large number of columns. The MWE and benchmark code:

using DataFrames

df = DataFrame(rand(10, 50), :auto)
cols = propertynames(df)

# due to the design of my actual code, this function can only accept a vector.
f(x) = sum(x)

for i in 1:length(cols)
    @time select(df, cols[1:i] => ByRow((x...) -> f([x...])))
end

which outputs

0.047720 seconds (106.18 k allocations: 7.179 MiB, 98.71% compilation time)
  0.053221 seconds (160.25 k allocations: 10.859 MiB, 99.49% compilation time)
  0.075922 seconds (103.81 k allocations: 7.002 MiB, 69.57% gc time, 99.78% compilation time)
  0.058375 seconds (175.07 k allocations: 11.871 MiB, 99.52% compilation time)
  0.028908 seconds (127.15 k allocations: 8.693 MiB, 99.41% compilation time)
  0.031616 seconds (136.75 k allocations: 9.333 MiB, 99.47% compilation time)
  0.034097 seconds (147.99 k allocations: 10.086 MiB, 99.53% compilation time)
  0.044063 seconds (159.05 k allocations: 10.874 MiB, 9.65% gc time, 99.60% compilation time)
  0.038997 seconds (170.84 k allocations: 11.680 MiB, 99.47% compilation time)
  0.041796 seconds (182.76 k allocations: 12.447 MiB, 99.56% compilation time)
  0.116322 seconds (268.54 k allocations: 18.050 MiB, 3.57% gc time, 99.79% compilation time)
  0.114697 seconds (260.76 k allocations: 17.541 MiB, 99.78% compilation time)
  0.126955 seconds (279.34 k allocations: 18.755 MiB, 99.81% compilation time)
  0.146084 seconds (300.26 k allocations: 20.103 MiB, 2.70% gc time, 99.83% compilation time)
  0.160667 seconds (324.04 k allocations: 21.459 MiB, 99.83% compilation time)
  0.178767 seconds (348.90 k allocations: 22.756 MiB, 2.17% gc time, 99.84% compilation time)
  0.194160 seconds (374.26 k allocations: 24.147 MiB, 99.86% compilation time)
  0.228985 seconds (394.74 k allocations: 25.180 MiB, 1.55% gc time, 99.86% compilation time)
  0.242908 seconds (421.90 k allocations: 26.493 MiB, 1.63% gc time, 99.87% compilation time)
  0.265827 seconds (451.13 k allocations: 27.959 MiB, 99.88% compilation time)
  0.298992 seconds (480.59 k allocations: 29.472 MiB, 1.27% gc time, 99.90% compilation time)
  0.339485 seconds (511.92 k allocations: 31.351 MiB, 1.21% gc time, 99.90% compilation time)
  0.372290 seconds (547.37 k allocations: 33.242 MiB, 99.91% compilation time)
  0.413722 seconds (586.02 k allocations: 34.857 MiB, 0.83% gc time, 99.92% compilation time)
  0.460433 seconds (625.47 k allocations: 36.655 MiB, 0.79% gc time, 99.92% compilation time)
  0.505811 seconds (666.13 k allocations: 38.678 MiB, 0.71% gc time, 99.93% compilation time)
  0.568882 seconds (704.31 k allocations: 40.326 MiB, 0.70% gc time, 99.94% compilation time)
  0.624337 seconds (741.10 k allocations: 41.929 MiB, 0.63% gc time, 99.94% compilation time)
  0.705504 seconds (787.13 k allocations: 43.825 MiB, 0.59% gc time, 99.95% compilation time)
  0.768546 seconds (833.44 k allocations: 45.764 MiB, 0.53% gc time, 99.95% compilation time)
  0.854407 seconds (880.88 k allocations: 47.704 MiB, 0.50% gc time, 99.95% compilation time)
  0.294726 seconds (943.01 k allocations: 65.202 MiB, 1.37% gc time, 99.84% compilation time)
  0.256508 seconds (317.29 k allocations: 19.898 MiB, 1.39% gc time, 99.79% compilation time)
  0.220710 seconds (302.03 k allocations: 18.543 MiB, 99.74% compilation time)
  0.213203 seconds (301.65 k allocations: 18.546 MiB, 1.74% gc time, 99.70% compilation time)
  0.209549 seconds (304.92 k allocations: 18.678 MiB, 99.64% compilation time)
  0.210791 seconds (311.05 k allocations: 18.899 MiB, 1.74% gc time, 99.60% compilation time)
  0.216300 seconds (318.56 k allocations: 19.257 MiB, 99.55% compilation time)
  0.219012 seconds (322.26 k allocations: 19.498 MiB, 1.61% gc time, 99.54% compilation time)
  0.215500 seconds (325.18 k allocations: 19.426 MiB, 99.51% compilation time)
  0.216434 seconds (331.18 k allocations: 19.765 MiB, 99.48% compilation time)
  0.224192 seconds (338.62 k allocations: 20.162 MiB, 1.57% gc time, 99.47% compilation time)
  0.222419 seconds (341.59 k allocations: 20.252 MiB, 99.41% compilation time)
  0.233205 seconds (344.93 k allocations: 20.258 MiB, 1.50% gc time, 99.40% compilation time)
  0.239102 seconds (348.26 k allocations: 20.308 MiB, 99.33% compilation time)
  0.235739 seconds (355.91 k allocations: 20.656 MiB, 99.25% compilation time)
  0.243609 seconds (362.59 k allocations: 21.027 MiB, 99.27% compilation time)
  0.250321 seconds (368.78 k allocations: 21.378 MiB, 1.79% gc time, 99.26% compilation time)
  0.242530 seconds (375.07 k allocations: 21.664 MiB, 99.24% compilation time)
  0.252716 seconds (382.77 k allocations: 21.949 MiB, 1.48% gc time, 99.21% compilation time)

As can be seen, most of the time is compilation time, and the situation gets worse as the number of dependent columns increases, up to a point (i = 31); for i > 31 the running time drops sharply. I don’t quite understand what is going on. If I run the same for loop twice, the situation is the same.

How can I improve the speed of this kind of operation? I have to run such a select a million times.

I don’t know why it recompiles every time when you use globals, but it’s much better if you avoid globals, e.g. with a function barrier:

g(df, cols) = for i in 1:length(cols)
    select(df, cols[1:i] => ByRow((x...) -> f([x...])))
end;

julia> @time g(df, cols)
 18.581530 seconds (19.28 M allocations: 1.144 GiB, 2.33% gc time, 99.76% compilation time)

julia> @time g(df, cols)
  0.059227 seconds (295.76 k allocations: 13.219 MiB, 8.89% compilation time)

If you don’t need your result to be a DataFrame, it might be better not to use select and to iterate over eachrow(df) instead. It works far faster on my machine, and only compiles once.

Here are the test code and results.

using DataFrames

df=DataFrame(rand(10, 50), :auto)
cols=propertynames(df)
f(x)=sum(x)

julia> for i in 1:length(cols)
    @time begin
        result=select(df, cols[1:i]=>ByRow((x...)->f([x...])))
        result
    end
end
0.023659 seconds (42.74 k allocations: 2.861 MiB, 99.18% compilation time)
  0.028282 seconds (35.62 k allocations: 2.455 MiB, 99.03% compilation time)
  0.023136 seconds (28.56 k allocations: 1.933 MiB, 98.88% compilation time)
  0.021600 seconds (32.88 k allocations: 2.242 MiB, 98.79% compilation time)
  0.024380 seconds (37.42 k allocations: 2.557 MiB, 98.86% compilation time)
  0.027877 seconds (42.34 k allocations: 2.859 MiB, 98.93% compilation time)
  0.038511 seconds (47.60 k allocations: 3.205 MiB, 19.85% gc time, 99.31% compilation time)
  0.033422 seconds (53.19 k allocations: 3.607 MiB, 99.08% compilation time)
  0.036855 seconds (59.23 k allocations: 3.977 MiB, 99.08% compilation time)
  0.040105 seconds (65.44 k allocations: 4.356 MiB, 99.35% compilation time)
  0.102352 seconds (105.81 k allocations: 6.948 MiB, 99.72% compilation time)
  0.113956 seconds (116.47 k allocations: 7.651 MiB, 99.69% compilation time)
  0.126971 seconds (127.96 k allocations: 8.284 MiB, 99.74% compilation time)
  0.136597 seconds (141.59 k allocations: 9.217 MiB, 4.14% gc time, 99.74% compilation time)
  0.145650 seconds (157.22 k allocations: 10.085 MiB, 99.76% compilation time)
  0.170999 seconds (173.57 k allocations: 10.835 MiB, 99.81% compilation time)
  0.187760 seconds (190.01 k allocations: 11.694 MiB, 99.79% compilation time)
  0.215324 seconds (201.96 k allocations: 12.227 MiB, 2.71% gc time, 99.84% compilation time)
  0.232907 seconds (220.67 k allocations: 13.116 MiB, 99.85% compilation time)
  0.254349 seconds (239.03 k allocations: 14.016 MiB, 99.86% compilation time)
  0.283096 seconds (257.68 k allocations: 14.884 MiB, 99.87% compilation time)
  0.320514 seconds (277.74 k allocations: 16.020 MiB, 99.88% compilation time)
  0.349272 seconds (300.87 k allocations: 17.306 MiB, 99.89% compilation time)
  0.399058 seconds (325.92 k allocations: 18.376 MiB, 1.95% gc time, 99.91% compilation time)
  0.430834 seconds (351.40 k allocations: 19.413 MiB, 99.91% compilation time)
  0.487500 seconds (377.72 k allocations: 20.537 MiB, 1.19% gc time, 99.92% compilation time)
  0.532853 seconds (404.69 k allocations: 21.626 MiB, 99.92% compilation time)
  0.596792 seconds (431.90 k allocations: 22.800 MiB, 0.92% gc time, 99.93% compilation time)
  0.657454 seconds (459.75 k allocations: 23.935 MiB, 99.94% compilation time)
  0.747010 seconds (488.22 k allocations: 25.091 MiB, 0.78% gc time, 99.94% compilation time)
  0.810588 seconds (517.37 k allocations: 26.301 MiB, 99.95% compilation time)
  0.052146 seconds (50.33 k allocations: 3.397 MiB, 99.06% compilation time)
  0.045095 seconds (47.39 k allocations: 3.177 MiB, 98.80% compilation time)
  0.063423 seconds (49.12 k allocations: 3.257 MiB, 26.48% gc time, 99.12% compilation time)
  0.045049 seconds (50.88 k allocations: 3.332 MiB, 98.49% compilation time)
  0.044229 seconds (51.62 k allocations: 3.360 MiB, 98.32% compilation time)
  0.045476 seconds (53.46 k allocations: 3.440 MiB, 98.25% compilation time)
  0.046515 seconds (55.36 k allocations: 3.535 MiB, 98.07% compilation time)
  0.047503 seconds (57.27 k allocations: 3.618 MiB, 97.94% compilation time)
  0.046416 seconds (58.18 k allocations: 3.656 MiB, 97.85% compilation time)
  0.047273 seconds (60.17 k allocations: 3.741 MiB, 97.72% compilation time)
  0.048705 seconds (62.23 k allocations: 3.846 MiB, 97.48% compilation time)
  0.048830 seconds (64.30 k allocations: 3.936 MiB, 97.35% compilation time)
  0.047081 seconds (65.36 k allocations: 3.977 MiB, 97.28% compilation time)
  0.048451 seconds (67.51 k allocations: 4.071 MiB, 97.10% compilation time)
  0.130700 seconds (69.72 k allocations: 4.182 MiB, 61.24% gc time, 98.84% compilation time)
  0.050055 seconds (71.95 k allocations: 4.287 MiB, 96.87% compilation time)
  0.048853 seconds (73.16 k allocations: 4.333 MiB, 96.51% compilation time)
  0.051159 seconds (75.47 k allocations: 4.431 MiB, 96.38% compilation time)
  0.051718 seconds (77.84 k allocations: 4.568 MiB, 96.50% compilation time)

julia> for i in 1:length(cols)
    @time begin
        for row in eachrow(df)
            row=row[cols[1:i]]
            result=row|>collect|>f
            result
        end
    end
end
0.000039 seconds (161 allocations: 11.422 KiB)
  0.000013 seconds (191 allocations: 12.359 KiB)
  0.000020 seconds (221 allocations: 13.141 KiB)
  0.000021 seconds (251 allocations: 14.078 KiB)
  0.000016 seconds (281 allocations: 14.859 KiB)
  0.000026 seconds (311 allocations: 15.797 KiB)
  0.000017 seconds (341 allocations: 16.578 KiB)
  0.000021 seconds (371 allocations: 17.516 KiB)
  0.000018 seconds (401 allocations: 18.297 KiB)
  0.000018 seconds (431 allocations: 19.234 KiB)
  0.000018 seconds (461 allocations: 20.016 KiB)
  0.000020 seconds (491 allocations: 20.953 KiB)
  0.000024 seconds (521 allocations: 21.734 KiB)
  0.000023 seconds (551 allocations: 22.672 KiB)
  0.000024 seconds (581 allocations: 23.453 KiB)
  0.000027 seconds (611 allocations: 24.391 KiB)
  0.000027 seconds (641 allocations: 25.172 KiB)
  0.000026 seconds (671 allocations: 26.109 KiB)
  0.000026 seconds (701 allocations: 26.891 KiB)
  0.000027 seconds (731 allocations: 27.828 KiB)
  0.000031 seconds (761 allocations: 28.609 KiB)
  0.000029 seconds (791 allocations: 29.547 KiB)
  0.000035 seconds (821 allocations: 30.328 KiB)
  0.000031 seconds (851 allocations: 31.266 KiB)
  0.000033 seconds (881 allocations: 32.047 KiB)
  0.000033 seconds (911 allocations: 32.984 KiB)
  0.000034 seconds (941 allocations: 33.766 KiB)
  0.000034 seconds (971 allocations: 34.703 KiB)
  0.000037 seconds (1.00 k allocations: 35.484 KiB)
  0.000037 seconds (1.03 k allocations: 36.422 KiB)
  0.000038 seconds (1.06 k allocations: 37.203 KiB)
  0.000045 seconds (1.17 k allocations: 49.859 KiB)
  0.000048 seconds (1.20 k allocations: 50.797 KiB)
  0.000074 seconds (1.23 k allocations: 51.422 KiB)
  0.000051 seconds (1.26 k allocations: 52.047 KiB)
  0.000049 seconds (1.29 k allocations: 53.297 KiB)
  0.000069 seconds (1.32 k allocations: 54.234 KiB)
  0.000051 seconds (1.35 k allocations: 54.859 KiB)
  0.000059 seconds (1.38 k allocations: 55.484 KiB)
  0.000053 seconds (1.41 k allocations: 56.734 KiB)
  0.000051 seconds (1.44 k allocations: 57.672 KiB)
  0.000055 seconds (1.47 k allocations: 58.297 KiB)
  0.000061 seconds (1.53 k allocations: 83.922 KiB)
  0.000067 seconds (1.56 k allocations: 85.484 KiB)
  0.000069 seconds (1.59 k allocations: 86.578 KiB)
  0.000071 seconds (1.62 k allocations: 87.203 KiB)
  0.000062 seconds (1.65 k allocations: 87.828 KiB)
  0.000064 seconds (1.68 k allocations: 88.453 KiB)
  0.000062 seconds (1.71 k allocations: 89.078 KiB)
  0.000081 seconds (1.74 k allocations: 90.641 KiB)

I have no idea why they differ in performance so drastically though.

Creating that many vectors on the fly is always going to be slow. Are you sure you can’t do AsTable(cols) => ByRow(f)? That passes a NamedTuple to f, which should be a lot faster.
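
For instance, something along these lines (just a sketch; it assumes f is happy to receive the NamedTuple directly, which works for sum because a NamedTuple iterates over its values):

for i in 1:length(cols)
    # AsTable passes each row to f as a NamedTuple instead of splatted positional
    # arguments, so no fresh Vector has to be built per row
    @time select(df, AsTable(cols[1:i]) => ByRow(f))
end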

I don’t think there is any particularly fast way to create all of these vectors. Maybe using a StaticVector might help?
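
If f really must see an AbstractVector, a possible variant using StaticArrays.jl might look like this (untested sketch; it only addresses the allocation, not the recompilation, since SVector is built on the stack and avoids the heap allocation of [x...]):

using StaticArrays

for i in 1:length(cols)
    # wrap the splatted row values in a stack-allocated SVector instead of [x...]
    @time select(df, cols[1:i] => ByRow((x...) -> f(SVector(x))))
end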

I think it’s re-compiling the function in every iteration of the loop because you are running in global scope. That’s why it’s slow.

If you really want to use select to do this, try this.

@time select(df, 1:2=>(x...)->sum.(zip(x...)))

Otherwise, as others suggest, change the way you approach the problem.
For example:

@time sum.(eachrow(df))

If global scope were the only reason, then using a function barrier should make the performance of select and eachrow equal.
However, even after using a function barrier, it seems that eachrow is still more than 10x faster than select.

function use_select(df, cols)
    for i in 1:length(cols)
        select(df, cols[1:i]=>ByRow((x...)->f([x...])))
    end
end

function use_eachrow(df, cols)
    for i in 1:length(cols)
        map(eachrow(df)) do row
            row=row[cols[1:i]]
            result=row|>collect|>f
            result
        end
    end
end

julia> @time use_select(df, cols)
15.123364 seconds (19.28 M allocations: 1.144 GiB, 0.91% gc time, 99.72% compilation time)
julia> @time use_select(df, cols)
0.026519 seconds (295.76 k allocations: 13.219 MiB, 23.26% gc time, 10.57% compilation time)

julia> @time use_eachrow(df, cols)
0.065980 seconds (127.61 k allocations: 7.558 MiB, 97.26% compilation time)
julia> @time use_eachrow(df, cols)
0.001605 seconds (44.31 k allocations: 1.898 MiB)

I think the main difference must lie somewhere in the internals of select, which I have no insight into. I thought using ByRow would yield the same performance as eachrow, so I’m a bit baffled.

Oh, you meant the recompiling of the anonymous function (x...)->f([x...]). Sorry for the misunderstanding. That does seem to slow things down dramatically.
With a named function it no longer recompiles, even when run in global scope, and it is only about 10x slower than eachrow.

function named_function(x...)
    f([x...])
end

julia> for i in 1:length(cols)
    @time begin
        result=select(df, cols[1:i]=>ByRow(named_function))
        result
    end
end

0.009552 seconds (1.11 k allocations: 62.133 KiB, 98.44% compilation time)
  0.000119 seconds (158 allocations: 6.727 KiB)
  0.000085 seconds (173 allocations: 7.164 KiB)
  0.000078 seconds (184 allocations: 7.797 KiB)
  0.000121 seconds (194 allocations: 8.047 KiB)
  0.000150 seconds (204 allocations: 8.500 KiB)
  0.000096 seconds (214 allocations: 8.750 KiB)
  0.000084 seconds (224 allocations: 9.203 KiB)
  0.000080 seconds (234 allocations: 9.453 KiB)
  0.000130 seconds (244 allocations: 9.906 KiB)
  0.000127 seconds (258 allocations: 10.453 KiB)
  0.000141 seconds (268 allocations: 10.922 KiB)
  0.000110 seconds (278 allocations: 11.188 KiB)
  0.000102 seconds (288 allocations: 11.656 KiB)
  0.000100 seconds (298 allocations: 11.922 KiB)
  0.000102 seconds (308 allocations: 12.391 KiB)
  0.000107 seconds (318 allocations: 12.656 KiB)
  0.000109 seconds (328 allocations: 13.125 KiB)
  0.000111 seconds (338 allocations: 13.391 KiB)
  0.000114 seconds (348 allocations: 13.859 KiB)
  0.000158 seconds (358 allocations: 14.125 KiB)
  0.000141 seconds (368 allocations: 14.594 KiB)
  0.000221 seconds (378 allocations: 14.859 KiB)
  0.000179 seconds (388 allocations: 15.328 KiB)
  0.000168 seconds (398 allocations: 15.594 KiB)
  0.000151 seconds (408 allocations: 16.062 KiB)
  0.000169 seconds (471 allocations: 17.156 KiB)
  0.000160 seconds (481 allocations: 17.625 KiB)
  0.000177 seconds (491 allocations: 17.891 KiB)
  0.000161 seconds (501 allocations: 18.359 KiB)
  0.000162 seconds (511 allocations: 18.625 KiB)
  0.000229 seconds (963 allocations: 45.812 KiB)
  0.000254 seconds (1.66 k allocations: 82.219 KiB)
  0.000296 seconds (3.13 k allocations: 148.531 KiB)
  0.000385 seconds (4.62 k allocations: 209.750 KiB)
  0.000446 seconds (6.17 k allocations: 286.219 KiB)
  0.000525 seconds (7.74 k allocations: 351.641 KiB)
  0.000610 seconds (9.37 k allocations: 432.375 KiB)
  0.000590 seconds (11.01 k allocations: 501.922 KiB)
  0.000692 seconds (12.73 k allocations: 585.797 KiB)
  0.000762 seconds (14.45 k allocations: 657.531 KiB)
  0.001130 seconds (16.24 k allocations: 747.531 KiB)
  0.001024 seconds (18.04 k allocations: 823.859 KiB)
  0.001007 seconds (19.90 k allocations: 915.938 KiB)
  0.001058 seconds (21.78 k allocations: 994.688 KiB)
  0.001077 seconds (23.73 k allocations: 1.068 MiB)
  0.001120 seconds (25.68 k allocations: 1.149 MiB)
  0.001278 seconds (27.71 k allocations: 1.247 MiB)
  0.001371 seconds (29.74 k allocations: 1.330 MiB)
  0.001383 seconds (31.84 k allocations: 1.435 MiB)

No, I would have expected the compiler to cache those anonymous functions when you call the function the first time.

Maybe this is due to the overhead of select, and it gets better with more rows? Hard to say.
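
One way to probe that hypothesis (sizes below are arbitrary, just for illustration): repeat the comparison on a DataFrame with many more rows, so the fixed per-call overhead of select is amortized over the row-wise work. This reuses the named_function helper from the previous post.

df_big = DataFrame(rand(100_000, 50), :auto)
cols_big = propertynames(df_big)

select(df_big, cols_big => ByRow(named_function));   # warm-up call to exclude compilation
@time select(df_big, cols_big => ByRow(named_function));
@time map(r -> f(collect(r)), eachrow(df_big));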

You could, alternatively, un-vectorize your function:

fv(x) = sum(x)
fsplat(x...)=fv(x)

for i in 1:length(cols)
    @time select(df, cols[1:i] => ByRow(fsplat))
end

It is not necessary to use the names of the columns:

for i in 1:length(cols)
    @time select(df, 1:i => ByRow(fsplat))
end

If I wrap the test code into a function

function test(df)
    cols = propertynames(df)
    for i in 1:length(cols)
        @time select(df, cols[1:i] => ByRow((x...) -> f([x...])))
    end
end

The output is the same. Thus it is not due to global scope.

Edit: I forgot to run it twice. The second time is much faster.

I cannot un-vectorize my function, because that is the input form required for evaluating an expression in DynamicExpressions.jl. In short, the expression expects a list of values for each variable used to construct the Node.

I like the eachrow approach. In my actual use case, a subset of columns (not necessarily consecutive) is used.
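
For a non-consecutive subset, the eachrow pattern stays essentially the same; for example (the column names here are just placeholders):

subset_cols = [:x3, :x7, :x20]     # any subset of columns, need not be consecutive
results = map(eachrow(df)) do row
    # indexing a DataFrameRow with a vector of names keeps only those columns
    f(collect(row[subset_cols]))
end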

This is what LazyRow was born for, although a vanilla DataFrame won’t be type stable when LazyRow lazily calls getproperty() inside the row loop.

I have long advocated for LazyRow to be relaxed so we can use it on any column-table-compatible data structure: Relax `LazyRow` for other packages · Issue #271 · JuliaArrays/StructArrays.jl · GitHub
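
For reference, a rough sketch of what the LazyRow route looks like today, going through a StructArray rather than the DataFrame itself (column names are placeholders; as noted above, the getproperty calls with runtime Symbols are not type stable):

using StructArrays, Tables

sa = StructArray(Tables.columntable(df))   # wrap the columns in a StructArray so LazyRows is available

subset_cols = (:x3, :x7, :x20)
results = map(LazyRows(sa)) do r
    # getproperty on a LazyRow reads directly from the parent columns without materializing the row
    f([getproperty(r, c) for c in subset_cols])
end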

I’m not sure I understood your context exactly, and maybe I expressed myself badly (or Google Translate mangled my text). I’ll try to say it another way.
You don’t have to modify your function that takes a vector as input.
You need to define another one that bridges the gap between the way the DataFrames mini-language passes arguments and the fact that your function expects a single collection as input.
We therefore need something that gathers the loose arguments into one collection (perhaps a more suitable name would be fslurp()?) and applies the original function to it.
If your function is fv(x) with x a vector, you could use the following code:

fslurp(x...) = fv(x)

for i in 1:length(cols)
    # selectioncols stands for whatever column selection you need, e.g. cols[1:i]
    @time select(df, selectioncols => ByRow(fslurp))
end

I got it now, thanks! But the select is still slower than the eachrow approach.

So, if you are willing to use other approaches besides select, try this (you may need using Tables to bring Tables.namedtupleiterator into scope):

@time sum.(Tables.namedtupleiterator(df))

This doesn’t look bad either:

for i in 1:length(cols)
    @time select(df,AsTable(:)=>x->sum.(Tables.namedtupleiterator(x)))
end

Can you file an issue with DataFrames.jl? I really think the select should be just as fast, and I don’t know why it’s slower.