Row wise operations in DataFrames

Hey guys, I have a very big DataFrame, where I want to do row wise linear algebra operations if certain conditions are met.

What i need is that for every row, check if column (variable) is 1 or 0, and for each case, do some simple operations with some observations in the Dataframe, and store the resulting scalar in a column of the dataframe.

The code I wrote for a DataFrame with 120000 rows, however, takes very long to accomplish it, around 5 minutes.

I’ve tried coming up with a reproducible example below, however, it is actually doing the operation very fast.

In any case, I believe there must be a better way to handle these types of operations in a DataFrame. I am not very good in coding, but I’ve read somewhere that ‘if’ operations are not usually very efficient. What would be a better way to do this type of row wise operations?

using DataFrames
using Distributions

test = [DataFrame(x = rand(Normal(),1000000)) DataFrame(y = rand(Normal(),1000000)) DataFrame(z = repeat([0.],1000000))]
for i = 1:1000000
    if test[:x][i] >= 0
        test[:z][i] = test[:y][i]
        test[:z][i] = 2*test[:y][i]
julia> using DataFrames, BenchmarkTools

julia> n = 10_000;

julia> df = DataFrame(x = rand(n), y = rand(n), z = zeros(Float64, n));

julia> @inline function testxy(x, y)
           return x < y

julia> function zsbyrow(df::DataFrame)
           for i=1:size(df)[1]
               if testxy(df[:x][i], df[:y][i])
                  df[:z][i] = df[:x][i] + df[:y][i]
                  df[:z][i] = df[:x][i] - df[:y][i]
           return df

julia> function zsbycol(df::DataFrame)
           xv = view(df[:x],:)
           yv = view(df[:y],:)
           xytests = testxy.(xv, yv)
           df[:z]  = xytests .* (xv .+ yv)
           xytests = (x->!x).(xytests)
           df[:z][xytests] = (xv .- yv)[xytests]
           return df

julia> byrow = @btime zsbyrow($df);
  1.567 ms (97446 allocations: 1.49 MiB)
julia> bycol = @btime zsbycol($df);
  164.967 μs (29 allocations: 206.75 KiB)

julia> round(1567/165, digits=2), round(1490/207, digits=2)
(9.5, 7.2)

In DataFrames, column lookup is a bit expensive. So you might want to make test.z just a vector and input it as needed. In fact, because DataFrames doesn’t include the types of the columns as part of its type definition, you should probably add a function barrier in your code. Meaning, you should just input the vectors you need (or a tuple of the vectors) into the function that does the operation.

Keep in mind that eachrow is a bit easier to reason about in this case, but when I just tested it on 1.0 it was a little bit slower than the implementation above.

julia> @btime loop($test) # same as above
  229.869 ms (5498804 allocations: 83.91 MiB)

julia> @btime byrow($test) # with the eachrow function from DataFrames
  275.645 ms (9498805 allocations: 160.20 MiB)

julia> @btime loopvec($test.x, $test.y, $test.z) # Same as above, but with vectors
  1.847 ms (0 allocations: 0 bytes)

Define your helper function and then,

test[:z] = ifelse.(test[:z] .≥ 0, test[:y], 2test[:y])