# Row wise operations in DataFrames

Hey guys, I have a very big DataFrame, where I want to do row wise linear algebra operations if certain conditions are met.

What i need is that for every row, check if column (variable) is 1 or 0, and for each case, do some simple operations with some observations in the Dataframe, and store the resulting scalar in a column of the dataframe.

The code I wrote for a DataFrame with 120000 rows, however, takes very long to accomplish it, around 5 minutes.

I’ve tried coming up with a reproducible example below, however, it is actually doing the operation very fast.

In any case, I believe there must be a better way to handle these types of operations in a DataFrame. I am not very good in coding, but I’ve read somewhere that ‘if’ operations are not usually very efficient. What would be a better way to do this type of row wise operations?

``````using DataFrames
using Distributions

test = [DataFrame(x = rand(Normal(),1000000)) DataFrame(y = rand(Normal(),1000000)) DataFrame(z = repeat([0.],1000000))]
for i = 1:1000000
if test[:x][i] >= 0
test[:z][i] = test[:y][i]
else
test[:z][i] = 2*test[:y][i]
end
end
``````
``````julia> using DataFrames, BenchmarkTools

julia> n = 10_000;

julia> df = DataFrame(x = rand(n), y = rand(n), z = zeros(Float64, n));

julia> @inline function testxy(x, y)
return x < y
end

julia> function zsbyrow(df::DataFrame)
for i=1:size(df)[1]
if testxy(df[:x][i], df[:y][i])
df[:z][i] = df[:x][i] + df[:y][i]
else
df[:z][i] = df[:x][i] - df[:y][i]
end
end
return df
end

julia> function zsbycol(df::DataFrame)
xv = view(df[:x],:)
yv = view(df[:y],:)
xytests = testxy.(xv, yv)
df[:z]  = xytests .* (xv .+ yv)
xytests = (x->!x).(xytests)
df[:z][xytests] = (xv .- yv)[xytests]
return df
end

julia> byrow = @btime zsbyrow(\$df);
1.567 ms (97446 allocations: 1.49 MiB)
julia> bycol = @btime zsbycol(\$df);
164.967 μs (29 allocations: 206.75 KiB)

julia> round(1567/165, digits=2), round(1490/207, digits=2)
(9.5, 7.2)
``````
2 Likes

In DataFrames, column lookup is a bit expensive. So you might want to make `test.z` just a vector and input it as needed. In fact, because DataFrames doesn’t include the types of the columns as part of its type definition, you should probably add a function barrier in your code. Meaning, you should just input the vectors you need (or a tuple of the vectors) into the function that does the operation.

Keep in mind that `eachrow` is a bit easier to reason about in this case, but when I just tested it on 1.0 it was a little bit slower than the implementation above.

``````julia> @btime loop(\$test) # same as above
229.869 ms (5498804 allocations: 83.91 MiB)

julia> @btime byrow(\$test) # with the eachrow function from DataFrames
275.645 ms (9498805 allocations: 160.20 MiB)

julia> @btime loopvec(\$test.x, \$test.y, \$test.z) # Same as above, but with vectors
1.847 ms (0 allocations: 0 bytes)

``````

Define your helper function and then,

``````test[:z] = ifelse.(test[:z] .≥ 0, test[:y], 2test[:y])
``````