Hey guys, I have a very big DataFrame where I want to do row-wise linear algebra operations when certain conditions are met.
What I need is, for every row, to check whether a column (variable) is 1 or 0 and, in each case, do some simple operations with some observations in the DataFrame, storing the resulting scalar in a column of the DataFrame.
However, the code I wrote takes very long on a DataFrame with 120,000 rows: around 5 minutes.
I've tried coming up with a reproducible example below; however, it actually runs very fast.
In any case, I believe there must be a better way to handle these types of operations on a DataFrame. I am not very good at coding, but I've read somewhere that 'if' branches are not usually very efficient. What would be a better way to do this kind of row-wise operation?
```julia
using DataFrames
using Distributions

test = DataFrame(x = rand(Normal(), 1_000_000),
                 y = rand(Normal(), 1_000_000),
                 z = zeros(1_000_000))

for i in 1:1_000_000
    if test.x[i] >= 0
        test.z[i] = test.y[i]
    else
        test.z[i] = 2 * test.y[i]
    end
end
```
In DataFrames, column lookup is a bit expensive, so you might want to make `test.z` just a vector and pass it in as needed. In fact, because a `DataFrame` doesn't include the types of its columns as part of its type definition, you should probably add a function barrier in your code; that is, pass the vectors you need (or a tuple of the vectors) into the function that does the operation.
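A minimal sketch of what such a function barrier could look like — the name `loopvec` and its signature are my own choice here, matching the benchmark labels below:

```julia
using DataFrames, Distributions

# Hypothetical function barrier: the vectors are passed in directly, so
# their concrete element type (Float64) is known to the compiler inside
# the function and the loop can be fully specialized.
function loopvec(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    @inbounds for i in eachindex(x, y, z)
        z[i] = x[i] >= 0 ? y[i] : 2 * y[i]
    end
    return z
end

test = DataFrame(x = rand(Normal(), 1_000_000),
                 y = rand(Normal(), 1_000_000),
                 z = zeros(1_000_000))

# Extract the columns once and pass them through the barrier.
loopvec(test.x, test.y, test.z)
```

The point is that all the type-unstable column lookups happen once, outside the hot loop, instead of on every iteration.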
Keep in mind that `eachrow` is a bit easier to reason about in this case, but when I just tested it on Julia 1.0 it was a little slower than the implementation above.
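For reference, the `eachrow` variant could look roughly like this (`byrow` is just my name for it, matching the benchmark label below):

```julia
using DataFrames

# Hypothetical eachrow-based version: each `r` is a DataFrameRow view,
# so assigning to r.z writes back into the underlying DataFrame.
function byrow(df::DataFrame)
    for r in eachrow(df)
        r.z = r.x >= 0 ? r.y : 2 * r.y
    end
    return df
end
```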
```julia
julia> using BenchmarkTools  # provides @btime

julia> @btime loop($test)  # same as above
  229.869 ms (5498804 allocations: 83.91 MiB)

julia> @btime byrow($test)  # with the eachrow function from DataFrames
  275.645 ms (9498805 allocations: 160.20 MiB)

julia> @btime loopvec($test.x, $test.y, $test.z)  # same as above, but with vectors
  1.847 ms (0 allocations: 0 bytes)
```
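As an aside, for this particular condition you can also avoid the explicit loop entirely with broadcasting; `ifelse` works elementwise here and sidesteps the `if` branching the question asked about (a sketch, not something I benchmarked against the versions above):

```julia
using DataFrames, Distributions

test = DataFrame(x = rand(Normal(), 1_000_000),
                 y = rand(Normal(), 1_000_000),
                 z = zeros(1_000_000))

# Fused broadcast: `.=` writes into the existing z column in place,
# picking y[i] where x[i] >= 0 and 2*y[i] otherwise.
test.z .= ifelse.(test.x .>= 0, test.y, 2 .* test.y)
```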