Different ways to calculate rowwise sums?

using Random, DataFrames

Random.seed!(1234);
df = DataFrame(randn(5, 3), :auto);
 Row │ x1         x2         x3
 ────┼──────────────────────────────────
   1 │  0.867347   2.21188   -0.560501
   2 │ -0.901744   0.532813  -0.0192918
   3 │ -0.494479  -0.271735   0.128064
   4 │ -0.902914   0.502334   1.85278
   5 │  0.864401  -0.516984  -0.827763

Any of these three options gives me the rowwise sum:

sum.(eachrow(df))

combine(df, AsTable(:) .=> sum)

select(df, AsTable(:) => ByRow(sum) => :sum)
 Row │ x1_x2_x3_sum 
─────┼──────────────
   1 │     2.51872
   2 │    -0.388222
   3 │    -0.63815
   4 │     1.4522
   5 │    -0.480346

What’s the difference, or which one should I use?
The latter two options also work with transform() if I want to add the new column to the original dataframe.
I don’t know how to use the first one inside transform().

If I want columnwise sums instead:

sum.(eachcol(df))
combine(df, names(df) .=> sum)

I don’t know if there is any better alternative.
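
One more candidate for columnwise sums, as a minimal sketch: mapcols from DataFrames.jl applies a function to every column and returns a DataFrame. The tiny two-column table below is made up for illustration so the result can be checked by hand.

```julia
using DataFrames

# Illustrative data, not the table from the post above
df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# mapcols applies sum to each column; since sum returns a scalar,
# the result is a one-row DataFrame that keeps the column names x1, x2
colsums = mapcols(sum, df)

# The same numbers as a plain Vector, via broadcasting over the columns
colsums_vec = sum.(eachcol(df))
```

Compared with combine(df, names(df) .=> sum), the only visible difference should be the column names: combine appends _sum to each name, while mapcols keeps the originals.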

BenchmarkTools is your friend:

using Random, DataFrames, BenchmarkTools

Random.seed!(1234);
df = DataFrame(randn(10000, 40), :auto);

@btime sum.(eachrow(df))
@btime combine(df, AsTable(:) .=> sum)
@btime select(df, AsTable(:) => ByRow(sum) => :sum)

shows

  66.825 ms (2379067 allocations: 42.48 MiB)
  368.600 μs (318 allocations: 2.99 MiB)
  740.400 μs (240 allocations: 93.88 KiB)

OK, that covers time and memory, but what about other considerations?

Why does the third option use much less memory than the second one?

And are the latter two options also the faster ones?

Good question, anyone?

Do you want me to investigate these too?

Yes, but I’m more interested in knowing the best way to append the column of rowwise sums.

transform(df, AsTable(:) .=> sum)
transform(df, AsTable(:) .=> ByRow(sum))
hcat(df, combine(df, AsTable(:) .=> sum))

Anything better?
Here ByRow is much slower, though it needs less memory.

These are very small tables so the performance is affected by factors not related to summation.

In DataFrames.jl 1.3, which will be released soon (it is being held back until the release of Julia 1.7), the fastest option, especially for wide and large tables, will be transform(df, AsTable(:) => ByRow(sum)).

For the time being, an easy (i.e. IMO natural for someone who knows how things work in Julia Base) and reasonably fast option is df.sum = sum(eachcol(df)).
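
To make the sum(eachcol(df)) suggestion concrete, here is a minimal sketch on a made-up table (the data is illustrative, chosen so the totals are easy to verify by hand). It relies only on the fact that summing a collection of equal-length vectors adds them elementwise.

```julia
using DataFrames

# Illustrative data, not the table from the question
df = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [10.0, 20.0, 30.0])

# eachcol(df) iterates over the column vectors; summing them adds the
# vectors elementwise, which yields one total per row
rowsums = sum(eachcol(df))

# Cross-check against the ByRow formulation before modifying df
byrow = select(df, AsTable(:) => ByRow(sum) => :sum)

# Assigning to a new property appends the column to df in place
df.sum = rowsums
```

Note the order: the row sums are computed before the new column is attached. Running sum(eachcol(df)) again afterwards would include the sum column itself in the total.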


Also note that in this case .=> is the same as =>; the . does not do anything here.
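
A small sketch of that point, on made-up data: with a single AsTable(:) source and a single function, broadcasting has nothing to expand, so both spellings build the same Pair. The dot only matters when the left-hand side is a collection, as in names(df) .=> sum.

```julia
using DataFrames

# Illustrative data
df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# One source, one function: both forms produce the same result
a = combine(df, AsTable(:) => sum)
b = combine(df, AsTable(:) .=> sum)

# Here the dot does real work: one Pair per column name,
# giving columnwise sums instead of rowwise ones
c = combine(df, names(df) .=> sum)
```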


In fact, my initial question wasn’t about speed but about whether there are other differences or disadvantages. For example, whether the returned object is more or less useful (a dataframe vs. something else) for further operations.

And I have just discovered that…

df = DataFrame(randn(5, 3), :auto);
allowmissing!(df)
df[1,1] = missing

select(df, AsTable(:) .=> sum∘skipmissing => :sum)

select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum)  
sum.(skipmissing.(eachrow(df)))

The first option, not using ByRow, doesn’t produce the expected output if we have missings. I guess we will have similar problems with other functions.

And another question,
How can I run more complex functions inside the ByRow()?

I’ve tried
select(df, AsTable(:) => ByRow(x -> x.^2) => :sum)

but it doesn’t work, it says:

ERROR: ArgumentError: broadcasting over dictionaries and NamedTuples is reserved

@bkamins How can I calculate the sum of the squares of the elements for each row?

In the first option (the one without ByRow) you are passing whole columns to skipmissing, and a whole column is never missing, so nothing gets skipped.

select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum) is correct and will be very fast in DataFrames.jl 1.3.
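
A minimal sketch of the difference, using a made-up two-row table with one missing value so the outcome can be checked by hand:

```julia
using DataFrames

# Illustrative data with a single missing entry in row 1
df = DataFrame(x1 = [missing, 2.0], x2 = [3.0, 4.0])

# Without ByRow the function receives a NamedTuple of whole columns.
# skipmissing sees two vectors, neither of which is `missing`, so sum
# adds the columns elementwise and the missing value propagates.
whole = select(df, AsTable(:) => sum∘skipmissing => :sum)

# With ByRow the function receives one NamedTuple of scalars per row,
# so skipmissing drops the missing entry of that row.
rowwise = select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum)
```

With this data, the first row of whole.sum is still missing, while rowwise.sum has a number in every row.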

How can I run more complex functions inside the ByRow()?

select(df, AsTable(:) => ByRow(x -> x.^2) => :sum)

This fails, not because of DataFrames.jl but because of Julia Base, which does not allow broadcasting over a NamedTuple. In any case the expression would not be correct, as there is no sum in it. You have to write:

select(df, AsTable(:) => ByRow(x -> sum(v -> v^2, x)) => :sum)

(you need to sum the squares)

In general I would recommend not writing functions like x -> sum(v -> v^2, x) as anonymous functions but predefining them, as using a lot of anonymous functions can lead to code that is not very readable (it is like deciding whether to write one long one-line expression or to define variables that store intermediate values, even if they are discarded later).
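
As a sketch of what predefining could look like: the helper name sumsq below is made up for illustration, and the data is a toy table chosen so the result is easy to verify.

```julia
using DataFrames

# Hypothetical helper (the name is an assumption): sum of squares of the
# values in one row. ByRow passes each row as a NamedTuple, and
# sum(f, itr) iterates over its values.
sumsq(row) = sum(v -> v^2, row)

# Illustrative data
df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# The transformation now reads as source => ByRow(named function) => output
res = select(df, AsTable(:) => ByRow(sumsq) => :sumsq)
```

A named function can also be tested on its own, e.g. sumsq((a = 1.0, b = 3.0)), before it is used inside select or transform.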
