Different ways to calculate rowwise sums?

Juan · November 5, 2021, 8:36pm

using Random, DataFrames

Random.seed!(1234);
df = DataFrame(randn(5, 3), :auto);

 Row │ x1         x2         x3
 ────┼──────────────────────────────────
   1 │  0.867347   2.21188   -0.560501
   2 │ -0.901744   0.532813  -0.0192918
   3 │ -0.494479  -0.271735   0.128064
   4 │ -0.902914   0.502334   1.85278
   5 │  0.864401  -0.516984  -0.827763

Any of these three options gives me the rowwise sum:

sum.(eachrow(df))

combine(df, AsTable(:) .=> sum)

select(df, AsTable(:) => ByRow(sum) => :sum)

 Row │ x1_x2_x3_sum 
─────┼──────────────
   1 │     2.51872
   2 │    -0.388222
   3 │    -0.63815
   4 │     1.4522
   5 │    -0.480346

What’s the difference, or which one should I use?
The latter two options also work with transform() if I want to add this new column to the original data frame.
I don’t know how to do the same with the first one.

If I want columnwise sums instead:

sum.(eachcol(df))
combine(df, names(df) .=> sum)

I don’t know if there is any better alternative.

goerch · November 5, 2021, 8:51pm

BenchmarkTools is your friend:

using Random, DataFrames, BenchmarkTools

Random.seed!(1234);
df = DataFrame(randn(10000, 40), :auto);

@btime sum.(eachrow(df))
@btime combine(df, AsTable(:) .=> sum)
@btime select(df, AsTable(:) => ByRow(sum) => :sum)

shows

  66.825 ms (2379067 allocations: 42.48 MiB)
  368.600 μs (318 allocations: 2.99 MiB)
  740.400 μs (240 allocations: 93.88 KiB)

Juan · November 5, 2021, 8:54pm

OK, that covers time and memory, but what about other considerations?

Why does the third option use much less memory than the second one?

goerch · November 5, 2021, 9:00pm

The latter two options are also the faster ones, aren’t they?

Good question, anyone?

Do you want me to investigate these too?

Juan · November 5, 2021, 11:19pm

Yes, but I’m more interested in knowing the best way to append the column of rowwise sums.

transform(df, AsTable(:) .=> sum)
transform(df, AsTable(:) .=> ByRow(sum))
hcat(df, combine(df, AsTable(:) .=> sum))

Anything better?
Here ByRow is much slower, though it needs less memory.

bkamins · November 8, 2021, 4:54pm

These are very small tables so the performance is affected by factors not related to summation.

In DataFrames.jl 1.3, which will be released soon (it is being held back until the release of Julia 1.7), the fastest option, especially for wide and large tables, will be transform(df, AsTable(:) => ByRow(sum)).

For the time being an easy (i.e. IMO natural for someone knowing how things in Julia Base work), and reasonably fast option is df.sum = sum(eachcol(df)).
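A minimal sketch of why sum(eachcol(df)) gives row sums, using a small made-up table (the column names a, b, c and rowsum are hypothetical, not from this thread): eachcol iterates over the column vectors, and sum adds those vectors elementwise.

```julia
using DataFrames

# Hypothetical toy table, not the data from this thread.
df = DataFrame(a = [1.0, 2.0], b = [10.0, 20.0], c = [100.0, 200.0])

# Compute the row sums first, so the new column is not included in the sum.
rowsums = sum(eachcol(df))   # a + b + c, added elementwise

# Store the result as a new column of the same data frame.
df.rowsum = rowsums
```

Because the result is a plain vector, it can be assigned as a column directly, without going through transform.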

Also note that in this case .=> is the same as =>; the . does not do anything here.
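A small sketch of that point on a made-up table (column names a and b are hypothetical): with a single AsTable(:) source the dotted and undotted forms should give the same result, whereas with a vector of column names the dot is what creates one operation per column.

```julia
using DataFrames

# Hypothetical toy table.
df = DataFrame(a = [1, 2], b = [10, 20])

# AsTable(:) is treated as a single source, so broadcasting changes nothing.
broadcasted = select(df, AsTable(:) .=> sum)
plain = select(df, AsTable(:) => sum)

# names(df) is a vector, so here the dot creates one pair per column.
colsums = combine(df, names(df) .=> sum)
```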

Juan · November 8, 2021, 8:17pm

In fact, my initial question wasn’t about speed but about whether there are other differences or disadvantages, for example whether the returned object (a data frame vs. something else) is more or less useful for further operations.
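To make the difference in returned objects concrete, here is a minimal sketch on a made-up table (the names a, b and the variables v, d, t are hypothetical): the broadcast version returns a plain vector, while select and transform return data frames.

```julia
using DataFrames

# Hypothetical toy table.
df = DataFrame(a = [1, 2], b = [10, 20])

v = sum.(eachrow(df))                                # plain Vector of row sums
d = select(df, AsTable(:) => ByRow(sum) => :sum)     # DataFrame with only :sum
t = transform(df, AsTable(:) => ByRow(sum) => :sum)  # original columns plus :sum
```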

And I have just discovered that…

df = DataFrame(randn(5, 3), :auto);
allowmissing!(df)
df[1,1] = missing

select(df, AsTable(:) .=> sum∘skipmissing => :sum)

select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum)  
sum.(skipmissing.(eachrow(df)))

The first option, which doesn’t use ByRow, doesn’t produce the expected output when there are missing values. I guess other functions will have similar problems.

And another question,
How can I run more complex functions inside the ByRow()?

I’ve tried
select(df, AsTable(:) => ByRow(x -> x.^2) => :sum)

but it doesn’t work, it says:

ERROR: ArgumentError: broadcasting over dictionaries and NamedTuples is reserved

@bkamins How can I calculate the sum of the squares of the elements for each row?

bkamins · November 8, 2021, 8:52pm

In the option without ByRow, you are passing whole columns to skipmissing, and a whole column is never missing, so nothing gets skipped.
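A minimal sketch of the difference on a made-up table (column names a and b are hypothetical): without ByRow the function sees a NamedTuple of column vectors, so skipmissing has nothing to skip and the missing value propagates into the sum of that row; with ByRow the function sees one row at a time.

```julia
using DataFrames

# Hypothetical toy table with one missing value in the second row.
df = DataFrame(a = [1, missing, 3], b = [10, 20, 30])

# Whole columns are passed: the columns are added elementwise,
# so the missing entry is still there when the values are added.
whole = select(df, AsTable(:) => sum∘skipmissing => :sum)

# One row at a time is passed: the missing entry is skipped within the row.
byrow = select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum)
```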

select(df, AsTable(:) => ByRow(sum∘skipmissing) => :sum) is correct and will be very fast in DataFrames.jl 1.3.

How can I run more complex functions inside the ByRow()?

select(df, AsTable(:) => ByRow(x -> x.^2) => :sum)

This fails, not because of DataFrames.jl but because of Julia Base, and in general it is incorrect because there is no sum in your expression. You have to write:

select(df, AsTable(:) => ByRow(x -> sum(v -> v^2, x)) => :sum)

(you need to sum the squares)

In general, I would recommend predefining functions like x -> sum(v -> v^2, x) rather than writing them as anonymous functions, as using a lot of anonymous functions can lead to code that is hard to read (it is like deciding whether to write one long one-line expression or to define variables that store intermediate values, even if they are discarded later).
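As a sketch of what predefining could look like (the helper name sumsq and the toy table are hypothetical, chosen only for illustration):

```julia
using DataFrames

# Hypothetical helper: sum of squares of the values in one row.
sumsq(row) = sum(v -> v^2, row)

# Hypothetical toy table.
df = DataFrame(a = [1, 2], b = [3, 4])

# The named function is passed to ByRow in place of the anonymous one.
out = select(df, AsTable(:) => ByRow(sumsq) => :sumsq)
```

The named helper can also be tested on its own, for example by calling sumsq((a = 1, b = 3)) on a single NamedTuple.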
