How to speed up rowsum function?

There is a rowsum function in R, it’s very helpful and fast when constructing some likelihood function, rowsum can apply a function to a group subsetted from a matrix then concatenate these resulted vectors to a new matrix. So I write a similar funtion myself in Julia, however, the speed is no ideal, julia version costs 500ms, R version costs 5ms. Is there any way to speed up julia version code ?

# Julia version
using Random
using BenchmarkTools
seed = Random.seed!(2022 - 20 - 13)

function rowsum(m::AbstractMatrix{<:Number}, g::Vector{Int})
    rst = Matrix{eltype(m)}(undef, maximum(g), size(m, 2))
    @inbounds @simd for group in 1:maximum(g)
        rst[group, :] = sum(view(m, g .== group, :); dims = 1)
    end 
    return rst
end

N = 100000
a = rand(seed, N, 4);
g = repeat(1:Int(N/5), inner = 5);

rowsum(a, g);

@btime rowsum($a, $g);
@code_warntype rowsum(a, g) # make sure type stability
# R version
library(microbenchmark)
set.seed(2022 - 20 - 13)

N <- 100000
a <- matrix(rnorm(N * 4), N, 4)

g <- rep(1:(N/5), each = 5)

b <- rowsum(a, g)

t = microbenchmark(
    "rowsum" = {
        b = rowsum(a, g)
    }, 
    times = 1000,
    unit = "ms"
)

t

Your Julia implementation is not efficient.
Your grouping in

view(m, g .== group, :)

has to check the complete vector g again and again for every value in group.
Per definition of the R function the same can be done on a single run through g, e.g. by storing the row index for each factor found in g in a dictionary.

In other words, your implementation compares each group value maximum(g) * 100000 times, means

julia> length(1:maximum(g))*length(g)
2000000000

This times comparisons vs.

julia> length(g)
100000

this times running a single run through g.

I am not sure if I am clear enough. In the mean time I will try some more efficient implementation.

Here’s one way to improve this, but there might be better ways:

julia> function rowsum2(mat::AbstractMatrix{<:Number}, groups::AbstractVector{<:Integer})
           size(mat, 1) == length(groups) || error("length of group vector must match matrix")
           rst = fill!(similar(mat, maximum(groups), size(mat, 2)), 0)
           for (i, g) in pairs(groups)
             @views rst[g, :] .+= mat[i, :]
           end
           return rst
       end;

julia> r = @btime rowsum($a, $g);
  min 1.125 s, mean 1.127 s (160002 allocations, 331.12 MiB)

julia> r ≈ @btime rowsum2($a, $g)
  min 984.500 μs, mean 1.061 ms (2 allocations, 625.05 KiB)
true

Here is my implementation, which isn’t faster but perhaps for OP better to read and which matches better to my wall of text from above:

function rowsum3(m::AbstractMatrix{<:Number}, g::Vector{Int})
	rst = Matrix{eltype(m)}(undef, maximum(g), size(m, 2))
	idx = [ Int[] for i in 1:maximum(g) ]
	for i in 1:length(g)
       push!(idx[g[i]],i)
    end
    @inbounds for (i,rows) in enumerate(idx)
        rst[i, :] = sum(view(m, rows, :); dims = 1)
    end 
	return rst
end
julia> @btime rowsum($a, $g);
  965.716 ms (160002 allocations: 331.12 MiB)

julia> @btime rowsum3($a, $g);
  7.278 ms (120004 allocations: 8.09 MiB)

(EDIT: gave it the name rowsum3)

You’re right. It’s really a big problem

It’s much faster than R version, great solution!

A slight modification to the @mcabbott solution that makes it a little faster (running the matrices by column), but perhaps loses too much in generality.
I do not know if the use cases where you have to apply it do not fall within the scheme that I adopted (groups of fixed length, in essence)

julia> function rowsum3(mat::AbstractMatrix{<:Number}, groups::AbstractVector{<:Integer})
                  size(mat, 1) == length(groups) || error("length of group vector must match matrix")
                  rst = fill!(similar(mat, maximum(groups), size(mat, 2)), 0)    
                   gr = maximum(groups)*size(mat, 2)
                   sgr = Int(length(groups)/maximum(groups))
                   for g in 1:gr
                       for i in (g-1)*sgr+1:g*sgr
                           @views rst[g] += mat[i]
                       end
                   end
                  return rst
              end;

julia> @btime rowsum3($a, $g);
  552.300 μs (2 allocations: 625.05 KiB)

julia> rowsum2(a, g) == rowsum3(a, g) 
true