What's Julia's solution to tapply or accumarray?

leon · October 19, 2021, 5:36pm

I have a one-column index as below:
A = [1, 2, 2, 3, 3, 3];

I have a matrix B as below:
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8];

I would want to calculate the average of the duplicate rows based on Index A. The outcome C would be:
C = [1.0 2.0 3.0; 2.5 3.5 4.5; 5.0 6.0 7.0];

In Matlab, I can do this:

[~, ~, subs] = unique(A, 'stable');
C = accumarray(subs, B(:), [], @mean);

In R, we could use the function tapply.

What would be the solution in Julia? Many thanks!

mcabbott · October 19, 2021, 7:20pm

One way:

julia> A = [1, 2, 2, 3, 3, 3];

julia> B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8];

julia> using GroupSlices, Statistics  # Statistics provides mean

julia> reduce(vcat, [mean(B[i, :], dims=1) for i in groupinds(A)])
3×3 Matrix{Float64}:
 1.0  2.0  3.0
 2.5  3.5  4.5
 5.0  6.0  7.0

There might be nicer ways to do this using DataFrames, instead of just matrices.

carstenbauer · October 19, 2021, 7:29pm

This gives an error in Julia.

Also, I don’t quite understand what you mean by

I would want to calculate the average of the duplicate rows based on Index A

Can you explain the desired operation a bit more?

leon · October 19, 2021, 7:35pm

Sorry for the error. I have fixed it.

Big thanks to mcabbott, who has offered a perfect solution to this question!

I was trying to say that when several rows share the same value in index A, the values in those rows of B are averaged column by column.
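
As a small worked illustration of that description (a sketch using only Base and the Statistics standard library; the variable names `rows2` and `avg2` are made up for this example), the group with index 2 covers rows 2 and 3 of B, and its column-wise average is the second row of the desired C:

```julia
using Statistics

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# Rows of B whose index in A equals 2
rows2 = findall(==(2), A)          # [2, 3]

# Column-wise mean of those rows, kept as a 1×3 matrix
avg2 = mean(B[rows2, :], dims=1)   # [2.5 3.5 4.5]
```

Repeating this for every distinct value in A and stacking the results gives the full matrix C.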

leon · October 19, 2021, 7:44pm

@mcabbott,

Many thanks again for the nice solution! In reality, my A index is a matrix composed of 3 columns:
A = Data[:, [21, 25, 35] ];

In this case, how do I specify the for i in groupinds(A) syntax?

mcabbott · October 19, 2021, 7:52pm

I’m not certain I follow, but you can ask for unique rows like this:

julia> A = [1, 2, 2, 3, 3, 3];

julia> groupinds(A)
3-element Vector{Vector{Int64}}:
 [1]
 [2, 3]
 [4, 5, 6]

julia> A2 = hcat(A, [10, 99, 99, 99, 4, 4])
6×2 Matrix{Int64}:
 1  10
 2  99
 2  99
 3  99
 3   4
 3   4

julia> unique(A2, dims=1)
4×2 Matrix{Int64}:
 1  10
 2  99
 3  99
 3   4

julia> groupinds(groupslices(A2, dims=1))
4-element Vector{Vector{Int64}}:
 [1]
 [2, 3]
 [4]
 [5, 6]
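
For readers who prefer to avoid a package, the same kind of row grouping can be sketched with a Dict in Base. This is only an illustration under the assumption that groups should appear in order of first occurrence; `group_row_indices` is a hypothetical helper name, not part of GroupSlices:

```julia
using Statistics

# Hypothetical helper: collect the row numbers of identical rows of A,
# with groups listed in order of first appearance.
function group_row_indices(A::AbstractMatrix)
    groups = Vector{Vector{Int}}()
    slot = Dict{Vector{eltype(A)}, Int}()   # row contents => position in `groups`
    for (i, row) in enumerate(eachrow(A))
        k = get!(slot, collect(row)) do
            push!(groups, Int[])            # first time this row is seen
            length(groups)
        end
        push!(groups[k], i)
    end
    return groups
end

A2 = hcat([1, 2, 2, 3, 3, 3], [10, 99, 99, 99, 4, 4])
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

groups = group_row_indices(A2)   # [[1], [2, 3], [4], [5, 6]]

# Average the rows of B within each group and stack the 1×3 results
C2 = reduce(vcat, [mean(B[g, :], dims=1) for g in groups])
```

Here `C2` has four rows, one per distinct row of `A2`.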

leon · October 19, 2021, 7:55pm

That’s exactly what I am looking for! Many thanks …

rafael.guerra · October 20, 2021, 10:26am

The explanation was not very clear, but basically the goal is to average the rows of matrix B that share the same index in vector A.

Suggestion of a package-less alternative, found thanks to Gabriel Fauré’s Requiem:

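using Statistics  # provides mean

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# `ai .∈ A` builds an elementwise mask, equivalent to `A .== ai` here
C = vcat([mean(B[ai .∈ A,:], dims=1) for ai in unique(A)]...)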

3×3 Matrix{Float64}:
 1.0  2.0  3.0
 2.5  3.5  4.5
 5.0  6.0  7.0

aplavin · October 20, 2021, 5:13pm

Yet another problem discussed here these days with a direct solution using the StructArrays and SplitApplyCombine packages (:

using StructArrays
using SplitApplyCombine
using Statistics

# original arrays
A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# combine A and rows of B into a single array
AB = StructArray(; A, B=splitdims(B, 1))

# compute the desired per-group means
C = map(groupview(x -> x.A, AB)) do gr
	mean(gr.B)
end

# C is a dictionary
# access values as e.g. C[2]

# the same with a nicer piping syntax:
using DataPipes

@p AB |> groupview(_.A) |> map(mean(_.B))

sijo · October 20, 2021, 6:09pm

And with DataFrames:

using DataFrames
using Statistics

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

df = DataFrame(; A, B=collect(eachrow(B)))

combine(groupby(df, :A), :B => Ref∘mean)

3×2 DataFrame
 Row │ A      B_Ref_mean      
     │ Int64  Array…          
─────┼────────────────────────
   1 │     1  [1.0, 2.0, 3.0]
   2 │     2  [2.5, 3.5, 4.5]
   3 │     3  [5.0, 6.0, 7.0]
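
If a plain matrix like the original C is wanted, a result stored as one vector per group can be stacked back into rows. A minimal sketch in Base, where `rows` simply stands in for the per-group vectors shown above (it is typed out by hand here, not extracted from the DataFrame):

```julia
# Per-group means as a vector of vectors (stand-in for the column above)
rows = [[1.0, 2.0, 3.0], [2.5, 3.5, 4.5], [5.0, 6.0, 7.0]]

# hcat puts each vector in a column; permutedims turns columns into rows
C = permutedims(reduce(hcat, rows))
```

The same conversion should apply to any solution in this thread that returns one vector per group.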

leon · January 1, 2022, 4:16pm

Very cool! What if A is a multiple column matrix? Instead of relying on unique(A), it will require unique rows, or unique(A, dims=1). How would I write the program then?

Thanks.

aplavin · January 1, 2022, 9:33pm

If you are fine with using packages outside of Base, my solution (above) only requires a small modification:
replace AB = StructArray(; A, B=splitdims(B, 1)) with AB = StructArray(; A=splitdims(A, 1), B=splitdims(B, 1)).

rocco_sprmnt21 · January 1, 2022, 10:39pm

This should work whether A is a vector or a matrix:

using SplitApplyCombine, Statistics
mean.(group(first, last, zip(eachrow(A),eachrow(B))))

rafael.guerra · January 1, 2022, 11:03pm

@leon, your last question (What if A is a multiple column matrix?) requires pointing to the Matlab documentation for the accumarray() function and specifying exactly what you need, as the range of available features and options seems to be massive.

Here is a quick & dirty attempt that worked for 2 out of the 3 Matlab examples tried from the link above. The failure occurred for the Int8 example. I tried to use reduce(+,...) instead of sum() to avoid auto-promotion but I got something else. Actually, I do not understand what Matlab is doing in that example.

function accumarray1(A::Matrix{Int64}, B::AbstractArray, fun::Function, T::Type)
  N = size(A,2)
  mx = maximum(A, dims=1)
  C = zeros(T, mx[1], mx[2:N]...)
  if fun == sum
    Ci = vcat([reduce(+, B[i .∈ A[:,1],:], dims=1) for i in unique(A[:,1])]...)
  else
    Ci = vcat([fun(B[i .∈ A[:,1],:], dims=1) for i in unique(A[:,1])]...)
  end
  for ri in collect(eachrow(A))
    C[ri...] = Ci[ri[1]]
  end
  return C
end

# Matlab example-1: OK
B = 1:6    # data input
A = [1 1; 2 2; 3 2; 1 1; 2 2; 4 1]    # indices on first column, output to row N-d index
accumarray1(A, B, sum, Int64)

4×2 Matrix{Int64}:
 5  0
 0  7
 0  3
 6  0

# Matlab example-2: OK
using Statistics
B = [100.1, 101.2, 103.4, 102.8, 100.9, 101.5]
A = [1 1; 1 1; 2 2; 3 2; 2 2; 3 2]
accumarray1(A, B, var, Float64)

3×2 Matrix{Float64}:
 0.605  0.0
 0.0    3.125
 0.0    0.845


# Matlab example-3: Not OK, but do not understand Matlab output with 4 different values?
B = Int8.(10:15)
A = [1 1 1; 1 1 1; 1 1 2; 1 1 2; 2 3 1; 2 3 2]
accumarray1(A, B, sum, Int8)

2×3×2 Array{Int8, 3}:
[:, :, 1] =
 46  0   0
  0  0  29

[:, :, 2] =
 46  0   0
  0  0  29
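
One possible explanation for the third example: accumarray1 groups the data by the first column of A only, so rows whose full subscripts differ (such as 1 1 1 and 1 1 2) end up in the same group. Below is a minimal sketch that instead groups by the complete subscript row. It is an assumption about the intended behavior, not a port of Matlab's accumarray, and `accumarray_sketch` is a hypothetical name:

```julia
# Hypothetical helper: apply `f` to the values that share a full subscript row
# and store the result at that position of an N-d array of element type T.
function accumarray_sketch(subs::AbstractMatrix{<:Integer}, vals::AbstractVector,
                           f, ::Type{T}) where {T}
    groups = Dict{Vector{Int}, Vector{eltype(vals)}}()
    for (row, v) in zip(eachrow(subs), vals)
        push!(get!(groups, collect(row), eltype(vals)[]), v)
    end
    dims = Tuple(maximum(subs, dims=1))   # output size: largest subscript per column
    C = zeros(T, dims)                    # cells without data stay zero
    for (idx, vs) in groups
        C[idx...] = f(vs)
    end
    return C
end

subs = [1 1 1; 1 1 1; 1 1 2; 1 1 2; 2 3 1; 2 3 2]
C3 = accumarray_sketch(subs, Int8.(10:15), sum, Int8)
# C3[1,1,1] == 21, C3[1,1,2] == 25, C3[2,3,1] == 14, C3[2,3,2] == 15
```

With this grouping the third example produces four distinct non-zero values, and the first two examples give the same results as shown above.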

leon · January 1, 2022, 11:32pm

Many thanks all for the alternative solutions. It seems that my current approach is still one of the fastest ways of doing this:

B2 = reduce(vcat, [mean(B[i, :], dims=1) for i in groupinds(groupslices(A, dims=1))]);

rocco_sprmnt21 · January 3, 2022, 10:36pm

The following versions (for A as a vector and A as a matrix) seem quite competitive with respect to vcat (…), at least for the small matrices tested.


function meangrpslm(A,B)
	grp=Dict{Vector{Int64},Tuple{Array{Int64},Int64}}()
	for (i, r) in enumerate(eachrow(A))
		haskey(grp,r) ? grp[r]=grp[r].+(B[i,:],1) : grp[r]=(B[i,:],1)
	end
	[first(g)/last(g) for g in values(grp)]
end


function meangrpslv(Arr,B)
	grp=Dict{Int64,Tuple{Array{Int64},Int64}}()
	for (i, r) in enumerate(Arr)
		haskey(grp,r) ? grp[r]=grp[r].+(B[i,:],1) : grp[r]=(B[i,:],1)
	end
	[first(g)/last(g) for g in values(grp)]
end
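
One thing to keep in mind with the functions above is that values(grp) on a Dict comes back in no guaranteed order, so the means may not line up with the order of the keys in A. A minimal single-pass sketch that keeps groups in order of first appearance and returns a matrix; `mean_by_key` is a hypothetical name and the Float64 accumulation is an assumption made for simplicity:

```julia
# Hypothetical helper: running sums and counts per key, in first-seen order.
function mean_by_key(labels::AbstractVector, B::AbstractMatrix)
    slot = Dict{eltype(labels), Int}()    # key => index of its group
    sums = Vector{Vector{Float64}}()
    counts = Int[]
    for (i, k) in enumerate(labels)
        j = get!(slot, k) do
            push!(sums, zeros(size(B, 2)))   # new group: start from zero
            push!(counts, 0)
            length(counts)
        end
        sums[j] .+= @view B[i, :]
        counts[j] += 1
    end
    # One row per group: divide each sum by its count, then stack as rows
    return permutedims(reduce(hcat, sums ./ counts))
end

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]
Cm = mean_by_key(A, B)   # [1.0 2.0 3.0; 2.5 3.5 4.5; 5.0 6.0 7.0]
```

For a matrix A, the rows can be turned into keys first, for example with `[Tuple(r) for r in eachrow(A)]`.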
