What's Julia's solution to tapply or accumarray?

I have a one-column index vector A, as below:
A = [1, 2, 2, 3, 3, 3];

I have a matrix B, as below:
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8];

I want to average the rows of B that share the same index in A. The result C would be:
C = [1.0 2.0 3.0; 2.5 3.5 4.5; 5.0 6.0 7.0];

In Matlab, I can do this:

[~, ~, subs] = unique(A, 'stable');
C = accumarray(subs, B(:), [], @mean);

In R, we could use the function tapply.

What would be the solution in Julia? Many thanks!

One way:

julia> A = [1, 2, 2, 3, 3, 3];

julia> B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8];

julia> using GroupSlices, Statistics  # mean lives in the Statistics standard library

julia> reduce(vcat, [mean(B[i, :], dims=1) for i in groupinds(A)])
3×3 Matrix{Float64}:
 1.0  2.0  3.0
 2.5  3.5  4.5
 5.0  6.0  7.0

There might be nicer ways to do this using DataFrames, instead of just matrices.


This gives an error in Julia.

Also, I don’t quite understand what you mean by

I would want to calculate the average of the duplicate rows based on Index A

Can you explain the desired operation a bit more?


Sorry for the error. I have fixed it.

Big thanks to mcabbott, who has answered this question perfectly!

What I meant is this: rows of B that share the same index in A are treated as duplicates, and their values are averaged column by column.
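To make the operation concrete, here is a minimal sketch for a single group, using only the Statistics standard library. Index 2 appears in rows 2 and 3 of A, so those two rows of B are averaged column by column. The names group2 and avg2 are just for illustration.

```julia
using Statistics

B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# Rows 2 and 3 of B both carry index 2 in A, so they form one group
group2 = B[2:3, :]

# Column-wise mean of that group, a 1×3 matrix holding 2.5, 3.5, 4.5
avg2 = mean(group2, dims=1)
```

This matches the second row of the desired C.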

@mcabbott,

Many thanks again for the nice solution! In reality, my A index is a matrix composed of 3 columns:
A = Data[:, [21, 25, 35] ];

In this case, how do I specify the for i in groupinds(A) syntax?

I’m not certain I follow, but you can ask for unique rows like this:

julia> A = [1, 2, 2, 3, 3, 3];

julia> groupinds(A)
3-element Vector{Vector{Int64}}:
 [1]
 [2, 3]
 [4, 5, 6]

julia> A2 = hcat(A, [10, 99, 99, 99, 4, 4])
6×2 Matrix{Int64}:
 1  10
 2  99
 2  99
 3  99
 3   4
 3   4

julia> unique(A2, dims=1)
4×2 Matrix{Int64}:
 1  10
 2  99
 3  99
 3   4

julia> groupinds(groupslices(A2, dims=1))
4-element Vector{Vector{Int64}}:
 [1]
 [2, 3]
 [4]
 [5, 6]

That’s exactly what I am looking for! Many thanks …

The explanation was not very clear, but it is basically this: average the rows of matrix B according to the list of indices in vector A.

Suggestion of a package-less alternative, found thanks to Gabriel Fauré’s Requiem:

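using Statistics  # provides mean

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# For each distinct index, average the matching rows of B, then stack the 1×3 results
C = reduce(vcat, [mean(B[A .== ai, :], dims=1) for ai in unique(A)])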

3×3 Matrix{Float64}:
 1.0  2.0  3.0
 2.5  3.5  4.5
 5.0  6.0  7.0
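The comprehension above rescans A once per distinct index. A possible single-pass variant, closer in spirit to accumarray, keeps a running sum and a count per group. This is only a sketch: groupmean_onepass is a made-up name, and it assumes A holds positive integers with every value from 1 to maximum(A) present (a missing value would give a NaN row from 0/0).

```julia
# Hypothetical helper, not from any package.
# Assumes A contains positive integers and every value 1:maximum(A) occurs.
function groupmean_onepass(A::AbstractVector{<:Integer}, B::AbstractMatrix)
    n = maximum(A)
    sums = zeros(Float64, n, size(B, 2))   # running column sums per group
    counts = zeros(Int, n)                 # number of rows seen per group
    for (i, g) in enumerate(A)
        @views sums[g, :] .+= B[i, :]
        counts[g] += 1
    end
    return sums ./ counts                  # divide each row by its group size
end

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]
C = groupmean_onepass(A, B)
```

Here the group index doubles as the output row, so the result is ordered by index value rather than by first appearance.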

Yet another problem discussed here recently that has a direct solution with the StructArrays and SplitApplyCombine packages :)

using StructArrays
using SplitApplyCombine
using Statistics

# original arrays
A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

# combine A and rows of B into a single array
AB = StructArray(; A, B=splitdims(B, 1))

# compute the desired per-group means
C = map(groupview(x -> x.A, AB)) do gr
	mean(gr.B)
end

# C is a dictionary
# access values as e.g. C[2]

# the same with a nicer piping syntax:
using DataPipes

@p AB |> groupview(_.A) |> map(mean(_.B))

And with DataFrames:

using DataFrames
using Statistics

A = [1, 2, 2, 3, 3, 3]
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]

df = DataFrame(; A, B=collect(eachrow(B)))

combine(groupby(df, :A), :B => Ref∘mean)

3×2 DataFrame
 Row │ A      B_Ref_mean
     │ Int64  Array…
─────┼────────────────────────
   1 │     1  [1.0, 2.0, 3.0]
   2 │     2  [2.5, 3.5, 4.5]
   3 │     3  [5.0, 6.0, 7.0]

Very cool! What if A is a multiple column matrix? Instead of relying on unique(A), it will require unique rows, or unique(A, dims=1). How would I write the program then?

Thanks.

If you are fine with using packages outside of Base, my solution (above) only requires a small modification:
replace AB = StructArray(; A, B=splitdims(B, 1)) with AB = StructArray(; A=splitdims(A, 1), B=splitdims(B, 1)).


This should work whether A is a vector or a matrix:

using SplitApplyCombine, Statistics
mean.(group(first, last, zip(eachrow(A),eachrow(B))))
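For completeness, a package-free sketch for the case where A is a multi-column matrix might look like the following. It relies only on unique(A, dims=1) from Base and mean from Statistics. The name groupmean_rows is hypothetical, and the example matrix A2 is the one from the groupslices post earlier in the thread.

```julia
using Statistics

# Hypothetical helper: average the rows of B over groups of identical rows of A.
# Returns the unique key rows together with one mean row per key.
function groupmean_rows(A::AbstractMatrix, B::AbstractMatrix)
    ukeys = unique(A, dims=1)                  # one row per group
    means = map(eachrow(ukeys)) do k
        mask = [r == k for r in eachrow(A)]    # which rows of A equal this key
        mean(B[mask, :], dims=1)
    end
    return ukeys, reduce(vcat, means)
end

A2 = hcat([1, 2, 2, 3, 3, 3], [10, 99, 99, 99, 4, 4])
B = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7; 6 7 8]
ukeys, C = groupmean_rows(A2, B)
```

This scans A once per unique row, so it is probably slower than the groupslices or Dict-based approaches on large inputs, but it needs nothing outside the standard library.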

@leon, your last question (What if A is a multiple column matrix?) really requires pointing to the Matlab documentation for the accumarray() function and specifying exactly what you need, because the set of available features and options is extensive.

Here is a quick and dirty attempt that worked for 2 out of the 3 Matlab examples tried from the link above. The failure occurred for the Int8 example. I tried to use reduce(+,...) instead of sum() to avoid auto-promotion, but I got something else. Actually, I do not understand what Matlab is doing in that example.

function accumarray1(A::Matrix{Int64}, B::AbstractArray, fun::Function, T::Type)
  N = size(A,2)
  mx = maximum(A, dims=1)
  C = zeros(T, mx[1], mx[2:N]...)
  if fun == sum
    Ci = vcat([reduce(+, B[i .∈ A[:,1],:], dims=1) for i in unique(A[:,1])]...)
  else
    Ci = vcat([fun(B[i .∈ A[:,1],:], dims=1) for i in unique(A[:,1])]...)
  end
  for ri in collect(eachrow(A))
    C[ri...] = Ci[ri[1]]
  end
  return C
end

# Matlab example-1: OK
B = 1:6    # data input
A = [1 1; 2 2; 3 2; 1 1; 2 2; 4 1]    # indices on first column, output to row N-d index
accumarray1(A, B, sum, Int64)

4×2 Matrix{Int64}:
 5  0
 0  7
 0  3
 6  0

# Matlab example-2: OK
using Statistics
B = [100.1, 101.2, 103.4, 102.8, 100.9, 101.5]
A = [1 1; 1 1; 2 2; 3 2; 2 2; 3 2]
accumarray1(A, B, var, Float64)

3×2 Matrix{Float64}:
 0.605  0.0
 0.0    3.125
 0.0    0.845


# Matlab example-3: Not OK, but do not understand Matlab output with 4 different values?
B = Int8.(10:15)
A = [1 1 1; 1 1 1; 1 1 2; 1 1 2; 2 3 1; 2 3 2]
accumarray1(A, B, sum, Int8)

2×3×2 Array{Int8, 3}:
[:, :, 1] =
 46  0   0
  0  0  29

[:, :, 2] =
 46  0   0
  0  0  29
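One possible explanation for the third example: as far as I understand accumarray, values are grouped by the full row of subscripts, not by the first column alone. Under that assumption, a sketch like the one below would reproduce the first example and give four distinct nonzero entries for the Int8 case. The name accumarray_sketch is hypothetical, and this covers only a small part of what the Matlab function offers.

```julia
# Hypothetical helper: accumulate vals into an N-d array, grouping by the
# whole subscript row. Assumes positive integer subscripts.
function accumarray_sketch(subs::AbstractMatrix{<:Integer}, vals::AbstractVector,
                           fun=sum, T::Type=eltype(vals))
    dims = Tuple(vec(maximum(subs, dims=1)))     # output size per dimension
    groups = Dict{Vector{Int}, Vector{eltype(vals)}}()
    for (row, v) in zip(eachrow(subs), vals)
        push!(get!(groups, Vector{Int}(row), eltype(vals)[]), v)
    end
    C = zeros(T, dims)                           # cells with no data stay zero
    for (idx, g) in groups
        C[idx...] = fun(g)                       # one reduced value per cell
    end
    return C
end

# Example 1 from above
C1 = accumarray_sketch([1 1; 2 2; 3 2; 1 1; 2 2; 4 1], 1:6)

# Example 3 from above (Int8 data, three subscript columns)
C3 = accumarray_sketch([1 1 1; 1 1 1; 1 1 2; 1 1 2; 2 3 1; 2 3 2], Int8.(10:15))
```

With this grouping, C3[1,1,1] is 10 + 11, C3[1,1,2] is 12 + 13, and the two remaining entries hold 14 and 15 on their own. For the variance example, passing var and Float64 as the last two arguments should work the same way.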

Many thanks, all, for the alternative solutions. It seems that my current approach is still one of the fastest ways of doing this:

B2 = reduce(vcat, [mean(B[i, :], dims=1) for i in groupinds(groupslices(A, dims=1))]);

The following versions (one for a matrix A, one for a vector A) seem quite competitive compared with the vcat (…) approach, at least for the small matrices tested.


function meangrpslm(A,B)
	grp=Dict{Vector{Int64},Tuple{Array{Int64},Int64}}()
	for (i, r) in enumerate(eachrow(A))
		haskey(grp,r) ? grp[r]=grp[r].+(B[i,:],1) : grp[r]=(B[i,:],1)
	end
	[first(g)/last(g) for g in values(grp)]
end


function meangrpslv(Arr,B)
	grp=Dict{Int64,Tuple{Array{Int64},Int64}}()
	for (i, r) in enumerate(Arr)
		haskey(grp,r) ? grp[r]=grp[r].+(B[i,:],1) : grp[r]=(B[i,:],1)
	end
	[first(g)/last(g) for g in values(grp)]
end