Fast way to calculate statistics of an array based on values (labels) of another array

You can get a nice speed-up by declaring a precise type for the dictionary with

d = Dict{eltype(labels), eltype(array)}()

In my small test this divides the time by two.

I get another 30% savings by using DataFrames instead:

using DataFrames

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
df = DataFrame(array=vec(a), labels=vec(b), copycols=false)
combine(groupby(df, :labels), :array => median)

Not sure if you will get the same savings with your real dataset though…

3 Likes