# Fast way to calculate statistics of an array based on values (labels) of another array

Hey,
I have two arrays of the same size (10000x10000). One array contains labels resulting from a segmentation. What I want is to compute statistics for each of these segments based on the values of the 2nd array. My approach is very slow:

```julia
using Statistics  # needed for `median`

function seg_stat(array, labels)
    d = Dict()
    label_ids = unique(labels)
    for label in label_ids
        indices = findall(labels .== label)
        d[label] = median(@view array[indices])
    end
    return d
end
```

Any ideas how to make this faster?
Thank you!

You can get a nice speed-up by declaring a precise type for the dictionary with

```julia
d = Dict{eltype(labels), eltype(array)}()
```

In my small test this halves the runtime.
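Concretely, the typed-Dict version of the original function could look like this (a sketch; `seg_stat_typed` is just an illustrative name):

```julia
using Statistics

# Same loop as in the original post, but with concrete key/value types
# so dictionary insertions and lookups don't go through `Any`.
function seg_stat_typed(array, labels)
    d = Dict{eltype(labels), eltype(array)}()
    for label in unique(labels)
        indices = findall(labels .== label)
        d[label] = median(@view array[indices])
    end
    return d
end

a = rand(100, 100)
b = rand(["a", "b", "c"], size(a))
stats = seg_stat_typed(a, b)
```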

I get another 30% savings by using DataFrames instead:

```julia
using DataFrames
using Statistics

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
df = DataFrame(array=vec(a), labels=vec(b), copycols=false)
combine(groupby(df, :labels), :array => median)
```

Not sure if you will get the same savings with your real dataset though…

@sudete, this is a nice example of the power of DataFrames, but how can one easily use the results?

For example, what is a lighter way to get the scalar results for each label, rather than doing:

```julia
dg = combine(groupby(df, :labels), :array => median)

dg[dg.labels .== "b", :].array_median
```

How many different labels are there?

Now we have to reconstruct test cases from the description ;) BTW, I also found the question interesting, but there is not enough input for me.

A straightforward and performant solution:

```julia
using Statistics
using SplitApplyCombine
using StructArrays

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))

ab = StructArray(; a, b)

# dictionary of per-label medians:
meds = map(groupview(x -> x.b, ab)) do gr
    median(gr.a)
end
# retrieve median for a single label:
meds["c"]

# same code with a nicer piping syntax:
using DataPipes

meds = @p begin
    StructArray(; a, b)
    groupview(_.b)
    map(median(_.a))
end
```

Thanks a lot @sudete and @aplavin for your suggestions. In this case the segment array consists of ~500000 unique labels. The DataFrame solution performs a little bit faster, but both solutions are much faster than my loop. Thank you very much for your help!

```
julia> @benchmark combine(groupby(df, :labels), :array => StatsBase.mode)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.667 s …   3.797 s  ┊ GC (min … max): 0.04% … 8.58%
 Time  (median):     3.732 s              ┊ GC (median):    4.39%
 Time  (mean ± σ):   3.732 s ± 92.293 ms  ┊ GC (mean ± σ):  4.39% ± 6.04%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.67 s         Histogram: frequency by time         3.8 s <

 Memory estimate: 2.09 GiB, allocs estimate: 2281604.

julia> @benchmark meds = map(groupview(x -> x.b, ab)) do gr
           median(gr.a)
       end
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 5.612 s (18.35% GC) to evaluate,
 with a memory estimate of 4.99 GiB, over 3688493 allocations.
```
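For reference on why the original loop is slow with that many labels: `findall(labels .== label)` scans all 10^8 elements once per label, so ~500000 labels means ~500000 full passes. A single sweep that pushes each value into a per-label bucket visits every element exactly once, similar in spirit to what the grouping libraries do internally. A rough sketch (`seg_stat_onepass` is a made-up name, not from this thread):

```julia
using Statistics

# One pass over the data: collect values per label, then reduce each bucket.
function seg_stat_onepass(array, labels)
    buckets = Dict{eltype(labels), Vector{eltype(array)}}()
    for i in eachindex(array, labels)
        # get! creates an empty bucket the first time a label is seen
        push!(get!(() -> eltype(array)[], buckets, labels[i]), array[i])
    end
    return Dict(k => median(v) for (k, v) in buckets)
end

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
meds = seg_stat_onepass(a, b)
meds["c"]  # median of all entries of `a` where `b == "c"`
```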

For a small number of labels you could do something like

```julia
medians = only(permutedims(dg, :labels))

julia> medians.b  # or medians[:b]
0.5011299410448031
```

or creating a named tuple manually:

```julia
nt = NamedTuple(Symbol.(dg.labels) .=> dg.array_median)

julia> nt.b
0.5011299410448031
```

In case of many labels it’s probably better to use a Dict:

```julia
d = Dict(dg.labels .=> dg.array_median)

julia> d["b"]
0.5011299410448031

# Or maybe an iterator is better in case of very many labels:
d = Dict(l => m for (l, m) in eachrow(dg))
```

Another option is to keep the groups, so we can use the group keys:

```julia
g = combine(groupby(df, :labels), :array => median, ungroup=false)

julia> g[("b",)][1, :array_median]
0.5011299410448031
```