Fast way to calculate statistics of an array based on values (labels) of another array

Hey,
I have two arrays of the same size (10000x10000). One array include labels as result of an segmentation. What I want is to compute statistics for each of these segments based on the 2nd array. My approach is very slow.

function seg_stat(array, labels)
    d = Dict()
    label_ids = unique(labels)
    for label in label_ids
        indices = findall(labels .== label)
        d[label] = median(@view array[indices])
    end
    return d
end

Any ideas how to make this more faster?
Thank you!

You can get a nice speed-up by declaring a precise type for the dictionary with

d = Dict{eltype(labels), eltype(array)}()

In my small test this divides the time by two.

I get another 30% savings by using DataFrames instead:

using DataFrames

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
df = DataFrame(array=vec(a), labels=vec(b), copycols=false)
combine(groupby(df, :labels), :array => median)

Not sure if you will get the same savings with your real dataset though…

3 Likes

@sijo, this is a nice example of the power of dataframes but, how to easily use the results?

For example, what is a lighter way to get the scalar results for each label, rather than doing:

dg = combine(groupby(df, :labels), :array => median)

dg[dg.labels .== "b", :].array_median[1]

How many different labels there are?

1 Like

Now we have to reconstruct test cases from the description;) BTW: found the question also interesting, but there is not enough input for me.

A straightforward and performant solution:

using Statistics
using SplitApplyCombine
using StructArrays

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))

ab = StructArray(; a, b)

# dictionary of per-label medians:
meds = map(groupview(x -> x.b, ab)) do gr
	median(gr.a)
end
# retrieve median for a single label:
meds["c"]

# same code with a nicer piping syntax:
using DataPipes

meds = @p begin
	StructArray(; a, b)
	groupview(_.b)
	map(median(_.a))
end
3 Likes

Thanks a lot @sijo and @aplavin for your suggestions. In this case the segment array consist of ~500000 unique labels. The DataFrame solution performs a little bit faster but both solutions are much faster than my loop. Thank you very much for your help!

julia> @benchmark combine(groupby(df, :labels), :array => StatsBase.mode)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.667 s …   3.797 s  β”Š GC (min … max): 0.04% … 8.58%
 Time  (median):     3.732 s              β”Š GC (median):    4.39%
 Time  (mean Β± Οƒ):   3.732 s Β± 92.293 ms  β”Š GC (mean Β± Οƒ):  4.39% Β± 6.04%

  β–ˆ                                                       β–ˆ  
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  3.67 s         Histogram: frequency by time         3.8 s <

 Memory estimate: 2.09 GiB, allocs estimate: 2281604.




julia> @benchmark meds = map(groupview(x -> x.b, ab)) do gr
               median(gr.a)
       end
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 5.612 s (18.35% GC) to evaluate,
 with a memory estimate of 4.99 GiB, over 3688493 allocations.

2 Likes

For a small number of labels you could do something like

medians = only(permutedims(dg, :labels))

julia> medians.b  # or medians[:b]
0.5011299410448031

or creating a named tuple manually:

nt = NamedTuple(Symbol.(dg.labels) .=> dg.array_median)

julia> nt.b
0.5011299410448031

In case of many labels it’s probably better to use a Dict:

d = Dict(dg.labels .=> dg.array_median)

julia> d["b"]
0.5011299410448031

# Or maybe an iterator is better in case of very many labels:
d = Dict(l => m for (l, m) in eachrow(dg))

Another option is to keep the groups, so we can use the group keys:

g = combine(groupby(df, :labels), :array => median, ungroup=false)

julia> g[("b",)][1, :array_median]
0.5011299410448031
1 Like