Fast way to calculate statistics of an array based on values (labels) of another array

Bjorn_De · October 19, 2021, 5:19pm

Hey,
I have two arrays of the same size (10000x10000). One array include labels as result of an segmentation. What I want is to compute statistics for each of these segments based on the 2nd array. My approach is very slow.

function seg_stat(array, labels)
    d = Dict()
    label_ids = unique(labels)
    for label in label_ids
        indices = findall(labels .== label)
        d[label] = median(@view array[indices])
    end
    return d
end

Any ideas how to make this more faster?
Thank you!

sijo · October 19, 2021, 6:34pm

You can get a nice speed-up by declaring a precise type for the dictionary with

d = Dict{eltype(labels), eltype(array)}()

In my small test this divides the time by two.

I get another 30% savings by using DataFrames instead:

using DataFrames

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
df = DataFrame(array=vec(a), labels=vec(b), copycols=false)
combine(groupby(df, :labels), :array => median)

Not sure if you will get the same savings with your real dataset though…

rafael.guerra · October 19, 2021, 9:00pm

@sijo, this is a nice example of the power of dataframes but, how to easily use the results?

For example, what is a lighter way to get the scalar results for each label, rather than doing:

dg = combine(groupby(df, :labels), :array => median)

dg[dg.labels .== "b", :].array_median[1]

lmiq · October 19, 2021, 9:25pm

How many different labels there are?

goerch · October 19, 2021, 10:33pm

Now we have to reconstruct test cases from the description;) BTW: found the question also interesting, but there is not enough input for me.

aplavin · October 20, 2021, 5:32am

A straightforward and performant solution:

using Statistics
using SplitApplyCombine
using StructArrays

a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))

ab = StructArray(; a, b)

# dictionary of per-label medians:
meds = map(groupview(x -> x.b, ab)) do gr
	median(gr.a)
end
# retrieve median for a single label:
meds["c"]

# same code with a nicer piping syntax:
using DataPipes

meds = @p begin
	StructArray(; a, b)
	groupview(_.b)
	map(median(_.a))
end

Bjorn_De · October 20, 2021, 6:46am

Thanks a lot @sijo and @aplavin for your suggestions. In this case the segment array consist of ~500000 unique labels. The DataFrame solution performs a little bit faster but both solutions are much faster than my loop. Thank you very much for your help!

julia> @benchmark combine(groupby(df, :labels), :array => StatsBase.mode)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.667 s …   3.797 s  ┊ GC (min … max): 0.04% … 8.58%
 Time  (median):     3.732 s              ┊ GC (median):    4.39%
 Time  (mean ± σ):   3.732 s ± 92.293 ms  ┊ GC (mean ± σ):  4.39% ± 6.04%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.67 s         Histogram: frequency by time         3.8 s <

 Memory estimate: 2.09 GiB, allocs estimate: 2281604.




julia> @benchmark meds = map(groupview(x -> x.b, ab)) do gr
               median(gr.a)
       end
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 5.612 s (18.35% GC) to evaluate,
 with a memory estimate of 4.99 GiB, over 3688493 allocations.

sijo · October 20, 2021, 7:17am

rafael.guerra:

what is a lighter way to get the scalar results for each label, rather than doing:
dg = combine(groupby(df, :labels), :array => median)

dg[dg.labels .== "b", :].array_median[1]

For a small number of labels you could do something like

medians = only(permutedims(dg, :labels))

julia> medians.b  # or medians[:b]
0.5011299410448031

or creating a named tuple manually:

nt = NamedTuple(Symbol.(dg.labels) .=> dg.array_median)

julia> nt.b
0.5011299410448031

In case of many labels it’s probably better to use a Dict:

d = Dict(dg.labels .=> dg.array_median)

julia> d["b"]
0.5011299410448031

# Or maybe an iterator is better in case of very many labels:
d = Dict(l => m for (l, m) in eachrow(dg))

Another option is to keep the groups, so we can use the group keys:

g = combine(groupby(df, :labels), :array => median, ungroup=false)

julia> g[("b",)][1, :array_median]
0.5011299410448031

Topic		Replies	Views
Split-Apply-Combine in arrays New to Julia splitapplycombine	9	618	April 28, 2023
Split-Apply-Combine on many columns at once? Looking for equivalent to Stata's collapse New to Julia question , dataframes	7	2038	November 9, 2021
Efficiently finding the frequency of patterns in DataFrame columns New to Julia dataframes , dictionaries , splitapplycombine	12	1554	January 1, 2022
Easier way to split-apply-combine in DataFrames.jl General Usage dataframes	5	1111	December 14, 2020
SplitApplyCombine.jl `group` enhancements, reaches version 1.0.0 Data announcement , splitapplycombine	1	875	December 25, 2019

Fast way to calculate statistics of an array based on values (labels) of another array

Related topics