Hey,
I have two arrays of the same size (10000x10000). One array include labels as result of an segmentation. What I want is to compute statistics for each of these segments based on the 2nd array. My approach is very slow.
function seg_stat(array, labels)
d = Dict()
label_ids = unique(labels)
for label in label_ids
indices = findall(labels .== label)
d[label] = median(@view array[indices])
end
return d
end
Any ideas how to make this more faster?
Thank you!
using Statistics
using SplitApplyCombine
using StructArrays
a = rand(1000, 1000)
b = rand(["a", "b", "c", "d"], size(a))
ab = StructArray(; a, b)
# dictionary of per-label medians:
meds = map(groupview(x -> x.b, ab)) do gr
median(gr.a)
end
# retrieve median for a single label:
meds["c"]
# same code with a nicer piping syntax:
using DataPipes
meds = @p begin
StructArray(; a, b)
groupview(_.b)
map(median(_.a))
end
Thanks a lot @sijo and @aplavin for your suggestions. In this case the segment array consist of ~500000 unique labels. The DataFrame solution performs a little bit faster but both solutions are much faster than my loop. Thank you very much for your help!
julia> @benchmark combine(groupby(df, :labels), :array => StatsBase.mode)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min β¦ max): 3.667 s β¦ 3.797 s β GC (min β¦ max): 0.04% β¦ 8.58%
Time (median): 3.732 s β GC (median): 4.39%
Time (mean Β± Ο): 3.732 s Β± 92.293 ms β GC (mean Β± Ο): 4.39% Β± 6.04%
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
3.67 s Histogram: frequency by time 3.8 s <
Memory estimate: 2.09 GiB, allocs estimate: 2281604.
julia> @benchmark meds = map(groupview(x -> x.b, ab)) do gr
median(gr.a)
end
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 5.612 s (18.35% GC) to evaluate,
with a memory estimate of 4.99 GiB, over 3688493 allocations.
For a small number of labels you could do something like
medians = only(permutedims(dg, :labels))
julia> medians.b # or medians[:b]
0.5011299410448031
or creating a named tuple manually:
nt = NamedTuple(Symbol.(dg.labels) .=> dg.array_median)
julia> nt.b
0.5011299410448031
In case of many labels itβs probably better to use a Dict:
d = Dict(dg.labels .=> dg.array_median)
julia> d["b"]
0.5011299410448031
# Or maybe an iterator is better in case of very many labels:
d = Dict(l => m for (l, m) in eachrow(dg))
Another option is to keep the groups, so we can use the group keys: