I was calculating some statistics on irregular data stored in a Dict when I learned that StatsBase.countmap does not support generators. Since it is time I learned how to use them, I thought I would implement a method:
using StatsBase
function StatsBase.addcounts!{T}(cm::Dict{T}, g::Base.Generator)
    ## how to make sure T matches eltype of g?
    for v in g
        cm[v] = get(cm, v, 0) + 1
    end
    cm
end

function StatsBase.countmap(g::Base.Generator)
    _eltype = Base.iteratoreltype(g)
    if _eltype == Base.EltypeUnknown()
        _eltype = Any
    end
    addcounts!(Dict{_eltype, Int}(), g)
end
But it is about 3x slower than just collecting the values (note: records and the calculation I do on them are just a toy example, to make my code self-contained):
records = Dict(rand(Int) => rand(Int, rand(1:5)) for i in 1:200000)
using BenchmarkTools
@benchmark countmap(collect(length(v) for v in values(records)))
@benchmark countmap(length(v) for v in values(records))
I suspect my code is not type stable: for one thing, it does not use the element type of the generator. So how can I speed this up? Apologies if this is in the manual; I could not find it.
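To make the suspicion about element types concrete, here is a minimal sketch. The small vector `vs` is a made-up stand-in for `values(records)`, and the variable names are only for illustration. It compares what a generator reports as its element type with what `collect` produces from the same generator:

```julia
# `vs` is a hypothetical stand-in for `values(records)`.
vs = [[1, 2], [3], [4, 5, 6]]
g = (length(v) for v in vs)

# A generator does not advertise an element type,
# so `eltype` falls back to `Any`.
gen_eltype = eltype(g)

# `collect` works out a concrete element type from the values it sees.
collected = collect(g)
arr_eltype = eltype(collected)

println(gen_eltype)   # Any
println(arr_eltype)   # Int (Int64 on a 64-bit machine)
```

If this is representative, my `countmap` method always ends up building a `Dict{Any, Int}`, while the `collect` version gets to work with a concretely typed array, which may account for at least part of the difference.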