I have a square array, c, with integer entries that I expect to repeat. I would like to get frequency counts of the distinct values, as they represent categorical data. I was wondering what the most efficient way to do this might be. I know that I could flatten the array and use DataFrames.jl, but I have to do this many times, so I'm concerned about introducing unnecessary overhead through those conversions (square array to flat array to data frame).
Something like this?
julia> using StatsBase
julia> M = rand(1:10, 10, 10)
10×10 Array{Int64,2}:
8 8 6 7 5 9 5 10 5 7
5 1 7 3 10 9 8 4 8 2
2 2 3 9 2 7 9 4 8 7
4 6 8 3 6 2 10 5 3 6
8 7 7 6 3 8 1 4 6 6
6 3 5 5 9 6 7 1 7 5
1 4 7 9 5 8 4 2 5 1
6 8 6 7 3 5 1 2 8 10
6 2 9 7 3 6 7 2 6 2
5 9 9 10 4 2 6 7 9 1
julia> StatsBase.countmap(vec(M))
Dict{Int64,Int64} with 10 entries:
7 => 14
4 => 7
9 => 10
10 => 5
2 => 11
3 => 8
5 => 12
8 => 11
6 => 15
1 => 7
Works for me.
Note that if you know in advance that your entries come from a limited set of values, e.g. 1:10, then you can do much better than countmap just by allocating an array of counts and incrementing it as you iterate through your data. For your example above, I get a speedup of more than a factor of 5:
julia> function countmap10(M)
           counts = zeros(Int, 10)
           for x in M
               counts[x] += 1
           end
           return counts
       end
julia> using BenchmarkTools
julia> @btime StatsBase.countmap(vec($M))
628.174 ns (8 allocations: 1.70 KiB)
Dict{Int64,Int64} with 10 entries:
7 => 14
4 => 7
9 => 10
10 => 5
2 => 11
3 => 8
5 => 12
8 => 11
6 => 15
1 => 7
julia> @btime countmap10($M)
115.560 ns (1 allocation: 160 bytes)
10-element Array{Int64,1}:
7
11
8
7
12
15
14
11
10
5
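If the known values do not start at 1, the same idea still works with an index shift. Here is a minimal sketch, assuming the entries lie in a contiguous integer range lo:hi; the name countrange and its arguments are made up for illustration and are not part of StatsBase:

```julia
# Hypothetical helper (not a StatsBase function): count integers known to lie in lo:hi.
function countrange(M::AbstractArray{<:Integer}, lo::Integer, hi::Integer)
    counts = zeros(Int, hi - lo + 1)
    for x in M
        # Shift so that the value lo maps to index 1.
        # A value outside lo:hi throws a BoundsError instead of being miscounted.
        counts[x - lo + 1] += 1
    end
    return counts
end

M = [0 2; 2 3]
counts = countrange(M, 0, 3)  # counts[k] is the frequency of the value lo + k - 1
```

Like countmap10, this allocates only the result vector, so the speedup should carry over as long as the range is small enough to index directly.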
Should countmap be able to take an optional AbstractArray giving the set of possible values?
It seems like there should be a countmap!(counts, array) function that takes any counts object supporting getindex/setindex! (e.g. a Dict or an array or some other data structure).
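To make that suggestion concrete, here is a rough sketch of what such a function could look like. This countmap! is hypothetical and defined here only for illustration; it is not something StatsBase exports. It assumes the counts container also supports get(counts, key, default), which holds for both Dict and Array in Base:

```julia
# Hypothetical countmap!: a sketch of the proposal, not an existing StatsBase function.
# `counts` can be any container supporting get(counts, key, default) and setindex!.
function countmap!(counts, array)
    for x in array
        # get returns 0 for keys not seen yet, so an empty Dict works as a starting point.
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

M = [1 2; 2 3]
d = countmap!(Dict{Int,Int}(), M)   # Dict-backed counts, keys created on demand
v = countmap!(zeros(Int, 3), M)     # array-backed counts, values must be valid indices
countmap!(v, M)                     # reusing the container accumulates further counts
```

Passing a preallocated container also avoids a fresh allocation on every call, which is relevant when the counting is repeated many times.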
If you have too much data to fit into memory, I really recommend OnlineStats.jl's CountMap :). https://github.com/joshday/OnlineStats.jl
https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStatsBase.CountMap