Good question! There are lots of ways to do this, some better than others depending on the specifics of your problem. I’ll walk through several methods and compare them. For all methods, I’ll use the following sample string:
using Random
s = randstring(100_000)
One-liner
oneliner(s) = Dict([(c, count(collect(s) .== c)) for c in unique(s)])
julia> @btime oneliner($s)
28.911 ms (401 allocations: 24.68 MiB)
Minimal code. No external dependencies. Stupid algorithm. Only consider this if performance doesn’t matter.
Dictionary with haskey
count_char(s) = begin
    res = Dict{Char, Int}()
    for c in s
        if haskey(res, c)
            res[c] += 1
        else
            res[c] = 1
        end
    end
    res
end
julia> @btime count_char($s)
1.684 ms (10 allocations: 5.25 KiB)
Decent implementation, but it can be made both faster and less verbose, since it does three lookups for each duplicate key (first haskey, then obtaining the current count, then writing the updated count).
Dictionary with default getter
function count_char2(s)
    res = Dict{Char, Int}()
    for c in s
        res[c] = get(res, c, 0) + 1
    end
    return res
end
julia> @btime count_char2($s)
1.320 ms (10 allocations: 5.25 KiB)
A better version than the previous one, since it only does two lookups (one read and one write). It’s also less verbose and doesn’t duplicate logic. All-around better.
DataStructures counter
using DataStructures
julia> @btime counter($s)
1.335 ms (11 allocations: 5.27 KiB)
This is pretty much the same as the previous implementation, but you don’t need to write the code yourself.
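A quick usage sketch: counter accepts any iterable and returns an Accumulator, which reads as zero for keys it hasn’t seen:

julia> cnt = counter("hello");

julia> cnt['l']
2

julia> cnt['z']
0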
StatsBase countmap
using StatsBase
julia> @btime countmap([c for c in $s])
949.163 μs (13 allocations: 395.97 KiB)
countmap in StatsBase is in general the best method I know of for this task, and would be my method of choice in most cases. It uses an introspective algorithm, picking either a single-lookup dictionary approach or a radix sort depending on the input data. Unfortunately, it doesn’t accept generator input, which wastes some time and memory in this case.
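Since countmap wants an array, the string has to be materialized first, either with the comprehension above or with collect:

counts = countmap(collect(s))  # Dict{Char, Int}; collect makes a temporary Vector{Char}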
Single lookup dictionary
function count_char_single_lookup(s)
    res = Dict{Char, Int}()
    for c in s
        # ht_keyindex2! returns the slot index if the key exists,
        # or the negated insertion slot if it doesn't.
        index = Base.ht_keyindex2!(res, c)
        if index > 0
            @inbounds res.vals[index] += 1
        else
            @inbounds Base._setindex!(res, 1, c, -index)
        end
    end
    res
end
julia> @btime count_char_single_lookup($s)
765.020 μs (10 allocations: 5.25 KiB)
Extracting the method used by countmap, we can gain some performance. I would be quite wary of using this method in code that you intend to keep around for a while, since it exploits the internal, undocumented structure of Dict, which could change at some point in the future.
Dictionary with faster hash
Base.hash(c::Char) = UInt(c)  # beware: globally redefines hashing for Char (type piracy)
julia> @btime count_char_single_lookup($s)
487.873 μs (10 allocations: 5.25 KiB)
The default hash implementation for primitive types is not very efficient. It’s possible to squeeze a bit more performance out of the dictionary-based implementations by using a more efficient hash method. Again, I’d be wary of using this in practice, since it redefines the hash code for Char globally. One could use a custom Char-like type to get around that, but I’m not sure whether that can be done in a clean and efficient way (suggestions appreciated).
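To make the wrapper idea concrete, here’s a minimal, unbenchmarked sketch; FastChar and count_fastchar are hypothetical names, not an established pattern:

struct FastChar
    c::Char
end

# Same cheap hash as above, but scoped to our own type instead of
# pirating Base's method for Char. As far as I know, Dict only calls
# hash with the default seed, so ignoring h here should be safe.
Base.hash(x::FastChar, h::UInt) = UInt(x.c)
Base.:(==)(a::FastChar, b::FastChar) = a.c == b.c

function count_fastchar(s)
    res = Dict{FastChar, Int}()
    for c in s
        k = FastChar(c)
        res[k] = get(res, k, 0) + 1
    end
    res
end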
StatsBase countmap with radix sort
using StatsBase
julia> @btime countmap([UInt(c) for c in $s]);
448.074 μs (16 allocations: 1.53 MiB)
To get even faster, we can turn to non-dictionary based approaches. By making the keys integers, countmap switches to a radix sort implementation. At the end, one could trivially convert the keys back to Chars (see Kristoffer’s post). Just like above, we could extract this code to avoid creating the temporary array and gain some performance. (Or better yet, add support for generators in StatsBase.)
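For completeness, converting the keys back could look something like this (not benchmarked here):

julia> char_counts = Dict(Char(k) => v for (k, v) in countmap([UInt(c) for c in s]));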
Fixed-size array
function count_limited_range(s)
    counts = zeros(Int, 128)  # one counter per ASCII code point
    @inbounds for c in codeunits(s)
        counts[c] += 1
    end
    counts
end
julia> @btime count_limited_range($s)
72.770 μs (1 allocation: 1.14 KiB)
Finally, if the input data is guaranteed to be within a limited range, such as ASCII, by far the fastest way is to use a fixed-size array of counters.
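If you need Char keys afterwards, the counter array converts back cheaply, since each slot’s index is the character’s code:

julia> counts = count_limited_range(s);

julia> Dict(Char(i) => n for (i, n) in enumerate(counts) if n > 0);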