How to count the frequency of each unique character in a string?

Good question! There are lots of ways to do this, some better than others depending on the specifics of your problem. I’ll walk through several methods and compare them. For all methods, I’ll use the following sample string:

using Random  # randstring lives in the Random stdlib

s = randstring(100_000)

One-liner

oneliner(s) = Dict([(c, count(collect(s) .== c)) for c in unique(s)])

julia> @btime oneliner($s)
  28.911 ms (401 allocations: 24.68 MiB)

Minimal code, no external dependencies, but a naive algorithm: it makes one full pass over the string per unique character, so it is O(n·k) for n characters and k unique values. Only consider this if performance doesn’t matter.

Dictionary with haskey

function count_char(s)
    res = Dict{Char, Int}()
    for c in s
        if haskey(res, c)
            res[c] += 1
        else
            res[c] = 1
        end
    end
    return res
end

julia> @btime count_char($s)
  1.684 ms (10 allocations: 5.25 KiB)

A decent implementation, but it can be made both faster and less verbose, since it does three lookups for each duplicate key: first haskey, then a read of the current count, then a write of the updated count.

Dictionary with default getter

function count_char2(s)
    res = Dict{Char, Int}()
    for c in s
        res[c] = get(res, c, 0) + 1
    end
    return res
end

julia> @btime count_char2($s)
  1.320 ms (10 allocations: 5.25 KiB)

A better version than the previous one, since it only does two lookups per character (one read and one write). It’s also less verbose and doesn’t duplicate logic. All-around better.

DataStructures counter

using DataStructures

julia> @btime counter($s)
  1.335 ms (11 allocations: 5.27 KiB)

This is pretty much the same as the previous implementation, but you don’t need to write the code yourself.
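For reference, here is how counter is used. It returns an Accumulator, which behaves like a Dict except that looking up an unseen key returns 0 instead of throwing a KeyError:

```julia
using DataStructures

cnt = counter("hello")  # Accumulator{Char, Int}
cnt['l']                # 2
cnt['z']                # 0 for unseen keys, no KeyError
```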

StatsBase countmap

using StatsBase

julia> @btime countmap([c for c in $s])
  949.163 μs (13 allocations: 395.97 KiB)

countmap in StatsBase is in general the best method I know of for this task, and would be my method of choice in most cases. It uses an introspective algorithm, choosing either a single-lookup dictionary approach or a radix sort depending on the input data. Unfortunately, it doesn’t accept generator input, so collecting into a temporary array wastes some time and memory here.
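A quick sanity check of the interface (note the collect; passing the string or a generator directly raises a MethodError, at least in the StatsBase version benchmarked here):

```julia
using StatsBase

countmap(collect("hello"))  # Dict('h' => 1, 'e' => 1, 'l' => 2, 'o' => 1)
```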

Single lookup dictionary

function count_char_single_lookup(s)
    res = Dict{Char, Int}()
    for c in s
        index = Base.ht_keyindex2!(res, c)
        if index > 0
            @inbounds res.vals[index] += 1
        else
            @inbounds Base._setindex!(res, 1, c, -index)
        end
    end
    res
end

julia> @btime count_char_single_lookup($s)
  765.020 μs (10 allocations: 5.25 KiB)

Extracting the method used by countmap, we can gain some performance. I would be quite wary of using this method in code that you intend to live on for a while, since it exploits the internal, undocumented structure of Dict, which could change at some point in the future.

Dictionary with faster hash

Base.hash(c::Char) = UInt(c)

julia> @btime count_char_single_lookup($s)
  487.873 μs (10 allocations: 5.25 KiB)

The default hash implementation for primitive types is not very efficient. It’s possible to squeeze a bit more performance out of the dictionary-based implementations by using a cheaper hash method. Again, I’d be wary of using this in practice since it redefines the hash for Char globally (type piracy), which can affect unrelated code. I guess one could use a custom Char wrapper type to get around that, but I’m not sure if that can be done in a clean and efficient way (suggestions appreciated).
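To avoid redefining Base.hash for Char globally, one option is a thin wrapper type that owns its own hash method. This is a hypothetical sketch, not benchmarked; the names are my own:

```julia
# Hypothetical wrapper: the fast hash applies only to this type,
# leaving the global hash for Char untouched.
struct FastChar
    c::Char
end

Base.hash(fc::FastChar, h::UInt) = xor(UInt(fc.c), h)
Base.:(==)(a::FastChar, b::FastChar) = a.c == b.c

function count_char_wrapped(s)
    res = Dict{FastChar, Int}()
    for c in s
        k = FastChar(c)
        res[k] = get(res, k, 0) + 1
    end
    # Unwrap back to plain Chars for the caller.
    return Dict(k.c => v for (k, v) in res)
end
```

The wrapper itself is free (it’s an isbits struct), but whether this recovers the full speedup, especially combined with the single-lookup trick above, would need benchmarking.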

StatsBase countmap with radix sort

using StatsBase

julia> @btime countmap([UInt(c) for c in $s]);
  448.074 μs (16 allocations: 1.53 MiB)

To get even faster, we can turn to non-dictionary based approaches. By making the keys integers, countmap switches to a radix sort implementation. At the end, one could trivially change the keys back to Chars (see Kristoffer’s post). Just like above, we could extract this code to avoid creating the temporary array to gain some performance. (Or better yet, add support for generators in StatsBase.)
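Converting the integer keys back to Chars afterwards is a cheap one-liner over the (much smaller) result dictionary:

```julia
using StatsBase

m = countmap([UInt(c) for c in "hello"])
# Rebuild a Char-keyed dictionary from the integer-keyed result.
char_counts = Dict(Char(k) => v for (k, v) in m)
```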

Fixed-size array

function count_limited_range(s)
    # One counter per possible code unit; assumes the input contains
    # no NUL bytes, so every code unit is a valid 1-based index.
    counts = zeros(Int, 128)
    @inbounds for c in codeunits(s)
        counts[c] += 1
    end
    counts
end

julia> @btime count_limited_range($s)
  72.770 μs (1 allocation: 1.14 KiB)

Finally, if the input data is guaranteed to be within a limited range, such as ASCII, by far the fastest way is to use a fixed-size array of counters.
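If the caller expects a Dict{Char, Int} rather than a raw counts array, converting back afterwards is cheap since the array is tiny. A hypothetical helper, assuming (as above) no NUL bytes so counts[i] holds the count for the character with code unit i:

```julia
# Hypothetical helper: turn the fixed-size counts array back into a Dict,
# keeping only the characters that actually occurred.
function counts_to_dict(counts)
    Dict(Char(i) => counts[i] for i in eachindex(counts) if counts[i] > 0)
end
```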
