How to count the frequency of each unique character in a string?
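
For a baseline answer to the title question: a minimal (and deliberately unoptimized) sketch in Julia could simply walk the string with a Dict, which handles arbitrary Unicode; `char_counts` is just a name picked for this example:

# Minimal baseline: count every character with a Dict.
function char_counts(s::AbstractString)
    counts = Dict{Char,Int}()
    for c in s
        counts[c] = get(counts, c, 0) + 1
    end
    return counts
end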

I feel like there’s a joke about different programming language communities in here…

  • C: gets fast, simple code that ignores the existence of Unicode
  • C++: gets a fast solution that handles Unicode but uses STL and takes five minutes to compile
  • Haskell: a clever usage of the S, K and I combinators solves this problem quite elegantly
  • Python: everyone agrees that there is one correct way to do it, but five ways are proposed, no one can decide which one is correct, and Guido is tired of deciding what the one correct way to do things is
  • Rust: gets an essay on borrow semantics and mutation
  • Julia: gets twelve different ways to do it, including several that are faster than anyone actually needs to count letters, making use of SIMD, threads, and farming out work to the nearest top-five supercomputer

I am pretty proud of this, as it’s one of my first open-source contributions.


Doesn’t look like that code generates any SIMD instructions?

Hm. That’s odd. There is a definite speedup, though less than expected.

I amend my advice: use @simd :wink:

I am curious too. Wonder why?

Good suggestion! I didn’t see any SIMD instructions either when I tested earlier, but you’re right that it’s considerably faster. Looking at the native code, @simd simply results in a tighter inner loop.
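
For anyone who wants to poke at this themselves, @code_native dumps the generated assembly. The function being benchmarked is defined in an earlier post of the thread, so here is a hypothetical single-counter stand-in (not the original code) that follows the same pattern and can be inspected directly:

# Hypothetical stand-in: one counter array, @simd over the raw code units.
# Assumes an ASCII string (code units 1–127).
function count_bytes_simd(s)
    counts = zeros(Int, 128)
    sv = codeunits(s)
    @inbounds @simd for n in eachindex(sv)
        counts[sv[n]] += 1
    end
    counts
end

@code_native count_bytes_simd("an ascii test string")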

We can optimize it a bit further by converting the code units to an array, and keeping two separate count arrays, which avoids some of the memory dependencies and resulting latency:

function count_limited_range_multi_count(s)
    # Two separate count arrays break the dependency chain between consecutive
    # increments; assumes an ASCII string with an even number of bytes.
    cnt1 = zeros(Int, 128)
    cnt2 = zeros(Int, 128)
    sv = unsafe_wrap(Array, pointer(s), sizeof(s))  # view the bytes without copying
    @inbounds @simd for n = 1:2:length(sv)
        a = sv[n]
        b = sv[n+1]
        cnt1[a] += 1
        cnt2[b] += 1
    end
    cnt1 .+ cnt2  # merge the two partial counts
end
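
For anyone wanting to reproduce the timings below: they use BenchmarkTools’ @btime, and the actual test string `s` is defined in an earlier post, so the string here is only an assumed stand-in (all-ASCII, even length, as the pairwise loop above expects):

using BenchmarkTools

# Hypothetical test data; the thread's real `s` may differ in size and content.
s = repeat("the quick brown fox jumps over the lazy dog. ", 10_000)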

julia> @btime count_limited_range_simd($s);
  59.156 μs (1 allocation: 1.14 KiB)

julia> @btime count_limited_range_multi_count($s);
  44.826 μs (4 allocations: 3.50 KiB)

As for SIMD: with AVX512 one could use gather and scatter to read and write 16 counters at a time, but there’s a problem: if there are duplicate characters within the block, the update will not be correct (explained in more detail here).
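
To see the conflict concretely, here is a scalar Julia sketch of what a vectorized gather/add/scatter would effectively do (illustration only, not actual AVX512 code):

counts = zeros(Int, 128)
idx = [5, 5, 7, 9]            # one block containing a duplicate "character"
gathered = counts[idx]        # gather: [0, 0, 0, 0]
counts[idx] = gathered .+ 1   # scatter: counts[5] ends up as 1, not 2; one update is lost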

Instead, I made an attempt using vload to load 32 characters at a time (suitable for AVX2), with the hope that updating each count would be done in only 2 instructions:

vpextrb	esi, xmm0, 0
add	qword ptr [rdx + 8*rsi - 8], 1

For up to 3 additions, this optimal assembly is generated, but when I unroll the loop with more additions (32 are needed), it starts generating nonsense assembly. The code is below, in case anyone knows how this can be done. It works as it is, just not as fast as it could be:

using SIMD  # provides Vec and vload

# Build a block of `n` unrolled `counts[v[k]] += 1` statements.
macro unrolled_add(n)
    esc(Expr(:block, (k -> :( counts[v[$k]] += 1 )).(1:n)...))
end

function count_limited_range_real_simd(s)
    counts = zeros(Int, 128)
    sv = unsafe_wrap(Array, pointer(s), sizeof(s))  # assumes length is a multiple of 32
    @inbounds @simd for n = 1:32:length(sv)
        v = vload(Vec{32,UInt8}, sv, n)  # load 32 bytes at once
        @unrolled_add(32) # n <= 3 generates ideal assembly
    end
    counts
end
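
As a quick sanity check, the unrolled version should agree with the multi-count version defined earlier; the test string here is hypothetical and chosen so its byte length is a multiple of 32, since the loop above always loads full 32-byte blocks:

s = repeat("the quick brown fox jumps over a", 1_000)  # 32 ASCII bytes, repeated
@assert count_limited_range_real_simd(s) == count_limited_range_multi_count(s)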

Nice! I would be even more impressed if Char and FastHashChar could be made to use the same memory (like union in C), so that we wouldn’t have to rebuild the dictionary at the end. For an alphabet this small, it’s cheap of course, but I think that could be useful in other situations.
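
As a small sketch of the “same memory” idea (FastHashChar’s real definition is elsewhere in the thread, so the type below is purely illustrative): two isbits types of the same size can be converted with reinterpret, with no copying involved:

# Hypothetical 32-bit stand-in for FastHashChar, just to show the mechanism:
primitive type IllustrativeChar 32 end

ic = reinterpret(IllustrativeChar, 'a')  # bit-level conversion, no copy
c  = reinterpret(Char, ic)               # and back to a Char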

Agreed… On the other hand, most of what I learned about Julia came from reading and answering questions like this on this forum. So if not for the OP, it is at least worth it to me :slight_smile:
