I feel like there’s a joke about different programming language communities in here…
C: gets a fast simple code that ignores the existence of Unicode
C++: gets a fast solution that handles Unicode but uses STL and takes five minutes to compile
Haskell: a clever usage of the S, K and I combinators solves this problem quite elegantly
Python: everyone agrees that there one correct way to do it, but five ways are proposed and no one can decide which one is correct and Guido is tired of deciding what the one correct way to do things is
Rust: gets an essay on borrow semantics and mutation
Julia: gets twelve different ways to do it, including several that are faster than anyone actually needs to count letters, making use of SIMD, threads and farming out work to the nearest top five supercomputer
Good suggestion! I didn’t see any SIMD instructions either when I tested earlier, but you’re right it’s considerably faster. Looking at the native code, @simd simply results in a tighter inner loop.
We can optimize it a bit further by converting the code units to an array, and keeping two separate count arrays, which avoids some of the memory dependencies and resulting latency:
function count_limited_range_multi_count(s)
cnt1 = zeros(Int, 128)
cnt2 = zeros(Int, 128)
sv = unsafe_wrap(Array, pointer(s), sizeof(s))
@inbounds @simd for n = 1:2:length(sv)
a = sv[n]
b = sv[n+1]
cnt1[a] += 1
cnt2[b] += 1
end
cnt1 .+ cnt2
end
julia> @btime count_limited_range_simd($s);
59.156 μs (1 allocation: 1.14 KiB)
julia> @btime count_limited_range_multi_count($s);
44.826 μs (4 allocations: 3.50 KiB)
As for SIMD: with AVX512 one could use gather and scatter to read and write 16 counters at a time, but there’s a problem: If there are duplicate characters within the block, the update will not be correct (explained in more detail here).
Instead, I made an attempt using vload to load 32 characters at a time (suitable for AVX2), with the hope that updating each count would be done in only 2 instructions:
For up to 3 additions, this optimal assembly is generated, but when I unroll the loop with more additions (need 32), it starts generating nonsense assembly. The code is below if anyone knows how this can be done. It works as it is, just not as fast as it could be:
macro unrolled_add(n)
esc(Expr(:block, (k -> :( counts[v[$k]] += 1 )).(1:n)...))
end
function count_limited_range_real_simd(s)
counts = zeros(Int, 128)
sv = unsafe_wrap(Array, pointer(s), sizeof(s))
@inbounds @simd for n = 1:32:length(sv)
v = vload(Vec{32,UInt8}, sv, n)
@unrolled_add(32) # n <= 3 generates ideal assembly
end
counts
end
Nice! I would be even more impressed if Char and FastHashChar could be made to use the same memory (like union in C), so that we wouldn’t have to rebuild the dictionary at the end. For an alphabet this small, it’s cheap of course, but I think that could be useful in other situations.
Agreed… On the other hand most things I learned about Julia came from reading and answering questions like this on this forum. So if not for OP, it is at least worth it to me