A while ago I mentioned a new hashing library I have been working on:

After a huge amount of effort, I’ve just managed to make the “simple” single-threaded version (k12_singlethreaded) zero-alloc, and it’s now only a factor of 2 off the C performance (I’m seeing ~1.5 GB/s; I should be seeing 3–4 GB/s).

One feature that should help make up the gap is SIMD. I’ve been trying to use SIMD.jl to accelerate the parallel branches of the hashing; however, despite my best efforts so far, the SIMD timings are:

We can make the SIMD version faster (new time: ~0.6 s) by adding @inline to the rol64 inner function within keccak_p1600; however, the results then become non-deterministic.

I’d hugely appreciate any help getting this closer to the performance of the C/Rust implementations floating around, and getting SIMD working.

One thing SIMD.jl does differently from libraries that emit native assembly is that SIMD.jl generates target-independent LLVM IR and relies on LLVM to do the optimization and select the actual intrinsics that will run on the native machine. This means that code written with SIMD.jl is portable, but if LLVM does not do a great job of the assembly generation, it will end up slower than hand-picked machine-specific intrinsics.
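To illustrate that trade-off with a small sketch of my own (a hypothetical theta-style kernel, not the package’s actual code): the same bitwise reduction can be written once in plain, target-independent Julia, and whether it becomes SIMD instructions is entirely up to LLVM’s lowering for the host CPU:

```julia
# Hypothetical theta-style column parity over a 5x5 UInt64 Keccak state,
# written as plain target-independent Julia. LLVM may or may not emit
# vector instructions for this depending on the host target.
column_parity(state::NTuple{25,UInt64}) =
    ntuple(x -> state[x] ⊻ state[x+5] ⊻ state[x+10] ⊻ state[x+15] ⊻ state[x+20], 5)

state = ntuple(UInt64, 25)  # (1, 2, ..., 25)
column_parity(state)
```

Inspecting `@code_native column_parity(state)` on different machines shows whichever lowering LLVM picked for each target.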

What gets me is that this really feels like a situation where SIMD should be an easy win. The core permutation function that is SIMD’d is

Where the operations applied to the SIMD.jl Vec are UInt64 bit shifts, nots, ors, and xors. Adding @inline to the rol64 function improves the performance of the SIMD method a bit, but also makes the results non-deterministic!
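For reference, the usual Keccak rotation looks something like this (my sketch, not necessarily the package’s exact rol64). Note the edge case: at n == 0 the complementary shift count hits the full 64-bit width, which Julia defines but the raw LLVM shift instructions treat as poison — the kind of edge that could plausibly surface as non-determinism once inlining lets the vectorized shifts be lowered directly:

```julia
# Typical Keccak 64-bit left rotation (hypothetical sketch). Julia
# defines a >> 64 as zero, so this is well-defined in plain Julia even
# at n == 0; a shift count equal to the bit width is poison at the raw
# LLVM level, however.
rol64(a::UInt64, n::Integer) = (a << n) | (a >> (64 - n))

# Mask-safe variant; Base.bitrotate does the same masking internally.
rol64_masked(a::UInt64, n::Integer) = (a << (n & 63)) | (a >> (-n & 63))
```

If the masked form (or `Base.bitrotate`) fixes the non-determinism under @inline, that would point squarely at the shift-count semantics.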

This leaves me at quite a loss, which is a pity because I suspect this is one of the major factors stopping us from nearing the performance of C/Rust implementations.

help?> 1 >> 64
>>(x, n)
  Right bit shift operator, x >> n. For n >= 0, the result is x shifted right by n bits, filling with 0s if x >= 0, 1s if x < 0, preserving the sign of x. This is equivalent to fld(x, 2^n). For n < 0, this is equivalent to x << -n.

Though this may be a case of hardware semantics not exactly lining up with Julia semantics. If that’s the case, an MWE would make a good bug report.
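The Julia side of those semantics is easy to check (my sketch; the hardware comparison would need the SIMD path itself):

```julia
# Julia's shift operators are defined for any count: shifting a UInt64
# by >= 64 bits yields zero, whereas x86's SHR takes the count mod 64
# and LLVM's raw lshr treats a count >= the bit width as poison.
@assert UInt64(1) >> 64 == 0
@assert UInt64(1) >> 65 == 0

# Base.bitrotate masks the rotation count, so the edge case disappears:
@assert bitrotate(UInt64(1), 64) == UInt64(1)
```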

I would profile to find the hotspots; presumably it’s keccak_p1600. Then compare the LLVM IR/assembly to see if there’s a difference there. perf could be useful for doing both steps at once.
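A sketch of that comparison workflow, using a stand-in kernel rather than the package’s actual keccak_p1600:

```julia
using InteractiveUtils  # code_llvm / code_native

# Stand-in hot function; in the real workflow this would be
# keccak_p1600 with its actual argument types.
kernel(a::UInt64, b::UInt64) = (a ⊻ b) << 1

# Capture the IR and native code as strings so they can be diffed
# against what clang/rustc emit for the reference implementations.
ir  = sprint(io -> code_llvm(io, kernel, Tuple{UInt64,UInt64}))
asm = sprint(io -> code_native(io, kernel, Tuple{UInt64,UInt64}))
println(ir)
```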

Yeah, that’s not really a good option here. I need to xor with the UInt64-based state, which means I need UInt64s. It would be possible to instead reinterpret the NTuple{25, UInt64} state as whatever the input vector is (say, UInt8s), but I’m pretty sure that would reduce performance in another way.

I think I had performance problems with reinterpreted arrays before. At the time I “solved” it with some pointer hackery, although I’m not sure that’s appropriate for your issue.

In the profiles you can see that a lot of time is spent on indexing.
This is @profview KangarooTwelve.turboshake(UInt128, (longmsg_u64, longmsg_u64, longmsg_u64, longmsg_u64))

I just played around with this a bit more, and it seems like with this rather hacky reinterpreter that just reads the bytes through a converted pointer access, it’s as fast as the native UInt64 version for me:

# Wraps a byte vector and presents it as a vector of UInt64s,
# reading each element through a converted pointer.
struct UnsafeReinterpretedVector <: AbstractArray{UInt64,1}
    v::Vector{UInt8}
    len::Int
    function UnsafeReinterpretedVector(v::Vector{UInt8})
        length(v) % 8 == 0 || error("Incompatible length")
        new(v, length(v) ÷ 8)
    end
end

Base.size(u::UnsafeReinterpretedVector) = (u.len,)

function Base.getindex(u::UnsafeReinterpretedVector, i::Int)
    0 < i <= u.len || throw(BoundsError(u, i))
    v = u.v
    # Keep the backing vector rooted while loading through a raw pointer
    GC.@preserve v begin
        p64 = Ptr{UInt64}(pointer(v))
        unsafe_load(p64, i)
    end
end
longmsg_unsafe_u64 = UnsafeReinterpretedVector(longmsg_u8);
@time KangarooTwelve.turboshake(UInt128, (longmsg_unsafe_u64, longmsg_unsafe_u64, longmsg_unsafe_u64, longmsg_unsafe_u64))
1.905471 seconds (3 allocations: 976 bytes)
(0xcc093729c8f34a590fb7e34103068fbb, 0xcc093729c8f34a590fb7e34103068fbb, 0xcc093729c8f34a590fb7e34103068fbb, 0xcc093729c8f34a590fb7e34103068fbb)
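One sanity check worth bolting on (my own addition): the pointer-based loads should agree element-for-element with Base’s reinterpret for the same bytes:

```julia
# Sanity check: loading UInt64s through a converted pointer should
# agree element-for-element with Base's reinterpret on the same bytes.
bytes = rand(UInt8, 64)
via_reinterpret = collect(reinterpret(UInt64, bytes))
via_pointer = GC.@preserve bytes begin
    p64 = Ptr{UInt64}(pointer(bytes))
    [unsafe_load(p64, i) for i in 1:8]
end
@assert via_reinterpret == via_pointer
```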