Interesting post about SIMD dot product (and cosine similarity)

Ah, I did spot one error on my end! I hadn't enabled optimisations, and they help quite a bit even though the library uses intrinsics. I rebuilt the shared library with -O2/-O3 added to the compiler invocation (the choice between the two doesn't change much in this case).
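For reference, the rebuild was along these lines, run from the Julia REPL for convenience; the C source file name below is just a placeholder for whatever wraps the SimSIMD kernel, not the actual file from my setup:

julia> run(`gcc -O3 -shared -fPIC -o libsimsimd.so simsimd_dot.c`);  # -shared/-fPIC produce the .so; -O3 is the flag that matters here

With the optimised libsimsimd.so in place, I now get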

julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)

julia> using BenchmarkTools

julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  139.483 ns (0 allocations: 0 bytes)
  77.405 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  1.482 μs (0 allocations: 0 bytes)
  770.591 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  14.975 μs (0 allocations: 0 bytes)
  8.682 μs (0 allocations: 0 bytes)

which is an improvement, but still roughly 2x slower than the Julia code. I tried on a different machine with both gcc 14.1 and clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.
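On the Julia side, the native code for mydot can be inspected directly in the REPL as well (just a usage sketch; the output is machine-dependent, so I won't paste it here):

julia> @code_native debuginfo=:none mydot(x, y)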
