Ah, I did spot one error: I hadn't enabled optimisations, and they help quite a bit even when using intrinsics. Adding -O2 or -O3 to the compiler invocation (the choice between the two makes little difference in this case), I now get
julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)
julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)
julia> using BenchmarkTools
julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
139.483 ns (0 allocations: 0 bytes)
77.405 ns (0 allocations: 0 bytes)
julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
1.482 μs (0 allocations: 0 bytes)
770.591 ns (0 allocations: 0 bytes)
julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
14.975 μs (0 allocations: 0 bytes)
8.682 μs (0 allocations: 0 bytes)
which is an improvement, but still about 2x slower than the Julia code. I tried on a different machine with both gcc 14.1 and clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.