Ah, I did spot one error: I hadn't enabled optimisations, and they help quite a bit even when using intrinsics. Adding -O2 or -O3 to the compiler invocation (the choice between the two makes little difference in this case), I now get
julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)
julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)
julia> using BenchmarkTools
julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
139.483 ns (0 allocations: 0 bytes)
77.405 ns (0 allocations: 0 bytes)
julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
1.482 μs (0 allocations: 0 bytes)
770.591 ns (0 allocations: 0 bytes)
julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
14.975 μs (0 allocations: 0 bytes)
8.682 μs (0 allocations: 0 bytes)
which is an improvement, but still about 2x slower than the Julia code. I tried on a different machine with both gcc 14.1 and clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.