This is part 1 of two posts showing how to SIMD the hell out of a dot-product:
Part 2 (forthcoming) will be about the Modular implementation.
It seems the dot product is a place where Julia should aspire to hug the SOTA.
It's nice as a tutorial example for hand SIMD optimization, but (a) optimized BLAS libraries (like the ones Julia uses) already contain an optimized dot product, and (b) my understanding is that compilers do a pretty good job at SIMD-vectorizing BLAS-1 loops like dot products (as long as you give them permission to re-associate floating-point operations with @simd or similar).
For example, consider this simplistic implementation:
function mydot(x, y)
    s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end
It gives timings comparable to (or even slightly faster than) the optimized BLAS dot(x,y) call (not including multi-threading for very large sizes):
julia> using BenchmarkTools, LinearAlgebra
julia> LinearAlgebra.BLAS.set_num_threads(1);
julia> n = 1000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
84.354 ns (0 allocations: 0 bytes)
77.861 ns (0 allocations: 0 bytes)
julia> n = 10000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
1.557 μs (0 allocations: 0 bytes)
1.546 μs (0 allocations: 0 bytes)
julia> n = 100000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
24.785 μs (0 allocations: 0 bytes)
23.004 μs (0 allocations: 0 bytes)
The article argues that compilers will do a worse job with mixed-precision code:
Second, it's well known that even in data-parallel code like this, compilers have a hard time vectorizing low-precision code, sometimes leaving as much as 99% of single-core performance on the table, and with mixed precision it's even worse! One needs to be explicit if they want reliable data parallelism.
but it actually seems to be doing a pretty good job for a mixed Float64/Float32 dot product (note that there is no optimized mixed-precision BLAS routine):
julia> n = 1000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
998.273 ns (0 allocations: 0 bytes)
85.035 ns (0 allocations: 0 bytes)
julia> n = 10000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
10.051 μs (0 allocations: 0 bytes)
1.500 μs (0 allocations: 0 bytes)
julia> n = 100000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
100.302 μs (0 allocations: 0 bytes)
18.010 μs (0 allocations: 0 bytes)
Interestingly, the built-in dot routine does much worse here. It's calling a generic fallback routine that doesn't use @simd (so the compiler doesn't have permission to re-order the loop iterations).
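To illustrate the issue, here is a sketch of a plain generic loop (plain_dot is just an illustrative name, not LinearAlgebra's actual fallback code): without @simd, the strict left-to-right accumulation order must be preserved, so the reduction over s forms a serial dependency chain and cannot be turned into a SIMD reduction.

function plain_dot(x, y)
    s = zero(x[begin]) * zero(y[begin])
    for i in eachindex(x, y)
        s += x[i] * y[i]  # strict fp evaluation order ⇒ no vector reduction
    end
    return s
end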
(That being said, there are more complicated problems that will fail to auto-vectorize. Compilers have been using dot products and matrix multiplications as benchmark problems for decades.)
I find it disheartening that Mojo gets points for what Julia has been able to do for a long time. It's a shame that Mojo gets better promotion than Julia.
Mojo gets better promotion because Python is much more well-known than Julia.
To be fair:
Of course, it's true that hand vectorization is possible in Julia as well, along with many other languages, but I wouldn't expect a Mojo tutorial to spend time on this. Using C++ as a baseline is pretty reasonable, too.
Modular seems to be having a fairly hard time as a business, so I wouldn't worry much about this as a long-term issue.
The JuliaLang blog used to have this kind of post a long time ago.
It would be a pity. I understand that Modular is trying to make the LLVM project easier to use for non-Clang-based languages, so it may benefit Julia.
So, yes, the @simd does a good job. Two additional points:

- Explicit vectorization may still matter for number types with less mature compiler and hardware support, bfloat16 types, for example.
- The benchmark in the post is really a cosine similarity rather than a bare dot product. Since this is also a BLAS-1-like operation, the compiler does a pretty good job on a handwritten loop with @simd, e.g. this SIMD-vectorizes:
function mycos(x, y)
    s = zero(x[begin]) * zero(y[begin])
    nx = ny = s
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
        nx = muladd(x[i], x[i], nx)
        ny = muladd(y[i], y[i], ny)
    end
    return s / sqrt(nx * ny)
end
and is about 1.8x slower than dot(x, y) for length-1000 arrays (and almost the same speed for length-100000 arrays, where it is memory-bound). In comparison, dot(x, y) / sqrt(dot(x, x) * dot(y, y)) is about 3x slower, and dot(x, y) / (norm(x) * norm(y)) is about 8x slower; that's because the norm function is more complicated in order to guard against spurious overflow/underflow (i.e. it trades off performance for robustness). (The cost of the sqrt computation seems irrelevant unless the arrays are tiny.) So if this is truly a performance-critical operation in some application, you could optimize it in Julia.
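For concreteness, these are the library-call compositions being compared (cos_dots and cos_norms are just illustrative names); each call makes a separate pass over the arrays, which is why they lose to the fused single-pass mycos loop above:

using LinearAlgebra

cos_dots(x, y)  = dot(x, y) / sqrt(dot(x, x) * dot(y, y))  # ~3x the time of dot(x, y) at n = 1000 (vs ~1.8x for mycos)
cos_norms(x, y) = dot(x, y) / (norm(x) * norm(y))          # ~8x: norm guards against spurious overflow/underflow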
This is an instance of a general rule: re-using optimized library routines is good practice, but (if you know what you are doing and expend enough effort) code that is specialized for your problem can often beat more general optimized-library calls, especially a sequence of optimized library calls (e.g. by fusing loops) … at least, if you are using a language (Julia, Mojo, C++, Rust, …) in which you can efficiently implement performance-critical inner loops.
And again, I agree that there are other problems that benefit from hand-vectorization. (Which can also be done in Julia.)
I'm not using a processor with hardware-accelerated BFloat16 or Float16 at the moment, sorry.
But it's not clear what point you are trying to make here? Is there some optimization you think can be done in Mojo but not in Julia? Or are you suggesting that we should add an optimized library routine for cosine similarity (Statistics.jl has a cor function for Pearson correlation, though I'm not sure how heavily optimized it is)? Or …?
I wanted to see how the SVE implementation of dot in SimSIMD, the library mentioned in the blog post, fares, so I built the following library
#include "simsimd.h"
void c_simsimd_dot_f16_sve(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f16_sve(a, b, n, result);
}

void c_simsimd_dot_f32_sve(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f32_sve(a, b, n, result);
}

void c_simsimd_dot_f64_sve(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f64_sve(a, b, n, result);
}
on neoverse-v2 (Nvidia Grace) with
$ gcc --version
gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4)
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -mcpu=neoverse-v2 -mtune=neoverse-v2 -shared -fPIC -o libsimsimd.so libsimsimd.c
With Julia v1.11.1 I get
julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)
julia> using BenchmarkTools
julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
928.000 ns (0 allocations: 0 bytes)
77.441 ns (0 allocations: 0 bytes)
julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
9.056 μs (0 allocations: 0 bytes)
759.072 ns (0 allocations: 0 bytes)
julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
89.953 μs (0 allocations: 0 bytes)
8.673 μs (0 allocations: 0 bytes)
For Float16:
julia> function c_simsimd_dot_f16_sve(a::Vector{Float16}, b::Vector{Float16})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f16_sve(a::Ptr{Float16}, b::Ptr{Float16}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f16_sve (generic function with 1 method)
julia> n = 1000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
465.801 ns (0 allocations: 0 bytes)
42.893 ns (0 allocations: 0 bytes)
julia> n = 10_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
4.539 μs (0 allocations: 0 bytes)
382.108 ns (0 allocations: 0 bytes)
julia> n = 100_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
45.025 μs (0 allocations: 0 bytes)
3.992 μs (0 allocations: 0 bytes)
I'd assume I'm doing something wrong (please do tell me if you spot something wrong!), or perhaps my GCC 13 is somehow generating inefficient code, but they're using intrinsics in their code, so the compiler shouldn't have to do much work besides emitting the instructions corresponding to those intrinsics. Yet in practice their manually written, hardcoded SIMD code is 10x slower than a simple generic Julia function.
Ah, I did spot one error! I didn't enable optimisations, and they help quite a bit even though the code uses intrinsics. Adding the -O2/-O3 flag to the compiler invocation makes a big difference (it doesn't matter much which of the two in this case).
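For reference, the rebuilt invocation would presumably just be the earlier command with the optimisation flag added:

$ gcc -O3 -mcpu=neoverse-v2 -mtune=neoverse-v2 -shared -fPIC -o libsimsimd.so libsimsimd.c

With the recompiled library I now get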
julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)
julia> using BenchmarkTools
julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
139.483 ns (0 allocations: 0 bytes)
77.405 ns (0 allocations: 0 bytes)
julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
1.482 μs (0 allocations: 0 bytes)
770.591 ns (0 allocations: 0 bytes)
julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
14.975 μs (0 allocations: 0 bytes)
8.682 μs (0 allocations: 0 bytes)
which is an improvement, but still 2x slower than the Julia code. I tried on a different machine with both GCC 14.1 and Clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.
Does this load libsimsimd.so from the filesystem every time the function is run and cause an overhead? Or does some caching make the overhead negligible?
No, a library is dlopened once, and if there was any latency related to loading the library it would be an additive constant, not a multiplicative one.
Does simsimd have smaller rounding errors than the Julia code?
Not sure what you mean (also, to measure any error I'd need a "ground truth" value to compare with, but with non-associative floating-point numbers the answer is a bit ill-defined).
For ground truth, you could convert the inputs to Rational{BigInt}: x_exact = rationalize.(BigInt, x; tol=0), so there are no rounding errors.
julia> n = 100_000; x, y = randn(Float64, n), randn(Float64, n); @btime c_simsimd_dot_f64_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f64_sve(x, y) ≈ mydot(x, y);
29.791 μs (0 allocations: 0 bytes)
16.863 μs (0 allocations: 0 bytes)
julia> x_exact = rationalize.(BigInt, x; tol=0); y_exact = rationalize.(BigInt, y; tol=0); float(mydot(x_exact, y_exact)), mydot(x, y), c_simsimd_dot_f64_sve(x, y)
(-337.8510597761317967370383903222862904443775171629039591085511873542157582559819, -337.85105977613193, -337.8510597761299)
julia> abs(1 - ans[1] / ans[2]), abs(1 - ans[1] / ans[3])
(4.038454663574069458451217737384723428484604743100525511608584387305775076871613e-16, 5.653152195112320160938281258458505198150574516822128226252675770455608604295732e-15)
It doesn't look like it's more accurate. But in case you were trying to explain the factor-of-2 performance difference: in the native code I linked above there's a single fmla instruction, while in the vector body of the native code of mydot (code_native(mydot, NTuple{2,Vector{Float64}}; debuginfo=:none)) there are two:
.LBB0_8:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        lsl     x17, x15, #3
        ld1d    { z2.d }, p0/z, [x10, x15, lsl #3]
        ld1d    { z4.d }, p0/z, [x11, x15, lsl #3]
        add     x15, x15, x13
        add     x18, x10, x17
        add     x17, x11, x17
        cmp     x12, x15
        ld1d    { z3.d }, p0/z, [x18, x16, lsl #3]
        ld1d    { z5.d }, p0/z, [x17, x16, lsl #3]
        fmla    z1.d, p0/m, z2.d, z4.d
        fmla    z0.d, p0/m, z3.d, z5.d
        b.ne    .LBB0_8
// %bb.9:                               // %middle.block
        fadd    z0.d, z0.d, z1.d
        faddv   d0, p0, z0.d
        cbz     x14, .LBB0_12
.LBB0_10:                               // %L109.preheader
        lsl     x13, x12, #3
        sub     x9, x9, x12
        add     x11, x11, x13
        add     x10, x10, x13
        .p2align        5, , 16
So it'd appear the 2x difference stems from loop unrolling, which is beneficial in this case (I assume because there are 4 SIMD units on Nvidia Grace?).
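As a rough scalar sketch of the idea (this is not what LLVM actually emits, which unrolls at the level of whole SVE registers, and mydot_2acc is just an illustrative name): splitting the reduction across two independent accumulators breaks the single serial FMA dependency chain, so the latency of one chain is hidden behind the other.

function mydot_2acc(x::Vector, y::Vector)
    s1 = s2 = zero(x[begin]) * zero(y[begin])
    n = min(length(x), length(y))
    i = 1
    @inbounds while i + 1 <= n
        s1 = muladd(x[i],     y[i],     s1)  # chain 1
        s2 = muladd(x[i + 1], y[i + 1], s2)  # chain 2, independent of chain 1
        i += 2
    end
    @inbounds if i <= n                      # odd-length tail
        s1 = muladd(x[i], y[i], s1)
    end
    return s1 + s2
end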
The exact answer for this sort of thing is usually well defined: floating-point values are just rational numbers, so take these rational numbers, do the computation in "infinite precision", and compare that to what you got from your finite-precision floating-point algorithm.
In practice, it's usually sufficient to compute the answer in twice the precision. (But there are also efficient algorithms, like Xsum.jl, that compute the "exactly rounded" sum or dot product, i.e. as if it were computed in infinite precision and then the final answer were rounded to Float64.)
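A minimal sketch of such a check, reusing the mydot defined above with BigFloat as the higher-precision reference (BigFloat's default 256 bits is far more than twice Float64's precision, so the rounded-back value stands in for the exact answer here):

x, y = randn(1000), randn(1000)
ref = Float64(mydot(big.(x), big.(y)))  # accumulate in BigFloat, round once at the end
rel_err = abs(1 - mydot(x, y) / ref)    # forward relative error of the Float64 loop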
Using Rational{BigInt} is also an option, but it is usually horrendously expensive compared to just doing floating-point arithmetic in a higher precision. (The size of the denominator can grow exponentially quickly as the computation progresses.)