Interesting post about SIMD dot product (and cosine similarity)

This is part 1 of two posts showing how to SIMD the hell out of a dot product:

Part 2 (forthcoming) will be about the Modular (Mojo) implementation.

It seems the dot product is a place where Julia should aspire to hug the SOTA.

2 Likes

It's nice as a tutorial example for hand SIMD optimization, but (a) optimized BLAS libraries (like the ones Julia uses) already contain an optimized dot product, and (b) my understanding is that compilers do a pretty good job at SIMD-vectorizing BLAS-1 loops like dot products (as long as you give them permission to re-associate floating-point operations with @simd or similar).

For example, consider this simplistic implementation:

function mydot(x, y)
    s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end

It gives timings comparable to (or even slightly faster than) the optimized BLAS dot(x,y) call (not including multi-threading for very large sizes):

julia> using BenchmarkTools, LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(1);

julia> n = 1000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  84.354 ns (0 allocations: 0 bytes)
  77.861 ns (0 allocations: 0 bytes)

julia> n = 10000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  1.557 μs (0 allocations: 0 bytes)
  1.546 μs (0 allocations: 0 bytes)

julia> n = 100000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  24.785 μs (0 allocations: 0 bytes)
  23.004 μs (0 allocations: 0 bytes)

The article argues that compilers will do a worse job with mixed-precision code:

Second, it's well known that even in data-parallel code like this, compilers have a hard time vectorizing low-precision code, sometimes leaving as much as 99% of single-core performance on the table, and with mixed precision it's even worse! One needs to be explicit if they want reliable data parallelism.

but it actually seems to be doing a pretty good job for a mixed Float64–Float32 dot product (note that there is no optimized mixed-precision BLAS routine):

julia> n = 1000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  998.273 ns (0 allocations: 0 bytes)
  85.035 ns (0 allocations: 0 bytes)

julia> n = 10000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  10.051 μs (0 allocations: 0 bytes)
  1.500 μs (0 allocations: 0 bytes)

julia> n = 100000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  100.302 μs (0 allocations: 0 bytes)
  18.010 μs (0 allocations: 0 bytes)

Interestingly, the built-in dot routine is doing much worse here. It's calling a generic fallback routine that doesn't use @simd (so the compiler doesn't have permission to re-order the loop iterations).
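
(You can check the dispatch yourself with @which; for the mixed-precision arrays it should point at the generic fallback method in LinearAlgebra rather than at a BLAS wrapper. A quick sketch:)

julia> using LinearAlgebra

julia> x, y = randn(1000), randn(Float32, 1000);

julia> @which dot(x, y)   # expected to resolve to the generic fallback, not a BLAS-backed method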

(That being said, there are more complicated problems that will fail to auto-vectorize. Compiler writers have been using dot products and matrix multiplications as benchmark problems for decades.)

12 Likes

I find it disheartening that Mojo gets points for what Julia has been able to do for a long time. It's a shame that Mojo gets better promotion than Julia.

Mojo gets better promotion because Python is much more well-known than Julia.

To be fair:

  1. This is on modular.com, which is a sensible place for Mojo tutorials (in this case, starting with C++ as a baseline).
  2. It's explicitly framed as a tutorial. A simple problem like dot products is a good topic for a tutorial.
  3. I absolutely agree that some problems benefit from hand vectorization (just not dot products).

Of course, it's true that hand vectorization is possible in Julia as well, along with many other languages, but I wouldn't expect a Mojo tutorial to spend time on this. Using C++ as a baseline is pretty reasonable, too.

6 Likes

Modular seems to be having a fairly hard time as a business, so I wouldn't worry much about this as a long-term issue.

3 Likes

The Julia-lang blog used to have this kind of post a long time ago.

It would be a pity. I understand that Modular is trying to make the LLVM project easier to use for non-Clang-based languages, so it may benefit Julia.

1 Like

So, yes, the @simd macro does a good job. Two additional points:

  1. The article actually talks about cosine similarity, which can be further optimized (e.g. by fusing the norm and dot computations and using the clever inverse-sqrt trick). Maybe benchmarking the cosine similarity is the right comparison.
  2. Mixed and custom types are a pretty important part of modern calculations (and will probably only increase in popularity). So what is the benchmark with BFloat16 types, for example?

Since this is also a BLAS-1-like operation, the compiler does a pretty good job on a handwritten loop with @simd. For example, this SIMD-vectorizes:

function mycos(x, y)
    s = zero(x[begin]) * zero(y[begin])
    nx = ny = s
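    # one fused pass over the data: accumulate the dot product and both squared norms together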
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
        nx = muladd(x[i], x[i], nx)
        ny = muladd(y[i], y[i], ny)
    end
    return s / sqrt(nx * ny)
end

and is about 1.8x slower than dot(x, y) for length-1000 arrays (and almost the same speed for length-100000 arrays, where it is memory-bound). In comparison, dot(x, y) / sqrt(dot(x, x) * dot(y, y)) is about 3x slower, and dot(x, y) / (norm(x) * norm(y)) is about 8x slower; that's because the norm function is more complicated in order to guard against spurious overflow/underflow (i.e. it trades off performance for robustness).
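
A rough sketch of how such a comparison can be run (exact timings will of course depend on the machine and the array length):

using BenchmarkTools, LinearAlgebra

n = 1000; x, y = randn(n), randn(n);
@btime mycos($x, $y);
@btime dot($x, $y) / sqrt(dot($x, $x) * dot($y, $y));
@btime dot($x, $y) / (norm($x) * norm($y));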

(The cost of the sqrt computation seems irrelevant unless the arrays are tiny.) So if this is truly a performance-critical operation in some application, you could optimize it in Julia.

This is an instance of a general rule: re-using optimized library routines is good practice, but (if you know what you are doing and expend enough effort) code that is specialized for your problem can often beat more general optimized-library calls, especially a sequence of optimized library calls (e.g. by fusing loops) … at least, if you are using a language (Julia, Mojo, C++, Rust, …) in which you can efficiently implement performance-critical inner loops.

And again, I agree that there are other problems that benefit from hand-vectorization. (Which can also be done in Julia.)

I'm not using a processor with hardware-accelerated BFloat16 or Float16 at the moment, sorry.

But it's not clear what point you are trying to make here. Is there some optimization you think can be done in Mojo but not in Julia? Or are you suggesting that we should add an optimized library routine for cosine similarity (Statistics.jl has a cor function for Pearson correlation, though I'm not sure how heavily optimized it is)? Or …?
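
(As an aside, cor is not quite the same thing: the Pearson correlation is the cosine similarity of the mean-centered vectors. A quick check, just to illustrate the relationship:)

using Statistics, LinearAlgebra

x, y = randn(1000), randn(1000);
xc, yc = x .- mean(x), y .- mean(y);
cor(x, y) ≈ dot(xc, yc) / (norm(xc) * norm(yc))  # true: correlation = cosine similarity after centering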

11 Likes

I wanted to see how the SVE implementation of dot in SimSIMD, the library mentioned in the blog post, fares, so I built the following shim library:

#include "simsimd.h"
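
// Thin wrappers that expose the SimSIMD SVE dot-product kernels as plain C symbols,
// so they can be called from Julia with @ccall.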

void c_simsimd_dot_f16_sve(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f16_sve(a, b, n, result);
}

void c_simsimd_dot_f32_sve(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f32_sve(a, b, n, result);
}

void c_simsimd_dot_f64_sve(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f64_sve(a, b, n, result);
}

on neoverse-v2 (Nvidia Grace) with

$ gcc --version
gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4)
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -mcpu=neoverse-v2 -mtune=neoverse-v2 -shared -fPIC -o libsimsimd.so libsimsimd.c

With Julia v1.11.1 I get

julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)

julia> using BenchmarkTools

julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  928.000 ns (0 allocations: 0 bytes)
  77.441 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  9.056 μs (0 allocations: 0 bytes)
  759.072 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  89.953 μs (0 allocations: 0 bytes)
  8.673 μs (0 allocations: 0 bytes)

For Float16:

julia> function c_simsimd_dot_f16_sve(a::Vector{Float16}, b::Vector{Float16})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f16_sve(a::Ptr{Float16}, b::Ptr{Float16}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f16_sve (generic function with 1 method)

julia> n = 1000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  465.801 ns (0 allocations: 0 bytes)
  42.893 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  4.539 μs (0 allocations: 0 bytes)
  382.108 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  45.025 μs (0 allocations: 0 bytes)
  3.992 μs (0 allocations: 0 bytes)

I'd assume I'm doing something wrong (please do tell me if you spot something!), or perhaps my GCC 13 is somehow generating inefficient code. But they're using intrinsics in their code, so the compiler shouldn't have to do much work beyond emitting the instructions corresponding to those intrinsics; yet in practice their manually written, hardcoded SIMD code is 10x slower than a simple generic Julia function.

6 Likes

Ah, I did spot one error! I didn't enable optimisations, but they help quite a bit despite the use of intrinsics. Adding the -O2/-O3 flags to the compiler invocation (it doesn't change much between the two in this case) I now get

julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)

julia> using BenchmarkTools

julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  139.483 ns (0 allocations: 0 bytes)
  77.405 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  1.482 μs (0 allocations: 0 bytes)
  770.591 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  14.975 μs (0 allocations: 0 bytes)
  8.682 μs (0 allocations: 0 bytes)

which is an improvement, but still 2x slower than the Julia code. I tried on a different machine with both GCC 14.1 and Clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.
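
For completeness, the optimised build presumably amounts to rerunning the earlier command with the optimisation flag added, i.e. something like

$ gcc -O3 -mcpu=neoverse-v2 -mtune=neoverse-v2 -shared -fPIC -o libsimsimd.so libsimsimd.c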

10 Likes

Does this load libsimsimd.so from the filesystem every time the function is run and cause overhead? Or does some caching make the overhead negligible?

2 Likes

No, the library is dlopened only once, and if there were any latency related to loading it, that would be an additive constant, not a multiplicative factor.

1 Like

Does simsimd have smaller rounding errors than the Julia code?

Not sure what you mean (also, to measure any error I'd need a "ground truth" value to compare with, but with non-associative floating-point numbers the answer is a bit undefined).

For ground truth, you could convert inputs to Rational{BigInt}: x_exact = rationalize.(BigInt, x; tol=0), so there are no rounding errors.

julia> n = 100_000; x, y = randn(Float64, n), randn(Float64, n); @btime c_simsimd_dot_f64_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f64_sve(x, y) ≈ mydot(x, y);
  29.791 μs (0 allocations: 0 bytes)
  16.863 μs (0 allocations: 0 bytes)

julia> x_exact = rationalize.(BigInt, x; tol=0); y_exact = rationalize.(BigInt, y; tol=0); float(mydot(x_exact, y_exact)), mydot(x, y), c_simsimd_dot_f64_sve(x, y)
(-337.8510597761317967370383903222862904443775171629039591085511873542157582559819, -337.85105977613193, -337.8510597761299)

julia> abs(1 - ans[1] / ans[2]), abs(1 - ans[1] / ans[3])
(4.038454663574069458451217737384723428484604743100525511608584387305775076871613e-16, 5.653152195112320160938281258458505198150574516822128226252675770455608604295732e-15)

It doesn't look like it's more accurate. But in case you were trying to explain the factor-of-2 performance difference: in the native code I linked above there's a single fmla instruction, while in the vector body of the native code of mydot (code_native(mydot, NTuple{2,Vector{Float64}}; debuginfo=:none)) there are two:

.LBB0_8:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        lsl     x17, x15, #3
        ld1d    { z2.d }, p0/z, [x10, x15, lsl #3]
        ld1d    { z4.d }, p0/z, [x11, x15, lsl #3]
        add     x15, x15, x13
        add     x18, x10, x17
        add     x17, x11, x17
        cmp     x12, x15
        ld1d    { z3.d }, p0/z, [x18, x16, lsl #3]
        ld1d    { z5.d }, p0/z, [x17, x16, lsl #3]
        fmla    z1.d, p0/m, z2.d, z4.d
        fmla    z0.d, p0/m, z3.d, z5.d
        b.ne    .LBB0_8
// %bb.9:                               // %middle.block
        fadd    z0.d, z0.d, z1.d
        faddv   d0, p0, z0.d
        cbz     x14, .LBB0_12
.LBB0_10:                               // %L109.preheader
        lsl     x13, x12, #3
        sub     x9, x9, x12
        add     x11, x11, x13
        add     x10, x10, x13
        .p2align        5, , 16

So it'd appear the 2x difference stems from loop unrolling, which is beneficial in this case (I assume because there are 4 SIMD units on Nvidia Grace?).

2 Likes

The exact answer for this sort of thing is usually well defined: floating-point values are just rational numbers, so take these rational numbers and do the computation in 'infinite precision', and compare that to what you got from your finite-precision floating-point algorithm.

In practice, it's usually sufficient to compute the answer in twice the precision. But there are also efficient algorithms, like the one in Xsum.jl, that compute the "exactly rounded" sum or dot product, i.e. as if it were computed in infinite precision and then the final answer were rounded to Float64.
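
For example, here is a minimal sketch of the higher-precision approach, using BigFloat simply as a convenient stand-in for "compute in higher precision and compare":

using LinearAlgebra

n = 1000; x, y = randn(n), randn(n);
approx  = dot(x, y)              # ordinary Float64 dot product
refined = dot(big.(x), big.(y))  # the inputs convert exactly; only the accumulation precision changes
abs(1 - approx / refined)        # estimated relative error of the Float64 result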

Using Rational{BigInt} is also an option, but is usually horrendously expensive compared to just doing floating-point arithmetic in a higher precision. (The size of the denominator can grow exponentially quickly as the computation progresses.)

5 Likes