Interesting post about SIMD dot product (and cosine similarity)

This is part 1 of two posts showing how to SIMD the hell out of a dot product:

Part 2 (forthcoming) will be about the Modular implementation.

It seems the dot product is a place where Julia should aspire to hug the SOTA (state of the art).

2 Likes

It's nice as a tutorial example for hand SIMD optimization, but (a) optimized BLAS libraries (like the ones Julia uses) already contain an optimized dot product and (b) my understanding is that compilers do a pretty good job at SIMD-vectorizing BLAS-1 loops like dot products (as long as you give them permission to re-associate fp operations with @simd or similar).

For example, consider this simplistic implementation:

function mydot(x, y)
    s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end

It gives timings comparable to (or even slightly faster than) the optimized BLAS dot(x,y) call (not including multi-threading for very large sizes):

julia> using BenchmarkTools, LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(1);

julia> n = 1000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  84.354 ns (0 allocations: 0 bytes)
  77.861 ns (0 allocations: 0 bytes)

julia> n = 10000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  1.557 μs (0 allocations: 0 bytes)
  1.546 μs (0 allocations: 0 bytes)

julia> n = 100000; x, y = randn(n), randn(n); @btime dot($x, $y); @btime mydot($x, $y);
  24.785 μs (0 allocations: 0 bytes)
  23.004 μs (0 allocations: 0 bytes)

The article argues that compilers will do a worse job with mixed-precision code:

Second, it's well known that even in data-parallel code like this, compilers have a hard time vectorizing low-precision code, sometimes leaving as much as 99% of single-core performance on the table, and with mixed precision it's even worse! One needs to be explicit if they want reliable data parallelism.

but it actually seems to be doing a pretty good job for a mixed Float64–Float32 dot product (note that there is no optimized mixed-precision BLAS routine):

julia> n = 1000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  998.273 ns (0 allocations: 0 bytes)
  85.035 ns (0 allocations: 0 bytes)

julia> n = 10000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  10.051 μs (0 allocations: 0 bytes)
  1.500 μs (0 allocations: 0 bytes)

julia> n = 100000; x, y = randn(n), randn(Float32, n); @btime dot($x, $y); @btime mydot($x, $y);
  100.302 μs (0 allocations: 0 bytes)
  18.010 μs (0 allocations: 0 bytes)

Interestingly, the built-in dot routine is doing much worse here. It's calling a generic fallback routine that doesn't use @simd (so the compiler doesn't have permission to re-order the loop iterations).
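
For reference, that fallback behaves roughly like the following plain loop (a paraphrase for illustration, not the actual LinearAlgebra source); without @simd the compiler must preserve the exact left-to-right accumulation order, so it cannot use SIMD reductions:

# Rough paraphrase of a generic fallback (illustration only):
function generic_dot_sketch(x, y)
    s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays, as above)
    for i in eachindex(x, y)            # plain loop: no @simd annotation
        s += x[i] * y[i]                # strict left-to-right accumulation
    end
    return s
end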

(That being said, there are more complicated problems that will fail to auto-vectorize. Compilers have been using dot products and matrix multiplications as benchmark problems for decades.)
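
As a concrete (hypothetical) illustration of such a problem, a loop-carried recurrence like Horner's rule depends on the previous iteration's result, so compilers generally cannot auto-vectorize it, with or without permission to re-associate:

# Hypothetical example (not from the article): Horner evaluation of a polynomial
# with coeffs[1] as the constant term. Each iteration needs the previous value of
# s, a sequential dependence that simple re-association does not remove.
function horner(x, coeffs)
    s = zero(x) * zero(coeffs[end])
    for c in Iterators.reverse(coeffs)
        s = muladd(s, x, c)  # loop-carried dependence on s
    end
    return s
end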

12 Likes

I find it disheartening that Mojo gets points for what Julia has been able to do for a long time. It's a shame that Mojo gets better promotion than Julia.

Mojo gets better promotion because Python is much better known than Julia.

To be fair:

  1. This is on modular.com, which is a sensible place for Mojo tutorials (in this case, starting with C++ as a baseline).
  2. It's explicitly framed as a tutorial. A simple problem like dot products is a good topic for a tutorial.
  3. I absolutely agree that some problems benefit from hand vectorization (just not dot products).

Of course, it's true that hand vectorization is possible in Julia as well, along with many other languages, but I wouldn't expect a Mojo tutorial to spend time on this. Using C++ as a baseline is pretty reasonable, too.
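
For instance, here is a rough sketch of what explicit hand vectorization can look like in Julia with the SIMD.jl package (an illustrative, untuned sketch assuming 4 lanes; dot_handsimd is just a name for this example):

# Sketch of explicit SIMD in Julia via SIMD.jl: process 4 lanes per iteration
# with a vector accumulator, then finish the remainder with a scalar loop.
using SIMD

function dot_handsimd(x::Vector{Float64}, y::Vector{Float64})
    N = 4                                 # lane count; tune for the target CPU
    acc = Vec{N,Float64}(0.0)             # vector accumulator, all lanes zero
    i = 1
    @inbounds while i + N - 1 <= length(x)
        acc = muladd(vload(Vec{N,Float64}, x, i),
                     vload(Vec{N,Float64}, y, i), acc)
        i += N
    end
    s = sum(acc)                          # horizontal reduction over the lanes
    @inbounds while i <= length(x)        # scalar tail for leftover elements
        s = muladd(x[i], y[i], s)
        i += 1
    end
    return s
end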

6 Likes

Modular seems to be having a fairly hard time as a business, so I wouldn't worry much about this as a long-term issue.

3 Likes

The Julia language blog used to have this kind of post a long time ago.

It would be a pity. I understand that Modular is trying to make the LLVM project easier to use for non-Clang-based languages, so it may benefit Julia.

1 Like

So, yes, @simd does a good job. Two additional points:

  1. The article actually talks about cosine similarity, which can be further optimized (e.g. by fusing the norm and dot computations, and using the clever inverse sqrt). Maybe benchmarking the cosine similarity is the right comparison.
  2. Mixed and custom types are a pretty important part of modern computations (and will probably only grow in popularity). So what does the benchmark look like with bfloat16 types, for example? (A possible setup is sketched right after this list.)
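
For illustration only, here is a sketch of how such a benchmark could be set up with the BFloat16s.jl package (no timings claimed; they will depend on hardware and compiler support for bfloat16):

# Sketch of a bfloat16 benchmark setup (illustrative only):
using BFloat16s, BenchmarkTools

n = 10_000
x = BFloat16.(randn(Float32, n))  # convert from Float32 to BFloat16
y = BFloat16.(randn(Float32, n))
@btime mydot($x, $y)              # mydot as defined earlier in the thread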

Since this is also a BLAS-1-like operation, the compiler does a pretty good job on a handwritten loop with @simd; e.g. this SIMD-vectorizes:

function mycos(x, y)
    s = zero(x[begin]) * zero(y[begin])
    nx = ny = s
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)    # accumulates dot(x, y)
        nx = muladd(x[i], x[i], nx)  # accumulates sum(abs2, x)
        ny = muladd(y[i], y[i], ny)  # accumulates sum(abs2, y)
    end
    return s / sqrt(nx * ny)
end

and is about 1.8x slower than dot(x, y) for length-1000 arrays (and almost the same speed for length-100000 arrays, where it is memory-bound). In comparison, dot(x, y) / sqrt(dot(x, x) * dot(y, y)) is about 3x slower, and dot(x, y) / (norm(x) * norm(y)) is about 8x slower; that's because the norm function is more complicated in order to guard against spurious overflow/underflow (i.e. it trades off performance for robustness).
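
For reference, a sketch of how these variants can be compared (illustrative setup; mycos is the fused loop above, and the helper names cos_dots and cos_norms are just for this sketch):

# Sketch: benchmark the fused loop against the composed library-call versions.
using BenchmarkTools, LinearAlgebra

cos_dots(x, y)  = dot(x, y) / sqrt(dot(x, x) * dot(y, y))
cos_norms(x, y) = dot(x, y) / (norm(x) * norm(y))

n = 1000; x, y = randn(n), randn(n)
@btime dot($x, $y);        # baseline
@btime mycos($x, $y);
@btime cos_dots($x, $y);
@btime cos_norms($x, $y);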

(The cost of the sqrt computation seems irrelevant unless the arrays are tiny.) So if this is truly a performance-critical operation in some application, you could optimize it in Julia.

This is an instance of a general rule: re-using optimized library routines is good practice, but (if you know what you are doing and expend enough effort) code that is specialized for your problem can often beat more general optimized-library calls, especially a sequence of optimized library calls (e.g. by fusing loops) ... at least, if you are using a language (Julia, Mojo, C++, Rust, ...) in which you can efficiently implement performance-critical inner loops.

And again, I agree that there are other problems that benefit from hand vectorization (which can also be done in Julia).

I'm not using a processor with hardware-accelerated BFloat16 or Float16 at the moment, sorry.

But it's not clear what point you are trying to make here. Is there some optimization you think can be done in Mojo but not in Julia? Or are you suggesting that we should add an optimized library routine for cosine similarity (Statistics.jl has a cor function for Pearson correlation, though I'm not sure how heavily optimized it is)? Or ...?

11 Likes

I wanted to see how the SVE implementation of dot in SimSIMD, the library mentioned in the blog post, fares, so I built the following shared library

#include "simsimd.h"

void c_simsimd_dot_f16_sve(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f16_sve(a, b, n, result);
}

void c_simsimd_dot_f32_sve(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f32_sve(a, b, n, result);
}

void c_simsimd_dot_f64_sve(simsimd_f64_t const* a, simsimd_f64_t const* b, simsimd_size_t n, simsimd_distance_t* result) {
    simsimd_dot_f64_sve(a, b, n, result);
}

on neoverse-v2 (Nvidia Grace) with

$ gcc --version
gcc (GCC) 13.1.1 20230614 (Red Hat 13.1.1-4)
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -mcpu=neoverse-v2 -mtune=neoverse-v2 -shared -fPIC -o libsimsimd.so libsimsimd.c

With Julia v1.11.1 I get

julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)

julia> using BenchmarkTools

julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  928.000 ns (0 allocations: 0 bytes)
  77.441 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  9.056 μs (0 allocations: 0 bytes)
  759.072 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  89.953 μs (0 allocations: 0 bytes)
  8.673 μs (0 allocations: 0 bytes)

For Float16:

julia> function c_simsimd_dot_f16_sve(a::Vector{Float16}, b::Vector{Float16})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f16_sve(a::Ptr{Float16}, b::Ptr{Float16}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f16_sve (generic function with 1 method)

julia> n = 1000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  465.801 ns (0 allocations: 0 bytes)
  42.893 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  4.539 μs (0 allocations: 0 bytes)
  382.108 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float16, n), randn(Float16, n); @btime c_simsimd_dot_f16_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f16_sve(x, y) ≈ mydot(x, y);
  45.025 μs (0 allocations: 0 bytes)
  3.992 μs (0 allocations: 0 bytes)

I'd assume I'm doing something wrong (please do tell me if you spot anything!), or perhaps my GCC 13 is somehow generating inefficient code. But they're using intrinsics in their code, so the compiler shouldn't have much work to do besides emitting the instructions corresponding to those intrinsics; yet in practice their manually written, hardcoded SIMD code is 10x slower than a simple generic Julia function.

6 Likes

Ah, I did spot one error! I didn't enable optimisations, but they help quite a bit even when using intrinsics. Adding the -O2/-O3 flags to the compiler invocation (it doesn't make much difference between the two in this case), I now get

julia> function mydot(x, y)
           s = zero(x[begin]) * zero(y[begin]) # (punt on empty arrays)
           @simd for i in eachindex(x, y)
               s = muladd(x[i], y[i], s)
           end
           return s
       end
mydot (generic function with 1 method)

julia> function c_simsimd_dot_f32_sve(a::Vector{Float32}, b::Vector{Float32})
           result = Ref{Float64}()
           @ccall "./libsimsimd.so".c_simsimd_dot_f32_sve(a::Ptr{Float32}, b::Ptr{Float32}, length(a)::UInt64, result::Ref{Float64})::Cvoid
           return result[]
       end
c_simsimd_dot_f32_sve (generic function with 1 method)

julia> using BenchmarkTools

julia> n = 1000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  139.483 ns (0 allocations: 0 bytes)
  77.405 ns (0 allocations: 0 bytes)

julia> n = 10_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  1.482 μs (0 allocations: 0 bytes)
  770.591 ns (0 allocations: 0 bytes)

julia> n = 100_000; x, y = randn(Float32, n), randn(Float32, n); @btime c_simsimd_dot_f32_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f32_sve(x, y) ≈ mydot(x, y);
  14.975 μs (0 allocations: 0 bytes)
  8.682 μs (0 allocations: 0 bytes)

which is an improvement, but still 2x slower than the Julia code. I tried on a different machine with both gcc 14.1 and clang 17.0.6 and got pretty much the same timings. For the curious, you can see the native code on Godbolt.

10 Likes

Does this load libsimsimd.so from the filesystem every time the function is run, causing overhead? Or does some caching make the overhead negligible?

2 Likes

No, the library is dlopened only once, and even if there were any latency related to loading it, that would be an additive constant, not a multiplicative one.

1 Like

Does SimSIMD have smaller rounding errors than the Julia code?

Not sure what you mean (also, to measure any error I'd need a "ground truth" value to compare with, but with non-associative floating-point arithmetic the answer is a bit ill-defined).

For ground truth, you could convert the inputs to Rational{BigInt}: x_exact = rationalize.(BigInt, x; tol=0), so there are no rounding errors.

julia> n = 100_000; x, y = randn(Float64, n), randn(Float64, n); @btime c_simsimd_dot_f64_sve($x, $y); @btime mydot($x, $y); @assert c_simsimd_dot_f64_sve(x, y) ≈ mydot(x, y);
  29.791 μs (0 allocations: 0 bytes)
  16.863 μs (0 allocations: 0 bytes)

julia> x_exact = rationalize.(BigInt, x; tol=0); y_exact = rationalize.(BigInt, y; tol=0); float(mydot(x_exact, y_exact)), mydot(x, y), c_simsimd_dot_f64_sve(x, y)
(-337.8510597761317967370383903222862904443775171629039591085511873542157582559819, -337.85105977613193, -337.8510597761299)

julia> abs(1 - ans[1] / ans[2]), abs(1 - ans[1] / ans[3])
(4.038454663574069458451217737384723428484604743100525511608584387305775076871613e-16, 5.653152195112320160938281258458505198150574516822128226252675770455608604295732e-15)

It doesn't look like it's more accurate. But in case you were trying to explain the factor-of-2 performance difference: in the native code I linked above there's a single fmla instruction, while in the vector body of the native code of mydot (code_native(mydot, NTuple{2,Vector{Float64}}; debuginfo=:none)) there are two:

.LBB0_8:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        lsl     x17, x15, #3
        ld1d    { z2.d }, p0/z, [x10, x15, lsl #3]
        ld1d    { z4.d }, p0/z, [x11, x15, lsl #3]
        add     x15, x15, x13
        add     x18, x10, x17
        add     x17, x11, x17
        cmp     x12, x15
        ld1d    { z3.d }, p0/z, [x18, x16, lsl #3]
        ld1d    { z5.d }, p0/z, [x17, x16, lsl #3]
        fmla    z1.d, p0/m, z2.d, z4.d
        fmla    z0.d, p0/m, z3.d, z5.d
        b.ne    .LBB0_8
// %bb.9:                               // %middle.block
        fadd    z0.d, z0.d, z1.d
        faddv   d0, p0, z0.d
        cbz     x14, .LBB0_12
.LBB0_10:                               // %L109.preheader
        lsl     x13, x12, #3
        sub     x9, x9, x12
        add     x11, x11, x13
        add     x10, x10, x13
        .p2align        5, , 16

So it'd appear the 2x difference stems from loop unrolling, which is beneficial in this case (I assume because there are 4 SIMD units on Nvidia Grace?).

2 Likes

The exact answer for this sort of thing is usually well defined: floating-point values are just rational numbers, so take these rational numbers, do the computation in "infinite precision", and compare that to what you got from your finite-precision floating-point algorithm.

In practice, it's usually sufficient to compute the answer in twice the precision. But there are also efficient algorithms like Xsum.jl that compute the "exactly rounded" sum or dot product, i.e. as if it were computed in infinite precision and then the final answer were rounded to Float64.

Using Rational{BigInt} is also an option, but is usually horrendously expensive compared to just doing floating-point arithmetic in a higher precision. (The size of the denominator can grow exponentially quickly as the computation progresses.)
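
For example, a higher-precision reference can be computed by widening to BigFloat (a quick sketch; refdot is just an illustrative helper, and mydot is the loop from earlier in the thread):

# Sketch: widen the inputs to BigFloat (each Float64 converts exactly), compute
# the dot product at the 256-bit default precision, and compare to the Float64 loop.
refdot(x, y) = sum(big.(x) .* big.(y))

n = 1000; x, y = randn(n), randn(n)
ref = refdot(x, y)
Float64(abs(1 - mydot(x, y) / ref))  # relative error of the Float64 result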

4 Likes