Since this is also a BLAS-1-like operation, the compiler does a pretty good job on a handwritten loop with `@simd`. e.g. this SIMD-vectorizes:
```julia
function mycos(x, y)
    s = zero(x[begin]) * zero(y[begin])
    nx = ny = s
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
        nx = muladd(x[i], x[i], nx)
        ny = muladd(y[i], y[i], ny)
    end
    return s / sqrt(nx * ny)
end
```
and is about 1.8x slower than `dot(x, y)` for length-1000 arrays (and almost the same speed for length-100000 arrays, where it is memory-bound). In comparison, `dot(x, y) / sqrt(dot(x, x) * dot(y, y))` is about 3x slower, and `dot(x, y) / (norm(x) * norm(y))` is about 8x slower; that’s because the `norm` function is more complicated in order to guard against spurious overflow/underflow (i.e. it trades off performance for robustness). (The cost of the `sqrt` computation seems irrelevant unless the arrays are tiny.) So if this is truly a performance-critical operation in some application, you could optimize it in Julia.
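If you want to reproduce this kind of comparison, a minimal benchmark sketch (assuming BenchmarkTools.jl is installed) might look like the following; the exact ratios will of course depend on your CPU and the array length.

```julia
# Minimal benchmark sketch (assumes BenchmarkTools.jl); the ratios quoted
# above will vary with the CPU and the array length.
using BenchmarkTools, LinearAlgebra

x, y = rand(1000), rand(1000)

@btime dot($x, $y)                                    # baseline: one BLAS-1 call
@btime mycos($x, $y)                                  # fused handwritten loop
@btime dot($x, $y) / sqrt(dot($x, $x) * dot($y, $y))  # three separate dot calls
@btime dot($x, $y) / (norm($x) * norm($y))            # norm trades speed for robustness
```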
This is an instance of a general rule: re-using optimized library routines is good practice, but (if you know what you are doing and expend enough effort) code that is specialized for your problem can often beat more general optimized-library calls, especially a sequence of optimized library calls (e.g. by fusing loops) … at least, if you are using a language (Julia, Mojo, C++, Rust, …) in which you can efficiently implement performance-critical inner loops.
And again, I agree that there are other problems that benefit from hand-vectorization. (Which can also be done in Julia.)
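For instance, explicit SIMD kernels can be written with the SIMD.jl package. Here is a rough sketch of a hand-vectorized dot product (`dot_vec` is a hypothetical name of my own; it assumes the length is a multiple of the vector width `N` and omits the scalar remainder loop a real implementation would need):

```julia
using SIMD

# Rough sketch of an explicitly vectorized dot product with SIMD.jl.
# Assumes length(x) == length(y) and that the length is a multiple of N;
# a real implementation would also handle the remainder elements.
function dot_vec(x::Vector{T}, y::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    @assert length(x) == length(y) && length(x) % N == 0
    s = Vec{N,T}(zero(T))      # N-wide accumulator, all lanes zero
    lane = VecRange{N}(0)      # loads N consecutive elements at once
    @inbounds for i in 1:N:length(x)
        s += x[lane + i] * y[lane + i]   # lane-wise multiply and accumulate
    end
    return sum(s)              # horizontal sum of the N partial results
end

dot_vec(rand(Float32, 1024), rand(Float32, 1024), Vec{8,Float32})
```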
I’m not using a processor with hardware-accelerated `BFloat16` or `Float16` at the moment, sorry.
But it’s not clear to me what point you are trying to make here. Is there some optimization you think can be done in Mojo but not in Julia? Or are you suggesting that we should add an optimized library routine for cosine similarity (Statistics.jl has a `cor` function for Pearson correlation, though I’m not sure how heavily optimized it is)? Or …?