Since this is also a BLAS-1-like operation, the compiler does a pretty good job on a handwritten loop with `@simd`. e.g. this SIMD-vectorizes:
```julia
function mycos(x, y)
    s = zero(x[begin]) * zero(y[begin])
    nx = ny = s
    @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
        nx = muladd(x[i], x[i], nx)
        ny = muladd(y[i], y[i], ny)
    end
    return s / sqrt(nx * ny)
end
```
and is about 1.8x slower than `dot(x, y)` for length-1000 arrays (and almost the same speed for length-100000 arrays, where it is memory-bound). In comparison, `dot(x, y) / sqrt(dot(x, x) * dot(y, y))` is about 3x slower, and `dot(x, y) / (norm(x) * norm(y))` is about 8x slower; that’s because the `norm` function is more complicated in order to guard against spurious overflow/underflow (i.e. it trades off performance for robustness). (The cost of the `sqrt` computation seems irrelevant unless the arrays are tiny.) So if this is truly a performance-critical operation in some application, you could optimize it in Julia.
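If you want to reproduce this kind of comparison, a minimal benchmark sketch (assuming BenchmarkTools.jl is installed) might look like the following; the exact ratios will of course depend on your CPU and the array length.

```julia
# Minimal benchmark sketch (assumes BenchmarkTools.jl); the ratios quoted
# above will vary with the CPU and the array length.
using BenchmarkTools, LinearAlgebra

x, y = rand(1000), rand(1000)

@btime dot($x, $y)                                    # baseline: one BLAS-1 call
@btime mycos($x, $y)                                  # fused handwritten loop
@btime dot($x, $y) / sqrt(dot($x, $x) * dot($y, $y))  # three separate dot calls
@btime dot($x, $y) / (norm($x) * norm($y))            # norm trades speed for robustness
```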
This is an instance of a general rule: re-using optimized library routines is good practice, but (if you know what you are doing and expend enough effort) code that is specialized for your problem can often beat more general optimized-library calls, especially a sequence of optimized library calls (e.g. by fusing loops) … at least, if you are using a language (Julia, Mojo, C++, Rust, …) in which you can efficiently implement performance-critical inner loops.
And again, I agree that there are other problems that benefit from hand-vectorization. (Which can also be done in Julia.)
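For instance, explicit SIMD kernels can be written with the SIMD.jl package. Here is a rough sketch of a hand-vectorized dot product (`dot_vec` is a hypothetical name of my own; it assumes the length is a multiple of the vector width `N` and omits the scalar remainder loop a real implementation would need):

```julia
using SIMD

# Rough sketch of an explicitly vectorized dot product with SIMD.jl.
# Assumes length(x) == length(y) and that the length is a multiple of N;
# a real implementation would also handle the remainder elements.
function dot_vec(x::Vector{T}, y::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    @assert length(x) == length(y) && length(x) % N == 0
    s = Vec{N,T}(zero(T))      # N-wide accumulator, all lanes zero
    lane = VecRange{N}(0)      # loads N consecutive elements at once
    @inbounds for i in 1:N:length(x)
        s += x[lane + i] * y[lane + i]   # lane-wise multiply and accumulate
    end
    return sum(s)              # horizontal sum of the N partial results
end

dot_vec(rand(Float32, 1024), rand(Float32, 1024), Vec{8,Float32})
```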
I’m not using a processor with hardware-accelerated `BFloat16` or `Float16` at the moment, sorry.
But it’s not clear to me what point you are trying to make here. Is there some optimization you think can be done in Mojo but not in Julia? Or are you suggesting that we should add an optimized library routine for cosine similarity (Statistics.jl has a `cor` function for Pearson correlation, though I’m not sure how heavily optimized it is)? Or …?