While trying to speed up .*, I was wondering whether much progress has been made on making v?Mul accessible (maybe as part of MKL.jl). Mostly I am wondering whether it is worth upgrading to Julia 1.7 to make the install work. Currently the fastest .* that I can see is just to do:

function vMul_test!(A,B,C)
    n = length(A)
    @simd for i = 1 : n
        @inbounds A[i] = B[i] * C[i]
    end
end

Have you tried ccalling into MKL and benchmarked the performance of the MKL functions? I’d be curious.
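For what it's worth, a direct ccall could look roughly like the sketch below. It assumes the MKL_jll package is installed (x86-64 only) and that libmkl_rt is using its default LP64 interface, so the length argument is a 32-bit Cint. The wrapper name vdmul! is made up for illustration; vdMul itself is the double-precision member of MKL's v?Mul family.

```julia
using MKL_jll  # assumption: provides libmkl_rt, the path to MKL's runtime library

# Hypothetical wrapper around MKL's vdMul: y[i] = a[i] * b[i] for Float64 arrays.
# Assumes the LP64 interface (32-bit integer length); under ILP64 the first
# argument type would have to be Int64 instead of Cint.
function vdmul!(y::Array{Float64}, a::Array{Float64}, b::Array{Float64})
    n = length(y)
    (length(a) == n && length(b) == n) ||
        throw(DimensionMismatch("all three arrays must have the same length"))
    ccall((:vdMul, libmkl_rt), Cvoid,
          (Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          n, a, b, y)
    return y
end

a = rand(1000)
b = rand(1000)
y = similar(a)
vdmul!(y, a, b)
```

Benchmarking this against the plain loop with BenchmarkTools' @btime (interpolating the arrays with $) would then answer the question directly.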

I would assume that the MKL variants are multithreaded. So you should probably also use threads in your Julia implementation. Unfortunately, we don’t have multithreaded broadcasting (yet). Maybe there exist implementations in packages?
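As a rough sketch of what a hand-threaded version could look like using only Base: split the index range into one contiguous chunk per thread and keep the @simd loop inside each chunk. The function name vmul_threaded! and the chunking scheme are my own choices for illustration, and Julia has to be started with more than one thread (e.g. julia -t 2) for this to help at all.

```julia
# Hypothetical threaded elementwise multiply: A[i] = B[i] * C[i].
# Each thread processes one contiguous chunk so the inner loop can still use SIMD.
function vmul_threaded!(A::Array, B::Array, C::Array)
    n = length(A)
    (length(B) == n && length(C) == n) ||
        throw(DimensionMismatch("all three arrays must have the same length"))
    nchunks = Threads.nthreads()
    Threads.@threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1  # first linear index of chunk c
        hi = div(c * n, nchunks)            # last linear index of chunk c
        @inbounds @simd for i in lo:hi
            A[i] = B[i] * C[i]
        end
    end
    return A
end

B = rand(1000, 1000)
C = rand(1000, 1000)
A = similar(B)
vmul_threaded!(A, B, C)
```

Since an elementwise multiply does very little arithmetic per byte loaded, it is likely limited by memory bandwidth, so I would not expect the speedup to scale linearly with the thread count.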

You can access MKL’s v?Mul via IntelVectorMath.jl
Previous benchmarks showed that v?Mul performs on par with Base or slower, so the official release did not add the related routines.
You can add

Thanks for pointing out how to add the v?Mul to IntelVectorMath.jl (which is a really nice package!) and providing context as to why it was not added to the official release. I will check out some of the other multithreading suggestions above for an attempt at a speed up.

For context, here are timings on my machine with some of the easiest multithreading solutions.

using Einsum, BenchmarkTools, LoopVectorization
A = rand(1000,1000)
B = rand(1000,1000)
C = rand(1000,1000);
# Einsum, single-threaded
function f1!(A,B,C)
    @einsum A[i,j] = B[i,j] * C[i,j]
end
# plain @simd loop, single-threaded
function f2!(A,B,C)
    n = length(A)
    @simd for i = 1 : n
        @inbounds A[i] = B[i] * C[i]
    end
end
# LoopVectorization, threaded (@avxt is the older name of @tturbo)
function f3!(A,B,C)
    n = length(A)
    @avxt for i = 1 : n
        @inbounds A[i] = B[i] * C[i]
    end
end
# Einsum, threaded
function f4!(A,B,C)
    @vielsum A[i,j] = B[i,j] * C[i,j]
end

On a single thread I find ~1 ms for all of the methods, with f2! being optimal (since it has no overhead from checking whether more threads are available). When using 2 threads, f3! seems optimal, timings below.

Whoops, I was looking at old docs while using a current version (v0.12.66). It appears both are still available, but thanks for the comment. I will switch to @tturbo.
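For reference, the @tturbo version of the threaded loop would presumably look something like this. It is only a sketch assuming a LoopVectorization release that provides @tturbo, and the name f3_tturbo! is made up here. The explicit @inbounds is dropped because, as far as I understand, @turbo/@tturbo do not bounds-check anyway, which is also why the length check is done up front.

```julia
using LoopVectorization

# Threaded, SIMD-vectorized elementwise multiply via @tturbo.
function f3_tturbo!(A, B, C)
    (length(A) == length(B) == length(C)) ||
        throw(DimensionMismatch("all three arrays must have the same length"))
    @tturbo for i in eachindex(A)
        A[i] = B[i] * C[i]
    end
    return A
end

B = rand(1000, 1000)
C = rand(1000, 1000)
A = similar(B)
f3_tturbo!(A, B, C)
```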

And I guess I don't have good enough intuition to know whether to expect the MKL solution to be faster, which is why I wanted to just do the experiment.