v?Mul in MKL

In trying to speed up .*, I was wondering if much progress had been made toward making v?Mul accessible (maybe as part of MKL.jl). Mostly I'm just wondering if it is worth upgrading to 1.7 to make the install work. Currently the fastest .* I can find is just:

function vMul_test!(A, B, C)
    n = length(A)
    # straight SIMD loop: A .= B .* C with bounds checks elided
    @simd for i = 1:n
        @inbounds A[i] = B[i] * C[i]
    end
end

Have you tried ccalling into MKL and benchmarked the performance of the MKL functions? I’d be curious.
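For reference, here is a minimal sketch of what such a ccall might look like. It assumes MKL_jll is installed (it provides libmkl_rt) and that the default LP64 interface is in use, so MKL_INT is a 32-bit Cint; the wrapper name mkl_mul! is just for illustration. vdMul is MKL's documented elementwise double-precision multiply, y[i] = a[i] * b[i]:

using MKL_jll  # provides the libmkl_rt library

function mkl_mul!(y::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    n = length(y)
    @assert length(a) == n && length(b) == n
    ccall((:vdMul, libmkl_rt), Cvoid,
          (Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          n, a, b, y)
    return y
end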

I would assume that the MKL variants are multithreaded, so you should probably also use threads in your Julia implementation. Unfortunately, we don't have multithreaded broadcasting (yet). Maybe there exist implementations in packages?
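For example, a hand-rolled threaded variant of the loop above might look like this (a sketch; it assumes Julia was started with multiple threads, e.g. julia -t auto):

function vMul_threads!(A, B, C)
    # eachindex(A, B, C) also checks that the three arrays have matching axes
    Threads.@threads for i in eachindex(A, B, C)
        @inbounds A[i] = B[i] * C[i]
    end
end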

Oh, and using MKL.jl with Julia 1.7 is a dream. Definitely worth the upgrade! 🙂


FastBroadcast.jl has multithreaded broadcast for such a case.
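For instance, a minimal sketch of FastBroadcast.jl's @.. macro, whose thread=true option enables multithreading on the fused broadcast:

using FastBroadcast
@.. thread=true A = B * C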


You can access MKL’s v?Mul via IntelVectorMath.jl.
Previous benchmarks showed that the performance of v?Mul is equivalent to Base's or slower, so the official release didn't add the related routines.
You can add

def_binary_op(Float64, Float64, :multiply, :multiply!, :Mul, false)
def_binary_op(Float32, Float32, :multiply, :multiply!, :Mul, false)
def_binary_op(ComplexF64, ComplexF64, :multiply, :multiply!, :Mul, false)
def_binary_op(ComplexF32, ComplexF32, :multiply, :multiply!, :Mul, false)

to src/IntelVectorMath.jl, and call IVM.multiply / IVM.multiply! for usage.
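After those definitions are added, usage would look something like this (a sketch; IVM is the alias the package exports for IntelVectorMath):

using IntelVectorMath

B = rand(1000); C = rand(1000); A = similar(B)
IVM.multiply!(A, B, C)   # in-place: A .= B .* C via MKL's vdMul
D = IVM.multiply(B, C)   # allocating variant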


Thanks for pointing out how to add v?Mul to IntelVectorMath.jl (which is a really nice package!) and for providing context as to why it was not added to the official release. I will check out some of the other multithreading suggestions above to attempt a speedup.

For context, here are timings on my machine with some of the easiest multithreading solutions.

using Einsum, BenchmarkTools, LoopVectorization

A = rand(1000,1000)
B = rand(1000,1000)
C = rand(1000,1000);

function f1!(A, B, C)   # single-threaded @einsum (Einsum.jl)
    @einsum A[i,j] = B[i,j] * C[i,j]
end

function f2!(A, B, C)   # plain @simd loop, single-threaded
    n = length(A)
    @simd for i = 1:n
        @inbounds A[i] = B[i] * C[i]
    end
end

function f3!(A, B, C)   # @avxt: multithreaded LoopVectorization.jl loop
    n = length(A)
    @avxt for i = 1:n
        @inbounds A[i] = B[i] * C[i]
    end
end

function f4!(A, B, C)   # @vielsum: multithreaded @einsum (Einsum.jl)
    @vielsum A[i,j] = B[i,j] * C[i,j]
end

On a single thread I find ~1 ms for all of the methods, with f2! being optimal (since it has no overhead from checking whether more threads are available). When using 2 threads, f3! seems optimal; timings below.

julia> @btime f1!($A,$B,$C);
  1.174 ms (0 allocations: 0 bytes)

julia> @btime f2!($A,$B,$C);
  1.029 ms (0 allocations: 0 bytes)

julia> @btime f3!($A,$B,$C);
  457.723 μs (0 allocations: 0 bytes)

julia> @btime f4!($A,$B,$C);
  567.861 μs (11 allocations: 1.55 KiB)

I wouldn’t expect MKL to be faster than Julia on plain broadcasted multiplication. Is there any reason to expect it would be?

Are you using an old version of LoopVectorization.jl? Or are @avx/@avxt still available alongside the new @turbo/@tturbo?


Whoops, I was looking at old docs but using a current version (v0.12.66). It appears both are still available, but thanks for the comment; I will switch to @tturbo.

And I guess I don’t have good enough intuition to know whether to expect the MKL solution to be faster, which is why I wanted to just do the experiment.
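For completeness, the same kernel with the current macro name would look like this (a sketch; @tturbo is the threaded counterpart of @turbo, and it handles bounds-check elision itself, so no explicit @inbounds is needed):

using LoopVectorization

function f5!(A, B, C)
    @tturbo for i in eachindex(A)
        A[i] = B[i] * C[i]
    end
end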