You forgot the @inbounds, now your code doesn’t SIMD and it is 4x-8x slower depending on the element type. But once you add @inbounds you need to add code to check the sizes and give good error messages. Also, if you have matrices, make sure to iterate column-wise, otherwise your performance will be trash.
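For reference, the pattern being described looks roughly like this (a generic sketch, not code from this thread; the name scale! is made up):
function scale!(B, A, alpha)
    # Explicit check up front, so the @inbounds below cannot read out of bounds
    axes(B) == axes(A) || throw(DimensionMismatch("B and A must have the same axes"))
    # Innermost loop over the first index matches Julia's column-major layout,
    # and @inbounds lets the loop SIMD
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        B[i, j] = alpha * A[i, j]
    end
    return B
end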
I beg to disagree. True, there is more performance to be squeezed out, but look at this
module linearcombinationtest
using BenchmarkTools
using InteractiveUtils
using Test
lincomb2!(result, alpha, a, beta, b) = begin
    for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i]
    end
    return result
end

lincomb3!(result, alpha, a, beta, b, gamma, c) = begin
    for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i] + gamma * c[i]
    end
    return result
end

function test()
    M, N = 2000, 2000
    result = fill(0.0, M, N)
    alpha, beta, gamma, delta, epsilon, zeta, eta = rand(7)
    a, b, c, d, e, f, g = (rand(M, N) for _ in 1:7)
    println("Fusion versus loops")
    println("Linear Combination of 2 terms")
    @btime begin @. $result = $alpha * $a + $beta * $b; end
    @btime lincomb2!($result, $alpha, $a, $beta, $b)
    println("Linear Combination of 3 terms")
    @btime @. $result = $alpha * $a + $beta * $b + $gamma * $c
    @btime lincomb3!($result, $alpha, $a, $beta, $b, $gamma, $c)
    true
end
end
using .linearcombinationtest
linearcombinationtest.test()
runs in
Fusion versus loops
Linear Combination of 2 terms
9.634 ms (0 allocations: 0 bytes)
10.405 ms (0 allocations: 0 bytes)
Linear Combination of 3 terms
12.922 ms (0 allocations: 0 bytes)
13.469 ms (0 allocations: 0 bytes)
Throw in @inbounds and broadcast and loops are equally fast. All it takes is a few more keystrokes.
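For completeness, the @inbounds version of lincomb2! I have in mind looks like this (a sketch in the same style as the code above; the explicit size check is what makes @inbounds safe):
lincomb2_inbounds!(result, alpha, a, beta, b) = begin
    # Check the indices once, outside the loop, so skipping bounds checks is safe
    axes(result) == axes(a) == axes(b) ||
        throw(DimensionMismatch("result, a and b must have matching axes"))
    @inbounds for i in eachindex(a)
        result[i] = alpha * a[i] + beta * b[i]
    end
    return result
end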
Also, that code is artificial; when you have larger vector expressions (which is definitely a common use case; that’s why matlab and numpy are enough for 90% of numeric computing) you don’t want to drop back down to indices. Vector operations are nice for a lot of applications; that’s why matlab and numpy were able to develop as well as they did while being atrociously slow in loops.
The question is: is there anything actionable here? As far as I understand, there’s:
- a performance bug with broadcasting, which is apparently just waiting for someone with the right skills and time
- the fact that Julia uses OpenBLAS by default, but MKL is essentially one command away (although there are caveats; see the sketch below)
- the fact that Julia’s default allocator is not as good as it could be for this kind of workflow; the usual answer is to discourage allocations, which is sometimes annoying
- automatic parallelization, for which there is Strided.jl, and which might see some more love once threading gets stable and fast
Is there anything else coming out of the benchmark that doesn’t fit into that?
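On the MKL point above, switching the BLAS backend really is short, assuming a recent Julia (1.7+, with libblastrampoline) and the MKL.jl package; the exact mechanism has changed between versions, so take this as a sketch:
using MKL            # load MKL.jl first so it can swap in MKL as the BLAS backend
using LinearAlgebra

BLAS.get_config()    # on Julia 1.7+ this should now list the MKL library instead of OpenBLAS
LinearAlgebra.peakflops(4096)   # quick sanity check that large dgemm runs at the expected speed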
My argument is as follows:
- Julia indeed should allow the user to leverage MKL easily. You may say it is done given MKL.jl.
- Julia needs to integrate MKL and other Intel packages in a synergistic way.
This is trickier, but can be done. For instance, I’d like to see Julia taking advantage of all of MKL’s APIs. MKL has a JIT and a Direct Call mode to make it faster on smaller matrices. I would like Julia’s smart JIT to see that there is a loop multiplying the same matrix over and over by different matrices and use MKL’s Compact API. There are also ways to utilize the Batch and Packed APIs (see Squeeze More Performance from Intel MKL). I want Julia to use SVML for SIMD in element-wise math operations. Namely, tighter integration. Julia, with the flexibility built into the language, can be the perfect driver to squeeze everything MKL has to offer. This is far from done. Actually, I’m not sure even a single step has been taken.
There is no drama in MKL BLAS + LAPACK being faster than OpenBLAS. On the contrary, it is amazing that they are even close, given the huge amount of resources the MKL team has compared to OpenBLAS.
Also, Julia’s weakness in broadcasting will be solved. The way I see it, there will be some decorations to tell the compiler there is no aliasing, no dependence, etc., just like those pragmas in C (ivdep, vector, aligned, etc.), plus a decoration to force, disable or auto-select SIMD and multi-threading.
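Some of those decorations already exist for explicit loops; a minimal sketch (the function name is made up):
function axpy_ivdep!(y, a, x)
    # @inbounds drops bounds checks; @simd ivdep asserts there are no
    # loop-carried dependencies, so the compiler is free to vectorize
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end
Threads.@threads plays a similar opt-in role on the multi-threading side; broadcasting would presumably need analogous annotations.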
The question is whether one day something like this:
for ii = 1:1000
    tC[:, :, ii] = tA[:, :, ii] * tB[:, :, ii];
end
will be optimized into an MKL Batch API call, so we’ll have big-matrix performance on small to medium matrices.
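In the meantime, a hand-written version of that loop that at least avoids the temporaries created by slicing might look like this (a sketch; the actual batching would still be MKL’s job):
using LinearAlgebra

function batched_mul!(tC, tA, tB)
    for ii in axes(tA, 3)
        # mul! writes directly into a view of tC, so no slice copies are allocated
        @views mul!(tC[:, :, ii], tA[:, :, ii], tB[:, :, ii])
    end
    return tC
end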
Will we one day see something like this:
mA = qr(mB);
mC = qr(mD);
mE = qr(mF);
mG = qr(mH);
transformed into MKL Compact API calls for small matrices?
I also wish to see:
for ii = 1:100
    tC[:, :, ii] = mA * tB[:, :, ii];
end
call the MKL Compact API.
The above has the potential to bring large-matrix performance to many use cases where the user works with small to medium matrices. It is a joker Julia could hold in its hand.
Not to mention having MKL’s sparse solvers, utilizing some of its VML library, etc.
Using SVML would allow Julia to match Numba’s JIT optimization in Python for element-wise mathematical expressions on arrays.
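To make the element-wise case concrete, this is the kind of kernel meant here (a sketch; the function name is made up):
function expcombine!(out, x, y)
    # Even with @inbounds and @simd, the scalar exp call typically keeps this
    # loop from vectorizing; an SVML-style vectorized math library (or a tool
    # like LoopVectorization.jl) is what closes that gap.
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = exp(x[i]) * y[i]
    end
    return out
end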
So, for me, Julia’s elegant machinery can create the perfect driver around those libraries.
I don’t think everything should be written in Julia. I think Julia can also be the intelligent glue that utilizes low-level libraries to the maximum, in a fashion no other language can without being explicit.
Remark
I have no idea where you got “no optimized allocator in Julia” from. Is there a GitHub issue showing that? It seems Julia’s allocator is as efficient as any other. It is more flexible, as you can do “low level” operations which are usually not exposed.
Am I the only one who is a bit uneasy about using a closed-source library with Julia? (MKL…)
Edit: Wouldn’t it be nice to have a fast BLAS implementation in Julia?
I was just in the process of writing basically that: Intel can change the MKL licensing terms again in the future as they see fit. Also, there are more platforms than x86/amd64 out there: ARM, and eventually RISC-V, which would be really interesting.
Edit: And of course POWER9 !
Plus, by boosting MKL with Julia we are hurting an OSS project, OpenBLAS.
Imo, the preferred solution would be to use pure Julia for at least level 1 and 2 BLAS. It will be a ton of work to get it to OpenBLAS, not to mention MKL, speeds, but besides showing off high-performance Julia, it will give us better handling of custom types and mixed-precision operations.
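A sketch of why the pure-Julia route helps with custom types and mixed precision (a generic level-1 style kernel; the name is made up):
function generic_axpy!(y, a, x)
    # Plain Julia, so it works for any element types supporting * and +:
    # BigFloat, dual numbers, or Float32 vectors scaled by a Float64 coefficient.
    @inbounds @simd for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end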
Semi-jokingly saying this, but in many tasks, language (Julia vs. Python) and more cores (AMD 7nm vs. Intel 14nm) will destroy the advantage of MKL, not to mention the potential risk of proprietary software (say they decide to charge you in a future iteration, or be an asshole like Nvidia).
Unless 90% of computing time is strictly dense matrix computation in tight loops, which is rare in science / data science (even in ML you have diff time, which dominates, I think).