First and foremost, thanks to everyone involved in advancing Julia. I recently started testing it out, and I am having lots of fun. I am not sure if this is the proper forum for this question, but since I am a beginner, I figure this would be appropriate.
I am trying out a code for CFD, and I am discretizing the Navier-Stokes equations using finite differences (using Symbolics to construct the expressions). My loop, which is placed within a function that I checked is type stable, then looks like this:
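(A condensed sketch of the loop's shape only; the actual body is the long Symbolics-generated stencil expression that is reproduced verbatim in the kernel later in this thread, so most of its terms are elided here.)
function f(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    for jj in 5:104
        @inbounds @simd for ii in 5:104
            dpy_dt_Δ[ii, jj] =
                -120.0 * τ_xy_Δ[ii-1, jj] + 30.0 * τ_xy_Δ[ii-2, jj] -
                5.714285714285714 * τ_xy_Δ[ii-3, jj] + 0.5357142857142857 * τ_xy_Δ[ii-4, jj] +
                # ... dozens more multiply-add terms over the ±4-point stencil ...
                0.5357142857142857 * (p_Δ[ii, jj+4] + py_Δ[ii, jj+4] * v_Δ[ii, jj+4])
        end
    end
    return nothing
end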
It's ugly, but that is why I got it from Symbolics. Upon reviewing the code, I can see some vectorization, but not a significant amount. I would expect (perhaps naively) that the compiler would recognize that the same operations are being performed for ii = 1, 2, 3, etc., and vectorize them, computing ii = 1:8 in "one go". Instead, it seems to be trying to vectorize each loop iteration (which is harder).
I tried using @turbo or @avx, but this only resulted in MUCH slower code. Am I having unrealistic expectations, or am I just using the wrong macros?
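(Roughly where the macro goes, with a truncated stencil body just for illustration; @avx is the older name of the same LoopVectorization.jl macro.)
using LoopVectorization

function f_turbo!(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    @turbo for jj in 5:104, ii in 5:104
        # truncated body; the real one is the full generated expression
        dpy_dt_Δ[ii, jj] = -120.0 * τ_xy_Δ[ii-1, jj] + 120.0 * τ_xy_Δ[ii+1, jj] -
                           120.0 * τ_yy_Δ[ii, jj-1] + 120.0 * τ_yy_Δ[ii, jj+1]
    end
    return nothing
end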
I definitely wouldn't expect autovectorization to try to work within an iteration, as opposed to across iterations. How exactly did you observe this, and more generally what did you mean by "reviewing the code"?
Maybe something to do with the SLP vectorizer pass? It may have some trouble with big expressions.
You could try to help it by writing each of the similar mul-adds that access memory sequentially on separate lines (something like the sketch below), and/or try combinations of, or removing, the various macros you used. In my experience, fast math doesn't always help vectorisation.
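(A sketch of what that could look like, using hypothetical accumulator names and only a few of the real terms:)
function f_split!(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    @inbounds for jj in 5:104, ii in 5:104
        # group terms that stream through the same array on their own lines
        a  = -120.0 * τ_xy_Δ[ii-1, jj] + 120.0 * τ_xy_Δ[ii+1, jj]
        a +=   30.0 * τ_xy_Δ[ii-2, jj] -  30.0 * τ_xy_Δ[ii+2, jj]
        b  = -120.0 * τ_yy_Δ[ii, jj-1] + 120.0 * τ_yy_Δ[ii, jj+1]
        b +=   30.0 * τ_yy_Δ[ii, jj-2] -  30.0 * τ_yy_Δ[ii, jj+2]
        dpy_dt_Δ[ii, jj] = a + b   # plus the remaining stencil terms
    end
    return nothing
end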
In general, LLVM's loop vectorizer isn't very good (primarily because it tries to work from the inside out rather than from the outside in). We want to replace it with VPlan eventually, but the last time we checked it wasn't ready for prime time.
So here is a minimal example: I am comparing the standard loop with versions using views, @turbo, and @avx. For @turbo I tried some of its arguments, but nothing did better than (or even came close to) the standard loop.
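(A condensed sketch of that comparison, reusing the f and f_turbo! sketches above; the posted example also had @views and @avx variants.)
using BenchmarkTools

n = 108
u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ = ntuple(_ -> randn(n, n), 6)
dpy_dt_Δ = zeros(n, n)

# interpolate the globals with $ so the benchmark doesn't time non-const global access
@btime f($u_Δ, $v_Δ, $p_Δ, $τ_yy_Δ, $τ_xy_Δ, $py_Δ, $dpy_dt_Δ)         # plain loop
@btime f_turbo!($u_Δ, $v_Δ, $p_Δ, $τ_yy_Δ, $τ_xy_Δ, $py_Δ, $dpy_dt_Δ)  # LoopVectorization version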
How exactly did you observe this, and more generally what did you mean by "reviewing the code"?
I checked @code_llvm and mostly did not find vectorized additions/multiplications.
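(For reference, this is the kind of check meant, reusing f and the arrays defined above; in the REPL, InteractiveUtils is already loaded.)
using InteractiveUtils   # provides @code_llvm / @code_native in scripts

# look for <N x double> vector types in the LLVM IR ...
@code_llvm debuginfo=:none f(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)

# ... or for packed instructions (e.g. vfmadd231pd on x86) in the native code
@code_native debuginfo=:none f(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)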
Hum… is there a way I can "force" it to, at least when the loop is somewhat easy to parallelize, like this one? Or should I just wait for the compiler to improve?
I am not desperate for maximum performance right now, so that would not be a big issue, and I can always export this as a C or Fortran function and call it; but I really want to avoid the multiple-language paradigm as much as I can…
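(For completeness, a hedged sketch of that escape hatch via @ccall, assuming a hypothetical shared library libstencil.so built from a Fortran subroutine declared bind(c, name="update_dpy"); the routine name and argument list are made up for illustration.)
# e.g. gfortran -O3 -march=native -shared -fPIC stencil.f90 -o libstencil.so
n = 108
u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ = ntuple(_ -> randn(n, n), 6)
dpy_dt_Δ = zeros(n, n)

@ccall "./libstencil.so".update_dpy(
    u_Δ::Ptr{Float64}, v_Δ::Ptr{Float64}, p_Δ::Ptr{Float64},
    τ_yy_Δ::Ptr{Float64}, τ_xy_Δ::Ptr{Float64}, py_Δ::Ptr{Float64},
    dpy_dt_Δ::Ptr{Float64}, n::Ref{Cint})::Cvoid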
Do you want to make it parallel, or is it part of a bigger parallel workflow?
The best you can get using fully parallel code + SIMD, compared to the simple version, leads to this:
where fb is the function with @simd and @inbounds (I also added @fastmath), while fc is harder to set up (compiled to MLIR code on the CPU) but parallel, and the last one is parallel with Julia. So I guess it's fine; you're not too far off for a single-threaded version. Those are overkill, obviously, but it gives a hint that going lower may not be possible.
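(For reference, a minimal sketch of what the "parallel with julia" variant could look like, assuming Julia is started with several threads, e.g. julia -t 8; the loop body is the same generated stencil, truncated here.)
function f_threads!(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    Threads.@threads for jj in 5:104
        @inbounds @simd for ii in 5:104
            # truncated: same Symbolics-generated stencil expression as in f
            @fastmath dpy_dt_Δ[ii, jj] = -120.0 * τ_xy_Δ[ii-1, jj] + 120.0 * τ_xy_Δ[ii+1, jj] -
                                         120.0 * τ_yy_Δ[ii, jj-1] + 120.0 * τ_yy_Δ[ii, jj+1]
        end
    end
    return nothing
end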
You can still try SIMD.jl and give it a go, but it will be hard. Also try Fortran with -O3, just to see how much of a range you have in terms of performance.
This is what I get:
PS C:\Users\yolha\Desktop\juju_tests\MITNote> gfortran -O3 -ffree-line-length-none main.f90 -o fb.exe
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
Median execution time (s): 7.9800000000000002E-005
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
Median execution time (s): 8.4400000000000005E-005
PS C:\Users\yolha\Desktop\juju_tests\MITNote> gfortran -O3 -march=native -funroll-loops -ftree-vectorize -ffast-math -ffree-line-length-none main.f90 -o fb.exe
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
Samples ( 10000 ), evals per sample: 1
Min (s): 2.3899999999999998E-005
Median (s): 2.5100000000000000E-005
Mean (s): 2.6278129999999912E-005
Max (s): 4.2549999999999999E-004
I'm getting similar timings with Julia here in single-threaded mode, so I would continue with
I think you probably meant to write (1:100) .+ 4 instead; what you wrote is equivalent to 1:(100 .+ 4).
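(Quick REPL check of the difference; the range operator binds less tightly than .+ here:)
julia> 1:100 .+ 4      # parsed as 1:(100 .+ 4)
1:104

julia> (1:100) .+ 4    # the shifted range presumably intended
5:104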
Are you sure your baseline isn't already vectorized? Looking at @code_llvm f(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ) on my Zen 5 machine, I already see a bunch of <8 x double>s:
I saw quite a lot in the baseline @simd method. That's system-dependent, though, because I'm seeing <4 x double>. The thing is, there's a vectorized branch (%vector.body) and a non-vectorized branch (%L9), and I don't know what the check is doing or which branch is running for the example's inputs.
You're making an anonymous function and assigning it to a non-const global variable, which is not the behavior of the idiomatic function f(.... In general, Julia cannot optimize non-const global variables without helpful type annotations; it's the second entry in the Performance Tips. The method runs long enough that this doesn't skew the benchmarks much for me, but shorter methods, or calling this method from another method, would incur hefty performance penalties.
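(A tiny illustration with made-up names; either of the latter two forms gives the compiler a fixed binding to specialize against:)
# non-const global: the binding can be reassigned to anything, so calls through it can't be fully optimized
g = (x, y) -> x .+ y

# fixed bindings the compiler can optimize
const g_const = (x, y) -> x .+ y

function g_func(x, y)
    return x .+ y
end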
Cool! Can you point me to some reference for MLIR? I want a serial version of the function (I am parallelizing calls to it), but a 2x speed-up (what you got compared to the parallel Julia version) would be cool.
Sounds feasible. I will give it a try at a later stage to compare at least! Thanks for the suggestion!
Nice! I actually tried Reactant.jl (without KernelAbstractions), but didn't get any speed-up.
Would you mind sharing your script? I am actually intending to scale up the code, so a factor of 2 in performance is definitely something I would like to look into!!!
using Reactant,KernelAbstractions
using CUDA # only needed to be able to raise the kernel; the GPU isn't used in this example
using BenchmarkTools
Reactant.set_default_backend("cpu")
@kernel inbounds=true function fooker!(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    i, j = @index(Global, NTuple)
    ii = i + 4
    jj = j + 4
    @fastmath dpy_dt_Δ[ii, jj] = (((((((((((((((((((((((((((-120.0 * τ_xy_Δ[-1 + ii, jj] + 30.0 * τ_xy_Δ[-2 + ii, jj]) - 5.714285714285714 * τ_xy_Δ[-3 + ii, jj]) + 0.5357142857142857 * τ_xy_Δ[-4 + ii, jj] + 120.0 * τ_xy_Δ[1 + ii, jj]) - 30.0 * τ_xy_Δ[2 + ii, jj]) + 5.714285714285714 * τ_xy_Δ[3 + ii, jj]) - 0.5357142857142857 * τ_xy_Δ[4 + ii, jj]) - 120.0 * τ_yy_Δ[ii, -1 + jj]) + 30.0 * τ_yy_Δ[ii, -2 + jj]) - 5.714285714285714 * τ_yy_Δ[ii, -3 + jj]) + 0.5357142857142857 * τ_yy_Δ[ii, -4 + jj] + 120.0 * τ_yy_Δ[ii, 1 + jj]) - 30.0 * τ_yy_Δ[ii, 2 + jj]) + 5.714285714285714 * τ_yy_Δ[ii, 3 + jj]) - 0.5357142857142857 * τ_yy_Δ[ii, 4 + jj]) + (120.0 * py_Δ[-1 + ii, jj]) * u_Δ[-1 + ii, jj]) - (30.0 * py_Δ[-2 + ii, jj]) * u_Δ[-2 + ii, jj]) + (5.714285714285714 * py_Δ[-3 + ii, jj]) * u_Δ[-3 + ii, jj]) - (0.5357142857142857 * py_Δ[-4 + ii, jj]) * u_Δ[-4 + ii, jj]) - (120.0 * py_Δ[1 + ii, jj]) * u_Δ[1 + ii, jj]) + (30.0 * py_Δ[2 + ii, jj]) * u_Δ[2 + ii, jj]) - (5.714285714285714 * py_Δ[3 + ii, jj]) * u_Δ[3 + ii, jj]) + (0.5357142857142857 * py_Δ[4 + ii, jj]) * u_Δ[4 + ii, jj] + 120.0 * (p_Δ[ii, -1 + jj] + py_Δ[ii, -1 + jj] * v_Δ[ii, -1 + jj])) - 30.0 * (p_Δ[ii, -2 + jj] + py_Δ[ii, -2 + jj] * v_Δ[ii, -2 + jj])) + 5.714285714285714 * (p_Δ[ii, -3 + jj] + py_Δ[ii, -3 + jj] * v_Δ[ii, -3 + jj])) - 0.5357142857142857 * (p_Δ[ii, -4 + jj] + py_Δ[ii, -4 + jj] * v_Δ[ii, -4 + jj])) - 120.0 * (p_Δ[ii, 1 + jj] + py_Δ[ii, 1 + jj] * v_Δ[ii, 1 + jj])) + 30.0 * (p_Δ[ii, 2 + jj] + py_Δ[ii, 2 + jj] * v_Δ[ii, 2 + jj])) - 5.714285714285714 * (p_Δ[ii, 3 + jj] + py_Δ[ii, 3 + jj] * v_Δ[ii, 3 + jj])) + 0.5357142857142857 * (p_Δ[ii, 4 + jj] + py_Δ[ii, 4 + jj] * v_Δ[ii, 4 + jj])
end
function f(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ)
    fooker!(get_backend(dpy_dt_Δ))(u_Δ, v_Δ, p_Δ, τ_yy_Δ, τ_xy_Δ, py_Δ, dpy_dt_Δ; ndrange = (100, 100))
end
n = 108
u_Δ = randn(n, n); v_Δ = randn(n, n); p_Δ = randn(n, n);
τ_yy_Δ = randn(n, n); τ_xy_Δ = randn(n, n); py_Δ = randn(n, n);
dpy_dt_Δ = zeros(n, n)
@btime f($u_Δ, $v_Δ, $p_Δ, $τ_yy_Δ, $τ_xy_Δ, $py_Δ, $dpy_dt_Δ)
dev = Reactant.to_rarray
ru_Δ_d = dev(u_Δ); rv_Δ_d = dev(v_Δ); rp_Δ_d = dev(p_Δ);
rτ_yy_Δ_d = dev(τ_yy_Δ); rτ_xy_Δ_d = dev(τ_xy_Δ); rpy_Δ_d = dev(py_Δ);
rdpy_dt_Δ_d = dev(dpy_dt_Δ)
fc = @compile raise=true f(ru_Δ_d, rv_Δ_d, rp_Δ_d, rτ_yy_Δ_d, rτ_xy_Δ_d, rpy_Δ_d, rdpy_dt_Δ_d)
@btime fc($ru_Δ_d, $rv_Δ_d, $rp_Δ_d, $rτ_yy_Δ_d, $rτ_xy_Δ_d, $rpy_Δ_d, $rdpy_dt_Δ_d)
For now we need to add CUDA in order to raise the kernel, but the GPU isn't used.
If you don't want to add CUDA, Reactant.jl can also execute StableHLO code, so I can give you the code generated by this; it should also work fine with the broadcast version, just not with the loop one.
Again, if you plan to parallelize a bigger process, you shouldn't really use this; it was mainly to show the biggest performance you can get, but that doesn't mean it's what you want.