Vectorization of "complex" loops

Hi,

First and foremost, thanks to everyone involved in advancing Julia. I recently started testing it out, and I am having lots of fun :slight_smile:. I am not sure if this is the proper forum for this question, but since I do am a beginner, I figure this would be appropriate

I am trying out a code for CFD, and I am discretizing the Navier-Stokes equations using finite differences (using Symbolics to construct the expressions). My loop (which is placed within a function, which I checked is type stable) then looks like this:

@fastmath @inbounds for jj = jrange ;  @simd ivdep for  ii = irange
        dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
        
end
end

It’s ugly, but that is why I got it from Symbolics. Upon reviewing the code, I can see some vectorization, but not a significant amount. I would expect (perhaps naively) that the compiler would recognize that the same operations are being performed for ii =1, 2, 3, etc, and vectorize them, computing i=1:8 in β€œone go”. Instead, it seems to be trying to vectorize each loop iteration (which is harder).

I tried using @turbo or @avx, but this only resulted in MUCH slower code. Am I having unrealistic expectations, or just using the wrong macros?

Thanks for any insights :slight_smile:

I definitely wouldn’t expect autovectorization to try to work within an iteration, as opposed to across iterations. How exactly did you observe this, and more generally what did you mean by β€œreviewing the code”?

Maybe something to do with the SLP vectorizer pass? It may have some trouble with big expressions.
You could try to help it by typing each of the similar mul-adds that access memory sequentially on different lines, and/or try out combinations of / removing the various macros you used. In my experience, fast math doesn’t always help vectorisation.

In general, LLVM’s loop vectorizer isn’t very good (primarily because it tries to go inside to outside rather than outside to inside). We want to replace it with vplan eventually, but last time we checked it wasn’t ready for prime time.

So here is a minimum example: I am comparing the standard loop with the use of views, turbo , turbo and avx. For turbo I tried some of the arguments, but nothing did better (not even close to) the standard loop.

How exactly did you observe this, and more generally what did you mean by β€œreviewing the code”?

I checked @code_llvm and did not find (or mostly not) vectorized additions/multiplications.

Example:


using LoopVectorization, BenchmarkTools

f = @inline function (u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    jrange = 1:100 .+ 4
    irange = 1:100 .+ 4
    @inbounds for jj = jrange ;  
        @simd  for  ii = irange
            dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
        end; 
    end
end

fviews = @inline function (u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    jj = 1:100 .+ 4
    ii = 1:100 .+ 4
    @inbounds @views @. dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
end

 fdot = @inline function (u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    jj = 1:100 .+ 4
    ii = 1:100 .+ 4
    @inbounds @.  dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
end
 

fturbo = @inline function (u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    jrange = 1:100 .+ 4
    irange = 1:100 .+ 4
    @turbo for jj = jrange , ii = irange
            dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
    end
end

favx = @inline function (u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    jrange = 1:100 .+ 4
    irange = 1:100 .+ 4
    @avx for jj = jrange ,  ii = irange
            dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
    end
end

n = 108
u_Ξ” = randn(n,n)    ; v_Ξ”= randn(n,n)       ; p_Ξ”= randn(n,n); 
Ο„_yy_Ξ”= randn(n,n)  ; Ο„_xy_Ξ”= randn(n,n)    ; 
py_Ξ”= randn(n,n)    ; dpy_dt_Ξ”= randn(n,n)  ;

print("Baseline...\n")
@btime f($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
print("Views...\n")
@btime fviews($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
print("dot...\n")
@btime fdot($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
print("Turbo...\n")
@btime fturbo($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
print("Avx...\n")
@btime favx($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);

@code_warntype f(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”);
@code_llvm f(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”);









Hum… is there a way I β€œforce” it to, at least when the loop is somewhat easy to parallelize, as this one? Or should I just wait for the compiler to improve?

I am not desperate for the maximum performance now, so that would not be a big issue, and I can always export this as a C or Fortran function and call it: but I really want to avoid the multiple languages paradigm as much as I can…

Do you want to make it parallel or is it part of a bigger parallel workflow ?
the best you can get using fully parallel code + simd compared to the simple version leads to this

julia> @btime fb($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
  19.515 ΞΌs (0 allocations: 0 bytes)

julia> @btime $fc($ru_Ξ”, $rv_Ξ”, $rp_Ξ”, $rΟ„_yy_Ξ”, $rΟ„_xy_Ξ”, $rpy_Ξ”, $rdpy_dt_Ξ”);
  6.198 ΞΌs (18 allocations: 576 bytes)

julia> @btime f($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”);
  11.809 ΞΌs (104 allocations: 8.58 KiB)

where fb is the function with simd, inbounds and I also added fastmath while the fc is harder (compiled to MLIR code on cpu) but parallel, the last one is paralel with julia so I guess its fine you’re not too far for a single threaded version. Those are overkill obviously but it gives a hint thtat going lower may not be possible.
You can still go to SIMD.jl and give a try but it will be hard. Also try with fortran -O3 just to see how mush of a range you have in term of perf
this is what I get

PS C:\Users\yolha\Desktop\juju_tests\MITNote> gfortran -O3 -ffree-line-length-none main.f90 -o fb.exe
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
 Median execution time (s):    7.9800000000000002E-005
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
 Median execution time (s):    8.4400000000000005E-005
PS C:\Users\yolha\Desktop\juju_tests\MITNote> gfortran -O3 -march=native -funroll-loops -ftree-vectorize -ffast-math -ffree-line-length-none  main.f90 -o fb.exe
PS C:\Users\yolha\Desktop\juju_tests\MITNote> ./fb
 Samples (       10000 ), evals per sample: 1
 Min (s):      2.3899999999999998E-005
 Median (s):   2.5100000000000000E-005
 Mean (s):     2.6278129999999912E-005
 Max (s):      4.2549999999999999E-004

I’m similar with julia here in single threaded, I would continue with

function fb(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
           jrange = (1:100) .+ 4
           irange = (1:100) .+ 4
           for jj = jrange ;
               @inbounds @fastmath @simd ivdep for  ii = irange
                   dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
               end;
           end
       end

I think you probably meant to write (1:100) .+ 4 instead, what you wrote is equivalent to 1:(100 .+ 4)

Are you sure your baseline isn’t already vectorized? Looking at @code_llvm f(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”); on my zen 5 machine, I already see a bunch of <8 x double>s:

vector.body:                                      ; preds = %vector.body, %L2
; β”‚ @ simdloop.jl:77 within `macro expansion` @ REPL[27]:6
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %index = phi i64 [ %323, %vector.body ], [ 0, %L2 ]
        %290 = or i64 %index, 3
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %291 = add i64 %290, %149
    %292 = getelementptr inbounds double, ptr %14, i64 %291
    %wide.load = load <8 x double>, ptr %292, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %293 = fmul <8 x double> %wide.load, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %294 = or i64 %index, 2
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %295 = add i64 %294, %149
    %296 = getelementptr inbounds double, ptr %14, i64 %295
    %wide.load3189 = load <8 x double>, ptr %296, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %297 = fmul <8 x double> %wide.load3189, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %298 = fsub <8 x double> %297, %293
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %299 = or i64 %index, 1
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %300 = add i64 %299, %149
    %301 = getelementptr inbounds double, ptr %14, i64 %300
    %wide.load3190 = load <8 x double>, ptr %301, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %302 = fmul <8 x double> %wide.load3190, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %303 = fsub <8 x double> %298, %302
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %304 = add i64 %149, %index
    %305 = getelementptr inbounds double, ptr %14, i64 %304
    %wide.load3191 = load <8 x double>, ptr %305, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %306 = fmul <8 x double> %wide.load3191, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %307 = or i64 %index, 5
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %308 = add i64 %307, %149
    %309 = getelementptr inbounds double, ptr %14, i64 %308
    %wide.load3192 = load <8 x double>, ptr %309, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %310 = fmul <8 x double> %wide.load3192, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ operators.jl:596 within `+` @ float.jl:491
    %311 = fadd <8 x double> %303, %306
    %312 = fadd <8 x double> %311, %310
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %313 = or i64 %index, 6
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %314 = add i64 %313, %149
    %315 = getelementptr inbounds double, ptr %14, i64 %314
    %wide.load3193 = load <8 x double>, ptr %315, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %316 = fmul <8 x double> %wide.load3193, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %317 = fsub <8 x double> %312, %316
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %318 = or i64 %index, 7
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %319 = add i64 %318, %149
    %320 = getelementptr inbounds double, ptr %14, i64 %319
    %wide.load3194 = load <8 x double>, ptr %320, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %321 = fmul <8 x double> %wide.load3194, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %322 = fadd <8 x double> %317, %321
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %323 = add nuw i64 %index, 8
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %324 = add i64 %323, %149
    %325 = getelementptr inbounds double, ptr %14, i64 %324
    %wide.load3195 = load <8 x double>, ptr %325, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %326 = fmul <8 x double> %wide.load3195, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %327 = fsub <8 x double> %322, %326
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %328 = or i64 %index, 4
; β”‚β”‚β””β””β””β””
; β”‚β”‚ @ array.jl:930 within `getindex` @ essentials.jl:917
    %329 = add i64 %328, %151
    %330 = getelementptr inbounds double, ptr %16, i64 %329
    %wide.load3196 = load <8 x double>, ptr %330, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %331 = fmul <8 x double> %wide.load3196, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %332 = fsub <8 x double> %327, %331
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %333 = add i64 %328, %153
    %334 = getelementptr inbounds double, ptr %16, i64 %333
    %wide.load3197 = load <8 x double>, ptr %334, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %335 = fmul <8 x double> %wide.load3197, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %336 = fadd <8 x double> %332, %335
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %337 = add i64 %328, %155
    %338 = getelementptr inbounds double, ptr %16, i64 %337
    %wide.load3198 = load <8 x double>, ptr %338, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %339 = fmul <8 x double> %wide.load3198, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %340 = fsub <8 x double> %336, %339
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %341 = add i64 %328, %157
    %342 = getelementptr inbounds double, ptr %16, i64 %341
    %wide.load3199 = load <8 x double>, ptr %342, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %343 = fmul <8 x double> %wide.load3199, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %344 = add i64 %328, %158
    %345 = getelementptr inbounds double, ptr %16, i64 %344
    %wide.load3200 = load <8 x double>, ptr %345, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %346 = fmul <8 x double> %wide.load3200, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ operators.jl:596 within `+` @ float.jl:491
    %347 = fadd <8 x double> %340, %343
    %348 = fadd <8 x double> %347, %346
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %349 = add i64 %328, %160
    %350 = getelementptr inbounds double, ptr %16, i64 %349
    %wide.load3201 = load <8 x double>, ptr %350, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %351 = fmul <8 x double> %wide.load3201, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %352 = fsub <8 x double> %348, %351
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %353 = add i64 %328, %162
    %354 = getelementptr inbounds double, ptr %16, i64 %353
    %wide.load3202 = load <8 x double>, ptr %354, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %355 = fmul <8 x double> %wide.load3202, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %356 = fadd <8 x double> %352, %355
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %357 = add i64 %328, %164
    %358 = getelementptr inbounds double, ptr %16, i64 %357
    %wide.load3203 = load <8 x double>, ptr %358, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %359 = fmul <8 x double> %wide.load3203, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %360 = fsub <8 x double> %356, %359
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %361 = add i64 %290, %165
    %362 = getelementptr inbounds double, ptr %18, i64 %361
    %wide.load3204 = load <8 x double>, ptr %362, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %363 = fmul <8 x double> %wide.load3204, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %364 = add i64 %290, %166
    %365 = getelementptr inbounds double, ptr %20, i64 %364
    %wide.load3205 = load <8 x double>, ptr %365, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %366 = fmul <8 x double> %363, %wide.load3205
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %367 = fadd <8 x double> %360, %366
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %368 = add i64 %294, %165
    %369 = getelementptr inbounds double, ptr %18, i64 %368
    %wide.load3206 = load <8 x double>, ptr %369, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %370 = fmul <8 x double> %wide.load3206, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %371 = add i64 %294, %166
    %372 = getelementptr inbounds double, ptr %20, i64 %371
    %wide.load3207 = load <8 x double>, ptr %372, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %373 = fmul <8 x double> %370, %wide.load3207
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %374 = fsub <8 x double> %367, %373
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %375 = add i64 %299, %165
    %376 = getelementptr inbounds double, ptr %18, i64 %375
    %wide.load3208 = load <8 x double>, ptr %376, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %377 = fmul <8 x double> %wide.load3208, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %378 = add i64 %299, %166
    %379 = getelementptr inbounds double, ptr %20, i64 %378
    %wide.load3209 = load <8 x double>, ptr %379, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %380 = fmul <8 x double> %377, %wide.load3209
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %381 = fadd <8 x double> %374, %380
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %382 = add i64 %165, %index
    %383 = getelementptr inbounds double, ptr %18, i64 %382
    %wide.load3210 = load <8 x double>, ptr %383, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %384 = fmul <8 x double> %wide.load3210, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %385 = add i64 %166, %index
    %386 = getelementptr inbounds double, ptr %20, i64 %385
    %wide.load3211 = load <8 x double>, ptr %386, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %387 = fmul <8 x double> %384, %wide.load3211
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %388 = fsub <8 x double> %381, %387
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %389 = add i64 %307, %165
    %390 = getelementptr inbounds double, ptr %18, i64 %389
    %wide.load3212 = load <8 x double>, ptr %390, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %391 = fmul <8 x double> %wide.load3212, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %392 = add i64 %307, %166
    %393 = getelementptr inbounds double, ptr %20, i64 %392
    %wide.load3213 = load <8 x double>, ptr %393, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %394 = fmul <8 x double> %391, %wide.load3213
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %395 = fsub <8 x double> %388, %394
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %396 = add i64 %313, %165
    %397 = getelementptr inbounds double, ptr %18, i64 %396
    %wide.load3214 = load <8 x double>, ptr %397, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %398 = fmul <8 x double> %wide.load3214, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %399 = add i64 %313, %166
    %400 = getelementptr inbounds double, ptr %20, i64 %399
    %wide.load3215 = load <8 x double>, ptr %400, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %401 = fmul <8 x double> %398, %wide.load3215
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %402 = fadd <8 x double> %395, %401
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %403 = add i64 %318, %165
    %404 = getelementptr inbounds double, ptr %18, i64 %403
    %wide.load3216 = load <8 x double>, ptr %404, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %405 = fmul <8 x double> %wide.load3216, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %406 = add i64 %318, %166
    %407 = getelementptr inbounds double, ptr %20, i64 %406
    %wide.load3217 = load <8 x double>, ptr %407, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %408 = fmul <8 x double> %405, %wide.load3217
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %409 = fsub <8 x double> %402, %408
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %410 = add i64 %323, %165
    %411 = getelementptr inbounds double, ptr %18, i64 %410
    %wide.load3218 = load <8 x double>, ptr %411, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %412 = fmul <8 x double> %wide.load3218, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %413 = add i64 %323, %166
    %414 = getelementptr inbounds double, ptr %20, i64 %413
    %wide.load3219 = load <8 x double>, ptr %414, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %415 = fmul <8 x double> %412, %wide.load3219
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %416 = add i64 %328, %167
    %417 = getelementptr inbounds double, ptr %22, i64 %416
    %wide.load3220 = load <8 x double>, ptr %417, align 8
    %418 = add i64 %328, %168
    %419 = getelementptr inbounds double, ptr %18, i64 %418
    %wide.load3221 = load <8 x double>, ptr %419, align 8
    %420 = add i64 %328, %169
    %421 = getelementptr inbounds double, ptr %24, i64 %420
    %wide.load3222 = load <8 x double>, ptr %421, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %422 = fmul <8 x double> %wide.load3221, %wide.load3222
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %423 = fadd <8 x double> %wide.load3220, %422
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %424 = fmul <8 x double> %423, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ operators.jl:596 within `+` @ float.jl:491
    %425 = fadd <8 x double> %409, %415
    %426 = fadd <8 x double> %425, %424
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %427 = add i64 %328, %170
    %428 = getelementptr inbounds double, ptr %22, i64 %427
    %wide.load3223 = load <8 x double>, ptr %428, align 8
    %429 = add i64 %328, %171
    %430 = getelementptr inbounds double, ptr %18, i64 %429
    %wide.load3224 = load <8 x double>, ptr %430, align 8
    %431 = add i64 %328, %172
    %432 = getelementptr inbounds double, ptr %24, i64 %431
    %wide.load3225 = load <8 x double>, ptr %432, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %433 = fmul <8 x double> %wide.load3224, %wide.load3225
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %434 = fadd <8 x double> %wide.load3223, %433
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %435 = fmul <8 x double> %434, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %436 = fsub <8 x double> %426, %435
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %437 = add i64 %328, %173
    %438 = getelementptr inbounds double, ptr %22, i64 %437
    %wide.load3226 = load <8 x double>, ptr %438, align 8
    %439 = add i64 %328, %174
    %440 = getelementptr inbounds double, ptr %18, i64 %439
    %wide.load3227 = load <8 x double>, ptr %440, align 8
    %441 = add i64 %328, %175
    %442 = getelementptr inbounds double, ptr %24, i64 %441
    %wide.load3228 = load <8 x double>, ptr %442, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %443 = fmul <8 x double> %wide.load3227, %wide.load3228
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %444 = fadd <8 x double> %wide.load3226, %443
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %445 = fmul <8 x double> %444, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %446 = fadd <8 x double> %436, %445
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %447 = add i64 %328, %176
    %448 = getelementptr inbounds double, ptr %22, i64 %447
    %wide.load3229 = load <8 x double>, ptr %448, align 8
    %449 = add i64 %328, %177
    %450 = getelementptr inbounds double, ptr %18, i64 %449
    %wide.load3230 = load <8 x double>, ptr %450, align 8
    %451 = add i64 %328, %178
    %452 = getelementptr inbounds double, ptr %24, i64 %451
    %wide.load3231 = load <8 x double>, ptr %452, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %453 = fmul <8 x double> %wide.load3230, %wide.load3231
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %454 = fadd <8 x double> %wide.load3229, %453
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %455 = fmul <8 x double> %454, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %456 = fsub <8 x double> %446, %455
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %457 = add i64 %328, %179
    %458 = getelementptr inbounds double, ptr %22, i64 %457
    %wide.load3232 = load <8 x double>, ptr %458, align 8
    %459 = add i64 %328, %180
    %460 = getelementptr inbounds double, ptr %18, i64 %459
    %wide.load3233 = load <8 x double>, ptr %460, align 8
    %461 = add i64 %328, %181
    %462 = getelementptr inbounds double, ptr %24, i64 %461
    %wide.load3234 = load <8 x double>, ptr %462, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %463 = fmul <8 x double> %wide.load3233, %wide.load3234
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %464 = fadd <8 x double> %wide.load3232, %463
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %465 = fmul <8 x double> %464, <double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02, double 1.200000e+02>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %466 = fsub <8 x double> %456, %465
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %467 = add i64 %328, %182
    %468 = getelementptr inbounds double, ptr %22, i64 %467
    %wide.load3235 = load <8 x double>, ptr %468, align 8
    %469 = add i64 %328, %183
    %470 = getelementptr inbounds double, ptr %18, i64 %469
    %wide.load3236 = load <8 x double>, ptr %470, align 8
    %471 = add i64 %328, %184
    %472 = getelementptr inbounds double, ptr %24, i64 %471
    %wide.load3237 = load <8 x double>, ptr %472, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %473 = fmul <8 x double> %wide.load3236, %wide.load3237
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %474 = fadd <8 x double> %wide.load3235, %473
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %475 = fmul <8 x double> %474, <double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01, double 3.000000e+01>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %476 = fadd <8 x double> %466, %475
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %477 = add i64 %328, %185
    %478 = getelementptr inbounds double, ptr %22, i64 %477
    %wide.load3238 = load <8 x double>, ptr %478, align 8
    %479 = add i64 %328, %186
    %480 = getelementptr inbounds double, ptr %18, i64 %479
    %wide.load3239 = load <8 x double>, ptr %480, align 8
    %481 = add i64 %328, %187
    %482 = getelementptr inbounds double, ptr %24, i64 %481
    %wide.load3240 = load <8 x double>, ptr %482, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %483 = fmul <8 x double> %wide.load3239, %wide.load3240
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %484 = fadd <8 x double> %wide.load3238, %483
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %485 = fmul <8 x double> %484, <double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7, double 0x4016DB6DB6DB6DB7>
; β”‚β””
; β”‚β”Œ @ float.jl:492 within `-`
    %486 = fsub <8 x double> %476, %485
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex` @ essentials.jl:917
    %487 = add i64 %328, %188
    %488 = getelementptr inbounds double, ptr %22, i64 %487
    %wide.load3241 = load <8 x double>, ptr %488, align 8
    %489 = add i64 %328, %189
    %490 = getelementptr inbounds double, ptr %18, i64 %489
    %wide.load3242 = load <8 x double>, ptr %490, align 8
    %491 = add i64 %328, %190
    %492 = getelementptr inbounds double, ptr %24, i64 %491
    %wide.load3243 = load <8 x double>, ptr %492, align 8
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %493 = fmul <8 x double> %wide.load3242, %wide.load3243
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %494 = fadd <8 x double> %wide.load3241, %493
; β”‚β””
; β”‚β”Œ @ float.jl:493 within `*`
    %495 = fmul <8 x double> %494, <double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249, double 0x3FE1249249249249>
; β”‚β””
; β”‚β”Œ @ float.jl:491 within `+`
    %496 = fadd <8 x double> %486, %495
; β”‚β””
; β”‚β”Œ @ array.jl:994 within `setindex!`
    %497 = add i64 %328, %191
    %498 = getelementptr inbounds double, ptr %25, i64 %497
    store <8 x double> %496, ptr %498, align 8
; β”‚β””
; β”‚β”Œ @ array.jl:930 within `getindex`
; β”‚β”‚β”Œ @ abstractarray.jl:1347 within `_to_linear_index`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:3048 within `_sub2ind` @ abstractarray.jl:3064
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:3080 within `_sub2ind_recurse` @ abstractarray.jl:3080
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:87 within `+`
        %499 = icmp eq i64 %323, 9600
        br i1 %499, label %L3410, label %vector.body

I saw quite a lot in the baselind @simd method. That’s system-dependent though because I’m seeing <4 x double>. The thing is there’s a vectorized (%vector.body) and a non-vectorized branch (%L9), and I don’t know what the check is doing or which branch is running for the example’s inputs.

You’re making an anonymous function and assigning it to a non-const global variable, which is not the behavior of the idiomatic function f(.... In general, Julia cannot optimize non-const global variables without helpful type annotations, it’s the 2nd entry of the Performance Tips. The method runs long enough that this doesn’t skew the benchmarks much for me, but shorter methods or calling this method in another method would incur hefty performance penalties.

1 Like

This is expected. The scalar code is for the tail which isn’t a multiple of the vectorization size.

1 Like

ah… that can be the case. My terminal truncated the code_llvm output, and I think I was only seeing the tail. I spotted a few 4x doubles now!

Cool ! Can you point me to some reference for MLIR? I want a serial version of the function (I am parallelizing calls to it), but a 2x speed up (what you got compared from parallel with julia) would be cool.

Sounds feasible. I will give it a try at a later stage to compare at least! Thanks for the suggestion!

One last overall question: any idea why turbo fails so catastrophically?

Reactant.jl, I used KernelAbstraction.jl to write the kernel and used Reactant.jl to @compile raise=true

Nice! I actually tried reactant.jl (without kernel abstraction), but didn’t got any speed up.

Would you mind sharing your script? I am actually intending to scale up the code, so a factor of 2 in performance is definitely something I would like to look into!!!

yep

using Reactant,KernelAbstractions
using CUDA # this is only to be able to raise the kernel the gpu isn't used in this example
using BenchmarkTools
Reactant.set_default_backend("cpu")

@kernel inbounds=true function fooker!(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    i,j = @index(Global,NTuple)
    ii = i+4
    jj = j +4
    @fastmath dpy_dt_Ξ”[ii, jj] = (((((((((((((((((((((((((((-120.0 * Ο„_xy_Ξ”[-1 + ii, jj] + 30.0 * Ο„_xy_Ξ”[-2 + ii, jj]) - 5.714285714285714 * Ο„_xy_Ξ”[-3 + ii, jj]) + 0.5357142857142857 * Ο„_xy_Ξ”[-4 + ii, jj] + 120.0 * Ο„_xy_Ξ”[1 + ii, jj]) - 30.0 * Ο„_xy_Ξ”[2 + ii, jj]) + 5.714285714285714 * Ο„_xy_Ξ”[3 + ii, jj]) - 0.5357142857142857 * Ο„_xy_Ξ”[4 + ii, jj]) - 120.0 * Ο„_yy_Ξ”[ii, -1 + jj]) + 30.0 * Ο„_yy_Ξ”[ii, -2 + jj]) - 5.714285714285714 * Ο„_yy_Ξ”[ii, -3 + jj]) + 0.5357142857142857 * Ο„_yy_Ξ”[ii, -4 + jj] + 120.0 * Ο„_yy_Ξ”[ii, 1 + jj]) - 30.0 * Ο„_yy_Ξ”[ii, 2 + jj]) + 5.714285714285714 * Ο„_yy_Ξ”[ii, 3 + jj]) - 0.5357142857142857 * Ο„_yy_Ξ”[ii, 4 + jj]) + (120.0 * py_Ξ”[-1 + ii, jj]) * u_Ξ”[-1 + ii, jj]) - (30.0 * py_Ξ”[-2 + ii, jj]) * u_Ξ”[-2 + ii, jj]) + (5.714285714285714 * py_Ξ”[-3 + ii, jj]) * u_Ξ”[-3 + ii, jj]) - (0.5357142857142857 * py_Ξ”[-4 + ii, jj]) * u_Ξ”[-4 + ii, jj]) - (120.0 * py_Ξ”[1 + ii, jj]) * u_Ξ”[1 + ii, jj]) + (30.0 * py_Ξ”[2 + ii, jj]) * u_Ξ”[2 + ii, jj]) - (5.714285714285714 * py_Ξ”[3 + ii, jj]) * u_Ξ”[3 + ii, jj]) + (0.5357142857142857 * py_Ξ”[4 + ii, jj]) * u_Ξ”[4 + ii, jj] + 120.0 * (p_Ξ”[ii, -1 + jj] + py_Ξ”[ii, -1 + jj] * v_Ξ”[ii, -1 + jj])) - 30.0 * (p_Ξ”[ii, -2 + jj] + py_Ξ”[ii, -2 + jj] * v_Ξ”[ii, -2 + jj])) + 5.714285714285714 * (p_Ξ”[ii, -3 + jj] + py_Ξ”[ii, -3 + jj] * v_Ξ”[ii, -3 + jj])) - 0.5357142857142857 * (p_Ξ”[ii, -4 + jj] + py_Ξ”[ii, -4 + jj] * v_Ξ”[ii, -4 + jj])) - 120.0 * (p_Ξ”[ii, 1 + jj] + py_Ξ”[ii, 1 + jj] * v_Ξ”[ii, 1 + jj])) + 30.0 * (p_Ξ”[ii, 2 + jj] + py_Ξ”[ii, 2 + jj] * v_Ξ”[ii, 2 + jj])) - 5.714285714285714 * (p_Ξ”[ii, 3 + jj] + py_Ξ”[ii, 3 + jj] * v_Ξ”[ii, 3 + jj])) + 0.5357142857142857 * (p_Ξ”[ii, 4 + jj] + py_Ξ”[ii, 4 + jj] * v_Ξ”[ii, 4 + jj])
end

function f(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”)
    fooker!(get_backend(dpy_dt_Ξ”))(u_Ξ”, v_Ξ”, p_Ξ”, Ο„_yy_Ξ”, Ο„_xy_Ξ”, py_Ξ”, dpy_dt_Ξ”;ndrange = (100,100))
end


n = 108
u_Ξ” = randn(n,n)    ; v_Ξ”= randn(n,n)       ; p_Ξ”= randn(n,n);
Ο„_yy_Ξ”= randn(n,n)  ; Ο„_xy_Ξ”= randn(n,n)   ; py_Ξ”= randn(n,n);
dpy_dt_Ξ” = zeros(n,n)

@btime f($u_Ξ”, $v_Ξ”, $p_Ξ”, $Ο„_yy_Ξ”, $Ο„_xy_Ξ”, $py_Ξ”, $dpy_dt_Ξ”)

dev = Reactant.to_rarray
ru_Ξ”_d = dev(u_Ξ”)    ; rv_Ξ”_d= dev(v_Ξ”)       ; rp_Ξ”_d= dev(p_Ξ”);
rΟ„_yy_Ξ”_d= dev(Ο„_yy_Ξ”)  ; rΟ„_xy_Ξ”_d= dev(Ο„_xy_Ξ”)   ; rpy_Ξ”_d= dev(py_Ξ”);
rdpy_dt_Ξ”_d = dev(dpy_dt_Ξ”)

fc = @compile raise=true f(ru_Ξ”_d, rv_Ξ”_d, rp_Ξ”_d, rΟ„_yy_Ξ”_d, rΟ„_xy_Ξ”_d, rpy_Ξ”_d, rdpy_dt_Ξ”_d)
@btime fc($ru_Ξ”_d, $rv_Ξ”_d, $rp_Ξ”_d, $rΟ„_yy_Ξ”_d, $rΟ„_xy_Ξ”_d, $rpy_Ξ”_d, $rdpy_dt_Ξ”_d)

for now we need to add CUDA in order to raise the kernel but the gpu isn’t used.
If you don’t want to add CUDA, Reactant.jl can also execute stablehlo code so I give you the one generated by this, also it should be fine with the broadcast one too just not with the loop

julia> @code_hlo raise=true f(ru_Ξ”_d, rv_Ξ”_d, rp_Ξ”_d, rΟ„_yy_Ξ”_d, rΟ„_xy_Ξ”_d, rpy_Ξ”_d, rdpy_dt_Ξ”_d)
module @reactant_f attributes {mhlo.num_partitions = 1 : i64, mhlo.num_replicas = 1 : i64} {
  llvm.module_flags [#llvm.mlir.module_flag<warning, "Dwarf Version", 2 : i32>, #llvm.mlir.module_flag<warning, "Debug Info Version", 3 : i32>]
  func.func @main(%arg0: tensor<108x108xf64>, %arg1: tensor<108x108xf64>, %arg2: tensor<108x108xf64>, %arg3: tensor<108x108xf64>, %arg4: tensor<108x108xf64>, %arg5: tensor<108x108xf64>, %arg6: tensor<108x108xf64> {tf.aliasing_output = 0 : i32}) -> tensor<108x108xf64> attributes {enzymexla.memory_effects = []} {
    %c = stablehlo.constant dense<4> : tensor<i32>
    %cst = stablehlo.constant dense<1.200000e+02> : tensor<100x100xf64>
    %cst_0 = stablehlo.constant dense<5.7142857142857144> : tensor<100x100xf64>
    %cst_1 = stablehlo.constant dense<3.000000e+01> : tensor<100x100xf64>
    %cst_2 = stablehlo.constant dense<0.5357142857142857> : tensor<100x100xf64>
    %0 = stablehlo.slice %arg3 [0:100, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %1 = stablehlo.slice %arg2 [0:100, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %2 = stablehlo.slice %arg5 [0:100, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %3 = stablehlo.slice %arg1 [0:100, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %4 = stablehlo.slice %arg3 [1:101, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %5 = stablehlo.slice %arg2 [1:101, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %6 = stablehlo.slice %arg5 [1:101, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %7 = stablehlo.slice %arg1 [1:101, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %8 = stablehlo.slice %arg3 [2:102, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %9 = stablehlo.slice %arg2 [2:102, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %10 = stablehlo.slice %arg5 [2:102, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %11 = stablehlo.slice %arg1 [2:102, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %12 = stablehlo.slice %arg3 [3:103, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %13 = stablehlo.slice %arg2 [3:103, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %14 = stablehlo.slice %arg5 [3:103, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %15 = stablehlo.slice %arg1 [3:103, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %16 = stablehlo.slice %arg4 [4:104, 0:100] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %17 = stablehlo.slice %arg5 [4:104, 0:100] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %18 = stablehlo.slice %arg0 [4:104, 0:100] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %19 = stablehlo.slice %arg4 [4:104, 1:101] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %20 = stablehlo.slice %arg5 [4:104, 1:101] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %21 = stablehlo.slice %arg0 [4:104, 1:101] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %22 = stablehlo.slice %arg4 [4:104, 2:102] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %23 = stablehlo.slice %arg5 [4:104, 2:102] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %24 = stablehlo.slice %arg0 [4:104, 2:102] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %25 = stablehlo.slice %arg4 [4:104, 3:103] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %26 = stablehlo.slice %arg5 [4:104, 3:103] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %27 = stablehlo.slice %arg0 [4:104, 3:103] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %28 = stablehlo.slice %arg4 [4:104, 5:105] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %29 = stablehlo.slice %arg5 [4:104, 5:105] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %30 = stablehlo.slice %arg0 [4:104, 5:105] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %31 = stablehlo.slice %arg4 [4:104, 6:106] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %32 = stablehlo.slice %arg5 [4:104, 6:106] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %33 = stablehlo.slice %arg0 [4:104, 6:106] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %34 = stablehlo.slice %arg4 [4:104, 7:107] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %35 = stablehlo.slice %arg5 [4:104, 7:107] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %36 = stablehlo.slice %arg0 [4:104, 7:107] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %37 = stablehlo.slice %arg4 [4:104, 8:108] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %38 = stablehlo.slice %arg5 [4:104, 8:108] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %39 = stablehlo.slice %arg0 [4:104, 8:108] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %40 = stablehlo.slice %arg3 [5:105, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %41 = stablehlo.slice %arg2 [5:105, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %42 = stablehlo.slice %arg5 [5:105, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %43 = stablehlo.slice %arg1 [5:105, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %44 = stablehlo.slice %arg3 [6:106, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %45 = stablehlo.slice %arg2 [6:106, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %46 = stablehlo.slice %arg5 [6:106, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %47 = stablehlo.slice %arg1 [6:106, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %48 = stablehlo.slice %arg3 [7:107, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %49 = stablehlo.slice %arg2 [7:107, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %50 = stablehlo.slice %arg5 [7:107, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %51 = stablehlo.slice %arg1 [7:107, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %52 = stablehlo.slice %arg3 [8:108, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %53 = stablehlo.slice %arg2 [8:108, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %54 = stablehlo.slice %arg5 [8:108, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %55 = stablehlo.slice %arg1 [8:108, 4:104] : (tensor<108x108xf64>) -> tensor<100x100xf64>
    %56 = stablehlo.multiply %27, %26 : tensor<100x100xf64>
    %57 = stablehlo.multiply %24, %23 : tensor<100x100xf64>
    %58 = stablehlo.multiply %21, %20 : tensor<100x100xf64>
    %59 = stablehlo.multiply %18, %17 : tensor<100x100xf64>
    %60 = stablehlo.multiply %30, %29 : tensor<100x100xf64>
    %61 = stablehlo.multiply %33, %32 : tensor<100x100xf64>
    %62 = stablehlo.multiply %36, %35 : tensor<100x100xf64>
    %63 = stablehlo.multiply %39, %38 : tensor<100x100xf64>
    %64 = stablehlo.multiply %15, %14 : tensor<100x100xf64>
    %65 = stablehlo.multiply %11, %10 : tensor<100x100xf64>
    %66 = stablehlo.add %65, %9 : tensor<100x100xf64>
    %67 = stablehlo.multiply %7, %6 : tensor<100x100xf64>
    %68 = stablehlo.multiply %3, %2 : tensor<100x100xf64>
    %69 = stablehlo.add %68, %1 : tensor<100x100xf64>
    %70 = stablehlo.multiply %43, %42 : tensor<100x100xf64>
    %71 = stablehlo.add %70, %41 : tensor<100x100xf64>
    %72 = stablehlo.multiply %47, %46 : tensor<100x100xf64>
    %73 = stablehlo.multiply %51, %50 : tensor<100x100xf64>
    %74 = stablehlo.add %73, %49 : tensor<100x100xf64>
    %75 = stablehlo.multiply %55, %54 : tensor<100x100xf64>
    %76 = stablehlo.subtract %28, %25 : tensor<100x100xf64>
    %77 = stablehlo.subtract %16, %37 : tensor<100x100xf64>
    %78 = stablehlo.add %77, %0 : tensor<100x100xf64>
    %79 = stablehlo.subtract %78, %52 : tensor<100x100xf64>
    %80 = stablehlo.subtract %79, %59 : tensor<100x100xf64>
    %81 = stablehlo.add %80, %63 : tensor<100x100xf64>
    %82 = stablehlo.subtract %81, %69 : tensor<100x100xf64>
    %83 = stablehlo.add %82, %53 : tensor<100x100xf64>
    %84 = stablehlo.add %83, %75 : tensor<100x100xf64>
    %85 = stablehlo.multiply %84, %cst_2 : tensor<100x100xf64>
    %86 = stablehlo.subtract %22, %31 : tensor<100x100xf64>
    %87 = stablehlo.add %86, %8 : tensor<100x100xf64>
    %88 = stablehlo.subtract %87, %44 : tensor<100x100xf64>
    %89 = stablehlo.subtract %88, %57 : tensor<100x100xf64>
    %90 = stablehlo.add %89, %61 : tensor<100x100xf64>
    %91 = stablehlo.subtract %90, %66 : tensor<100x100xf64>
    %92 = stablehlo.add %91, %45 : tensor<100x100xf64>
    %93 = stablehlo.add %92, %72 : tensor<100x100xf64>
    %94 = stablehlo.multiply %93, %cst_1 : tensor<100x100xf64>
    %95 = stablehlo.subtract %34, %19 : tensor<100x100xf64>
    %96 = stablehlo.subtract %95, %4 : tensor<100x100xf64>
    %97 = stablehlo.add %96, %48 : tensor<100x100xf64>
    %98 = stablehlo.add %97, %58 : tensor<100x100xf64>
    %99 = stablehlo.subtract %98, %62 : tensor<100x100xf64>
    %100 = stablehlo.add %99, %5 : tensor<100x100xf64>
    %101 = stablehlo.add %100, %67 : tensor<100x100xf64>
    %102 = stablehlo.subtract %101, %74 : tensor<100x100xf64>
    %103 = stablehlo.multiply %102, %cst_0 : tensor<100x100xf64>
    %104 = stablehlo.subtract %76, %12 : tensor<100x100xf64>
    %105 = stablehlo.add %104, %40 : tensor<100x100xf64>
    %106 = stablehlo.add %105, %56 : tensor<100x100xf64>
    %107 = stablehlo.subtract %106, %60 : tensor<100x100xf64>
    %108 = stablehlo.add %107, %13 : tensor<100x100xf64>
    %109 = stablehlo.add %108, %64 : tensor<100x100xf64>
    %110 = stablehlo.subtract %109, %71 : tensor<100x100xf64>
    %111 = stablehlo.multiply %110, %cst : tensor<100x100xf64>
    %112 = stablehlo.add %94, %111 : tensor<100x100xf64>
    %113 = stablehlo.add %112, %103 : tensor<100x100xf64>
    %114 = stablehlo.add %113, %85 : tensor<100x100xf64>
    %115 = stablehlo.dynamic_update_slice %arg6, %114, %c, %c : (tensor<108x108xf64>, tensor<100x100xf64>, tensor<i32>, tensor<i32>) -> tensor<108x108xf64>
    return %115 : tensor<108x108xf64>
  }
}

Again if you planned to parallel a bigger process you shouldn’t really use that, it was mainly to show the biggest perf you can get but it doesn’t mean its what you want