Loop vectorization depends on number of terms in stencil

samtkaplan · July 29, 2019, 4:46pm

Hello,

I was motivated by the recent blog post about the new threading infrastructure in Julia 1.3-alpha to try writing a Julia stencil code that I would normally end up writing in C.

However, before getting a chance to work on threading, I ran into a loop vectorization issue. In particular, whether or not the loop vectorizes seems to depend on the size of the stencil.

To illustrate, I made two methods: 1) does_vectorize!(w,u) and 2) does_not_vectorize!(w,u). In both methods, we apply a stencil to u, saving the result in w. The only difference is that in the does_vectorize! case there are 18 points in the stencil while in the does_not_vectorize! case there are 19 points in the stencil.

Here is the code run with julia -O3 on both Julia 1.1.1 and Julia 1.3.0-alpha:

function does_vectorize!(w::Array{Float32,3},u::Array{Float32,3})
    nz,ny,nx = size(u)
    s = ntuple(i->rand(Float32),9)
    @inbounds begin
        for ix = 5:(nx-5), iy = 5:(ny-5)
            @simd for iz = 5:(nz-5)
                w[iz,iy,ix] =
                    s[1]*u[iz-4,iy,  ix  ] + s[2]*u[iz-3,iy,  ix  ] + s[3]*u[iz-2,iy,  ix  ] + s[4]*u[iz-1,iy,  ix  ] + s[5]*u[iz,iy,ix] + s[6]*u[iz+1,iy,  ix  ] + s[7]*u[iz+2,iy,  ix  ] + s[8]*u[iz+3,iy,ix    ] + s[9]*u[iz+4,iy,  ix  ] +
                    s[1]*u[iz,  iy-4,ix  ] + s[2]*u[iz,  iy-3,ix  ] + s[3]*u[iz,  iy-2,ix  ] + s[4]*u[iz,  iy-1,ix  ] + s[5]*u[iz,iy,ix] + s[6]*u[iz,  iy+1,ix  ] + s[7]*u[iz,  iy+2,ix  ] + s[8]*u[iz,  iy+3,ix  ]
            end
        end
    end
    nothing
end
io = open("does_vectorize.txt", "w")
code_llvm(io, does_vectorize!, (Array{Float32,3}, Array{Float32,3}))
close(io)

function does_not_vectorize!(w::Array{Float32,3},u::Array{Float32,3})
    nz,ny,nx = size(u)
    s = ntuple(i->rand(Float32),9)
    @inbounds begin
        for ix = 5:(nx-5), iy = 5:(ny-5)
            @simd for iz = 5:(nz-5)
                w[iz,iy,ix] =
                    s[1]*u[iz-4,iy,  ix  ] + s[2]*u[iz-3,iy,  ix  ] + s[3]*u[iz-2,iy,  ix  ] + s[4]*u[iz-1,iy,  ix  ] + s[5]*u[iz,iy,ix] + s[6]*u[iz+1,iy,  ix  ] + s[7]*u[iz+2,iy,  ix  ] + s[8]*u[iz+3,iy,ix    ] + s[9]*u[iz+4,iy,  ix  ] +
                    s[1]*u[iz,  iy-4,ix  ] + s[2]*u[iz,  iy-3,ix  ] + s[3]*u[iz,  iy-2,ix  ] + s[4]*u[iz,  iy-1,ix  ] + s[5]*u[iz,iy,ix] + s[6]*u[iz,  iy+1,ix  ] + s[7]*u[iz,  iy+2,ix  ] + s[8]*u[iz,  iy+3,ix  ] + s[9]*u[iz,  iy+4,ix  ]
            end
        end
    end
    nothing
end
io = open("does_not_vectorize.txt", "w")
code_llvm(io, does_not_vectorize!, (Array{Float32,3}, Array{Float32,3}))
close(io)

nz,ny,nx = 512,512,512
u = rand(Float32,nz,ny,nx)
w = zeros(Float32,nz,ny,nx)

using BenchmarkTools
@btime does_vectorize!(w,u)
@btime does_not_vectorize!(w,u)

and the output (similar results are obtained on Julia 1.1.1 and 1.3.0-alpha) from the @btime macro:

  217.150 ms (1 allocation: 48 bytes)
  1.304 s (1 allocation: 48 bytes)

I put the output from code_llvm into a gist[1]. The point is that does_vectorize.txt contains the expected vector.body: section whereas does_not_vectorize.txt does not. Can anyone reproduce this? Should I file a GitHub issue?

Thanks!

Sam

[1] https://gist.github.com/samtkaplan/0565ef0bc6d5007ed88635270950aaee

samtkaplan · July 30, 2019, 12:57pm

If I replace @simd with @simd ivdep the loop gets vectorized. I guess that means that LLVM fails to prove that the loop can be vectorized on its own? It is still not clear to me why LLVM can prove that the first loop can be vectorized where-as in the second case it can’t. So, I’m still wondering if this is a bug, or expected behavior?

I found https://github.com/JuliaLang/julia/issues/31482 useful when trying to understand the issue.

Topic		Replies	Views
LoopVectorization triggers segfault or deadlock for complex finite difference stencil Performance loopvectorization	1	459	July 27, 2021
Help understanding vectorization (or lack thereof) Performance	15	1212	June 8, 2018
LoopVectorization almost doubles execution time? Performance loopvectorization	6	655	July 9, 2021
A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd` Performance simd	8	1880	June 27, 2021
LoopVectorization for sparse matrix operation : @turbo, LoopVectorization.check_args Performance question	1	375	March 3, 2023

Loop vectorization depends on number of terms in stencil

Related topics