# Loop vectorization depends on number of terms in stencil

Hello,

I was motivated by the recent blog post about the new threading infrastructure in Julia 1.3-alpha to try writing a Julia stencil code that I would normally end up writing in C.

However, before getting a chance to work on threading, I ran into a loop vectorization issue. In particular, whether or not the loop vectorizes seems to depend on the size of the stencil.

To illustrate, I made two methods: 1) `does_vectorize!(w,u)` and 2) `does_not_vectorize!(w,u)`. In both methods, we apply a stencil to `u`, saving the result in `w`. The only difference is that in the `does_vectorize!` case there are 18 points in the stencil while in the `does_not_vectorize!` case there are 19 points in the stencil.

Here is the code run with `julia -O3` on both Julia 1.1.1 and Julia 1.3.0-alpha:

``````function does_vectorize!(w::Array{Float32,3},u::Array{Float32,3})
nz,ny,nx = size(u)
s = ntuple(i->rand(Float32),9)
@inbounds begin
for ix = 5:(nx-5), iy = 5:(ny-5)
@simd for iz = 5:(nz-5)
w[iz,iy,ix] =
s*u[iz-4,iy,  ix  ] + s*u[iz-3,iy,  ix  ] + s*u[iz-2,iy,  ix  ] + s*u[iz-1,iy,  ix  ] + s*u[iz,iy,ix] + s*u[iz+1,iy,  ix  ] + s*u[iz+2,iy,  ix  ] + s*u[iz+3,iy,ix    ] + s*u[iz+4,iy,  ix  ] +
s*u[iz,  iy-4,ix  ] + s*u[iz,  iy-3,ix  ] + s*u[iz,  iy-2,ix  ] + s*u[iz,  iy-1,ix  ] + s*u[iz,iy,ix] + s*u[iz,  iy+1,ix  ] + s*u[iz,  iy+2,ix  ] + s*u[iz,  iy+3,ix  ]
end
end
end
nothing
end
io = open("does_vectorize.txt", "w")
code_llvm(io, does_vectorize!, (Array{Float32,3}, Array{Float32,3}))
close(io)

function does_not_vectorize!(w::Array{Float32,3},u::Array{Float32,3})
nz,ny,nx = size(u)
s = ntuple(i->rand(Float32),9)
@inbounds begin
for ix = 5:(nx-5), iy = 5:(ny-5)
@simd for iz = 5:(nz-5)
w[iz,iy,ix] =
s*u[iz-4,iy,  ix  ] + s*u[iz-3,iy,  ix  ] + s*u[iz-2,iy,  ix  ] + s*u[iz-1,iy,  ix  ] + s*u[iz,iy,ix] + s*u[iz+1,iy,  ix  ] + s*u[iz+2,iy,  ix  ] + s*u[iz+3,iy,ix    ] + s*u[iz+4,iy,  ix  ] +
s*u[iz,  iy-4,ix  ] + s*u[iz,  iy-3,ix  ] + s*u[iz,  iy-2,ix  ] + s*u[iz,  iy-1,ix  ] + s*u[iz,iy,ix] + s*u[iz,  iy+1,ix  ] + s*u[iz,  iy+2,ix  ] + s*u[iz,  iy+3,ix  ] + s*u[iz,  iy+4,ix  ]
end
end
end
nothing
end
io = open("does_not_vectorize.txt", "w")
code_llvm(io, does_not_vectorize!, (Array{Float32,3}, Array{Float32,3}))
close(io)

nz,ny,nx = 512,512,512
u = rand(Float32,nz,ny,nx)
w = zeros(Float32,nz,ny,nx)

using BenchmarkTools
@btime does_vectorize!(w,u)
@btime does_not_vectorize!(w,u)
``````

and the output (similar results are obtained on Julia 1.1.1 and 1.3.0-alpha) from the `@btime` macro:

``````  217.150 ms (1 allocation: 48 bytes)
1.304 s (1 allocation: 48 bytes)
``````

I put the output from `code_llvm` into a gist. The point is that `does_vectorize.txt` contains the expected `vector.body:` section whereas `does_not_vectorize.txt` does not. Can anyone reproduce this? Should I file a GitHub issue?

Thanks!

Sam

1 Like

If I replace `@simd` with `@simd ivdep` the loop gets vectorized. I guess that means that LLVM fails to prove that the loop can be vectorized on its own? It is still not clear to me why LLVM can prove that the first loop can be vectorized where-as in the second case it can’t. So, I’m still wondering if this is a bug, or expected behavior?

I found https://github.com/JuliaLang/julia/issues/31482 useful when trying to understand the issue.