I implemented three versions of a 1D diffusion kernel, two of them with Tullio. The third version gives wrong results near the boundary. Is this invalid syntax, and if so, how can I fix it? I would like to avoid the second method in my actual use case, because it would require writing out an enormous number of views.
## Packages.
using BenchmarkTools
using LoopVectorization
using Tullio
## Data generation.
itot = 32
a = rand(itot)
at1 = zeros(itot)
at2 = zeros(itot)
at3 = zeros(itot)
## Test 1: loopvec
function diff1!(at, a, itot)
    @tturbo for i in 2:itot-1
        at[i] = a[i-1] - 2a[i] + a[i+1]
    end
end
@btime diff1!($at1, $a, $itot)
## Test 2: Tullio
at_c = @view at2[2:itot-1]
a_w = @view a[1:itot-2]
a_c = @view a[2:itot-1]
a_e = @view a[3:itot]
@btime @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]
## Test 3: Tullio
at_c = @view at3[2:itot-1]
a_c = @view a[2:itot-1]
@btime @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]
## Plot results
using Plots
plot(at1[2:itot-1])
plot!(at2[2:itot-1])
plot!(at3[2:itot-1])
I think that to match diff1! you simply want to use the same indices, with no views. Using shifted views is an alternative to shifted indices, but using both together is pretty confusing. If you write verbose=true, the macro will print what ranges it infers, which can be a useful check:
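For example, something along these lines is what I mean (a sketch, writing into at3 from the question; with itot = 32 the inferred range should come out as i = 2:31, the same loop that diff1! runs):
@tullio at3[i] = a[i-1] - 2a[i] + a[i+1] verbose=true
With no explicit range given, the macro intersects the ranges implied by at3[i], a[i-1] and a[i+1], so the boundary points are simply left untouched.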
Ah great, that works indeed. How would I solve the problem, though, if I had more ghost cells than the width of the stencil? Imagine, for instance, that I wanted to run the kernel over 4:itot-3.
It takes a nontrivial amount of time to fork off a thread; I'm not sure exactly how much, but it could easily be 100us (you could benchmark that). For something that takes 10ns to run, the overhead of threading is enormous.
I did:
julia> Threads.nthreads()
5
julia> function forkthreads()
           for i in 1:1000
               Threads.@threads for j in 1:1000
               end
           end
       end

julia> @btime forkthreads()
  4.519 ms (26010 allocations: 2.56 MiB)
If this worked right, then 5000 thread forks took 4.5 ms, so it's roughly 1us per fork. Compared to the 10ns your calculation takes, that's 100x longer than doing the calculation itself.
julia> function forkthreads()
           for i in 1:1000
               Threads.@threads for j in 1:1000
               end
           end
       end
forkthreads (generic function with 1 method)

julia> forkthreads()

julia> @btime forkthreads()
  4.036 ms (26003 allocations: 2.56 MiB)
It’s the outer i loop that makes it fork 1000 times.
So it looks like forking a thread takes on the order of 1us.
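To isolate the cost of a single fork, you could drop the outer loop and time one @threads invocation on an empty body (a sketch; forkonce is just a throwaway name, and the number will depend on the machine and on Threads.nthreads()):
function forkonce()
    Threads.@threads for j in 1:1000  # one fork of nthreads() tasks, no work inside
    end
end
@btime forkonce()
If the estimate above holds, this should land in the low microseconds per call.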
These arrays are small enough (length 32) that neither Tullio nor LoopVectorization will actually thread them, even when threading is turned on. Setting threads=false removes the check and reduces the amount of code generated.
Yes. It’s somehow spending 8ns thinking about whether it ought to spawn other threads; actually doing so would take much longer. (It uses Threads.@spawn for this, unlike @tturbo.)
The “threading threshold” it prints is its estimate of how many iterations would be needed before multi-threading is worthwhile; if the expression contains, say, a call to log, this number will be lower. It's not very sophisticated, but many problems are very far below or above the boundary.
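So for a kernel this small, you can skip the check entirely by passing the keyword yourself (a sketch, reusing the arrays from the question):
@tullio at3[i] = a[i-1] - 2a[i] + a[i+1] threads=false
With threads=false the generated code never compares the iteration count against that threshold; it just runs the loop serially.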
You can provide explicit ranges for indices, like so:
julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1] i in 4:itot-3 verbose=true;
┌ Info: threading threshold (from cost = 1)
└   block = 262144
┌ Info: left index ranges
└   i = 4:29
Be warned that an explicitly provided range isn't currently treated as being any more definite: the macro infers a range from every term containing i, and then (for indices with any shifts in them) takes the intersection of all of those.
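As a quick illustration of that intersection (a sketch; itot = 32 as above), asking for the full range here should still report i = 2:31, because the shifted terms a[i-1] and a[i+1] each cut it down:
@tullio at1[i] = a[i-1] - 2a[i] + a[i+1] i in 1:itot verbose=true;
In other words, an explicit range can shrink the loop, as 4:itot-3 does above, but it can't push it past what the shifted terms allow.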
Alright, that is great to know. How does this work for multidimensional arrays? I tried @tullio at[i, j] = a[i-1, j] - 2a[i, j] + a[i+1, j] i in 4:itot-3 verbose=true; for a 2D example, but I would like to extend this to 4:jtot-3 for j as well.