Tullio boundary error for shifted stencil

I implemented three versions of a 1D diffusion kernel, two with Tullio. The third version makes errors near the boundary. Is this illegal syntax, and if so, can I fix it? I would like to avoid the second method in my actual use case, because it would require writing out an enormous number of views.

## Packages.
using BenchmarkTools
using LoopVectorization
using Tullio

## Data generation.
itot = 32

a = rand(itot)
at1 = zeros(itot)
at2 = zeros(itot)
at3 = zeros(itot)

## Test 1: loopvec
function diff1!(at, a, itot)
    @tturbo for i in 2:itot-1
        at[i] = a[i-1] - 2a[i] + a[i+1]
    end
end
@btime diff1!($at1, $a, $itot)

## Test 2: Tullio
at_c = @view at2[2:itot-1]
a_w = @view a[1:itot-2]
a_c = @view a[2:itot-1]
a_e = @view a[3:itot  ]
@btime @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]

## Test 3: Tullio
at_c = @view at3[2:itot-1]
a_c = @view a[2:itot-1]
@btime @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]

## Plot results
using Plots

I think that to match diff1! you simply want to use the same indices, with no views. Using shifted views is an alternative to shifted indices, but using both together is pretty confusing. If you write verbose=true, the macro will print what ranges it infers, which can be a useful check:
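To see that the two styles agree, here is a plain-Julia sketch (no Tullio, just loops; `out1`/`out2` are names I made up) of shifted indices into the full array versus unshifted indices into pre-shifted views:

```julia
# Plain-Julia sketch of the two equivalent indexing styles.
itot = 32
a = rand(itot)
out1 = zeros(itot)
out2 = zeros(itot)

# Style 1: shifted indices into the full array, as in diff1!.
for i in 2:itot-1
    out1[i] = a[i-1] - 2a[i] + a[i+1]
end

# Style 2: unshifted index into pre-shifted views, as in Test 2.
a_w = @view a[1:itot-2]
a_c = @view a[2:itot-1]
a_e = @view a[3:itot]
o_c = @view out2[2:itot-1]
for i in eachindex(o_c)
    o_c[i] = a_w[i] - 2a_c[i] + a_e[i]
end

out1 == out2  # true: same interior values, boundary zeros untouched in both
```

Test 3 mixes the two styles, which is why its inferred range shrinks.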

julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]  verbose=true;
┌ Info: threading threshold (from cost = 1)
└   block = 262144
┌ Info: left index ranges
└   i = 2:31

julia> 2:itot-1
2:31

# compare options 2 & 3:

julia> @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]  verbose=true;
┌ Info: left index ranges
└   i = Base.OneTo(30)

julia> @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]  verbose=true;
┌ Info: left index ranges
└   i = 2:29
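The 2:29 above can be reproduced by hand: each term constrains i on the 30-element view a_c, and the ranges are intersected. A sketch of that rule (hand computation, not Tullio's internals):

```julia
# a_c = @view a[2:itot-1] has length 30, indices 1:30.
n = 30
r_west   = 2:n+1    # a_c[i-1] requires 1 <= i-1 <= n, i.e. i in 2:31
r_center = 1:n      # a_c[i]   requires 1 <= i   <= n, i.e. i in 1:30
r_east   = 0:n-1    # a_c[i+1] requires 1 <= i+1 <= n, i.e. i in 0:29
intersect(r_west, r_center, r_east)  # → 2:29, matching the verbose output
```

Option 2 has no shifts, so each view contributes the same 1:30 and nothing is cut off; option 3 loses one point on each side of the already-shifted view.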

# times

julia> @btime diff1!($at1, $a, $itot)
  10.219 ns (0 allocations: 0 bytes)

julia> @btime @tullio $at1[i] = $a[i-1] - 2($a[i]) + $a[i+1];  # dollars are removed by BenchmarkTools
  18.849 ns (0 allocations: 0 bytes)

julia> @btime @tullio $at1[i] = $a[i-1] - 2($a[i]) + $a[i+1]  threads=false;
  10.291 ns (0 allocations: 0 bytes)

Ah, great, that works indeed. So how would I solve the problem if I had more ghost cells than the width of the stencil? So imagine I wanted to run the kernel on 4:itot-3.

Sorry for the dumb question, why is the threads=false option much faster than threads=true?

It takes a nontrivial amount of time to fork off a thread; I'm not sure exactly how much, but it could easily be 100 µs (you could benchmark that). For something that takes 10 ns to run, the overhead of threading is enormous.
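One rough way to put a number on that overhead (a sketch using Base.Threads; the helper name `spawn_overhead` is mine, and absolute timings will vary by machine):

```julia
using Base.Threads

# Time n spawn/fetch round trips and return the average cost per task.
# A rough sketch, not a rigorous benchmark.
function spawn_overhead(n)
    t = @elapsed for _ in 1:n
        fetch(Threads.@spawn nothing)
    end
    return t / n   # seconds per spawned task
end

spawn_overhead(10_000)   # typically on the order of a microsecond per task
```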

I did:

julia> Threads.nthreads()

julia> function forkthreads()
           Threads.@threads for j in 1:1000
           end
       end
forkthreads (generic function with 1 method)

julia> @btime forkthreads()
  4.519 ms (26010 allocations: 2.56 MiB)

If this worked right, then 5000 threads took 4.5 ms, so it's about 1 µs per forked thread. Compared to 10 ns, that's 100x longer than doing your calculation.


@dlakelan, thank you for the insight.
On a Win10 laptop, the result of your code is quite different:

julia> @btime forkthreads()
  4.450 μs (40 allocations: 4.85 KiB)

NB: the Julia number of threads used was 8:

julia> Threads.nthreads()
8
I think I miscopied my code, hold on:

julia> function forkthreads()
           for i in 1:1000
               Threads.@threads for j in 1:1000
               end
           end
       end
forkthreads (generic function with 1 method)

julia> forkthreads()

julia> @btime forkthreads()
  4.036 ms (26003 allocations: 2.56 MiB)

It’s the outer i loop that makes it fork 1000 times.

So it looks like forking a thread takes on the order of 1 µs.


@dlakelan, so this means that the @btime macro does not (fully) reflect that massive overhead?

No, it means that the code you ran didn't include the outer loop, because I didn't copy it correctly :wink:


These arrays are small enough (length 32) that neither Tullio nor LoopVectorization will thread them, even with threads=true. Setting threads=false removes the check and reduces the amount of code generated.


Yes. It’s somehow spending 8ns thinking about whether it ought to spawn other threads; actually doing so would take much longer. (It uses Threads.@spawn for this, unlike @tturbo.)

The “threading threshold” it prints is its estimate of how many iterations would be needed before multi-threading pays off; if the expression contains, say, log, this number will be lower. It's not very sophisticated, but most problems fall far below or far above that boundary.

You can provide explicit ranges for indices, like so:

julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]   i in 4:itot-3  verbose=true;
┌ Info: threading threshold (from cost = 1)
└   block = 262144
┌ Info: left index ranges
└   i = 4:29

Be warned that explicitly provided ones aren’t currently treated as being any more definite — it infers a range from every term containing i, and then (for indices with any shifts in them) takes the intersection of all of those.
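For the 4:itot-3 case above, that intersection rule works out as follows (a hand computation, not Tullio's internals):

```julia
itot = 32
explicit = 4:itot-3     # the range given as `i in 4:itot-3`
inferred = 2:itot-1     # from a[i-1] and a[i+1] on a length-itot array
intersect(explicit, inferred)  # → 4:29, so the explicit range survives intact
```

If the explicit range were wider than what the terms allow, it would silently be cut down to the inferred one.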


Alright, that is great to know. How does this work for multidimensional arrays? I tried @tullio at[i, j] = a[i-1, j] - 2a[i, j] + a[i+1, j] i in 4:itot-3 verbose=true; for a 2D example, but I would like to extend this to 4:jtot-3 for j as well.