# Tullio boundary error for shifted stencil

I implemented three versions of a 1D diffusion kernel, two with Tullio. The third version makes errors near the boundary. Is this illegal syntax and if so, can I fix this? I would like to avoid the second method in my actual use case, because it would require me to write out an enormous amount of views.

``````## Packages.
using BenchmarkTools
using LoopVectorization
using Tullio

## Data generation.
itot = 32

a = rand(itot)
at1 = zeros(itot)
at2 = zeros(itot)
at3 = zeros(itot)

## Test 1: loopvec
function diff1!(at, a, itot)
@tturbo for i in 2:itot-1
at[i] = a[i-1] - 2a[i] + a[i+1]
end
end
@btime diff1!(\$at1, \$a, \$itot)

## Test 2: Tullio
at_c = @view at2[2:itot-1]
a_w = @view a[1:itot-2]
a_c = @view a[2:itot-1]
a_e = @view a[3:itot  ]
@btime @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]

## Test 3: Tullio
at_c = @view at3[2:itot-1]
a_c = @view a[2:itot-1]
@btime @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]

## Plot results
using Plots
plot(at1[2:itot-1])
plot!(at2[2:itot-1])
plot!(at3[2:itot-1])
``````

I think that to match `diff1!` you simply want to use the same indices, with no `view`s. Using shifted views is an alternative to shifted indices, but using both together is pretty confusing. If you write `verbose=true`, the macro will print what ranges it infers, which can be a useful check:

``````julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]  verbose=true;
â”Ś Info: threading threshold (from cost = 1)
â””   block = 262144
â”Ś Info: left index ranges
â””   i = 2:31

julia> 2:itot-1
2:31

# compare options 2 & 3:

julia> @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]  verbose=true;
â”Ś Info: left index ranges
â””   i = Base.OneTo(30)

julia> @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]  verbose=true;
â”Ś Info: left index ranges
â””   i = 2:29

# times

julia> @btime diff1!(\$at1, \$a, \$itot)
10.219 ns (0 allocations: 0 bytes)

julia> @btime @tullio \$at1[i] = \$a[i-1] - 2(\$a[i]) + \$a[i+1];  # dollars are removed by BenchmarkTools
18.849 ns (0 allocations: 0 bytes)

julia> @btime @tullio \$at1[i] = \$a[i-1] - 2(\$a[i]) + \$a[i+1]  threads=false;
10.291 ns (0 allocations: 0 bytes)
``````
1 Like

Ah, great, that works indeed. So how would I solve the problem if I had more ghost cells than the width of the stencil? So imagine I wanted to run the kernel on `4:itot-3`.

Sorry for the dumb question, why is the `threads=false` option much faster than `threads=true`?

it takes a nontrivial amount of time to fork off a thread, not sure exactly but it could be 100us easily, you could maybe benchmark that, but for something that takes 10ns to run, the overhead of threading is enormous.

I did:

``````julia> Threads.nthreads()
5

end
end
end
4.519 ms (26010 allocations: 2.56 MiB)
``````

If this worked right, then 5000 threads took 4.5ms so itâ€™s like 1us for forking a thread. compared to 10ns thatâ€™s 100x longer than doing your calc.

2 Likes

@dlakelan, thank you for the insight.
In Win10 laptop, the result of your code is quite different:

``````julia> @btime forkthreads()
4.450 ÎĽs (40 allocations: 4.85 KiB)
``````

NB: the Julia number of threads used was 8:

``````julia> Threads.nthreads()
8
``````

I think I miscopied my code, hold on:

``````julia> function forkthreads()
for i in 1:1000
end
end
end
forkthreads (generic function with 1 method)

4.036 ms (26003 allocations: 2.56 MiB)

``````

Itâ€™s the outer i loop that makes it fork 1000 times.

So it looks like forking a thread takes on the order of 1us

1 Like

@dlakelan, so this means that the @btime macro does not (fully) reflect that massive overhead?

no it means that the code you ran didnâ€™t include the outer loop because I didnâ€™t copy it correctly

1 Like

These arrays are small enough (length `32`) so that neither Tullio or LoopVectorization will thread them, even when set to true. Setting to false removes the check and reduces the amount of code generated.

2 Likes

Yes. Itâ€™s somehow spending 8ns thinking about whether it ought to spawn other threads; actually doing so would take much longer. (It uses `Threads.@spawn` for this, unlike `@tturbo`.)

The â€śthreading thresholdâ€ť it prints is its estimate of how many iterations would be needed before itâ€™s worth multi-threading; if the expression contains say `log` this number will be lower. Itâ€™s not very sophisticated but many problems are very far below or above the boundary.

You can provide explicit ranges for indices, like so:

``````julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]   i in 4:itot-3  verbose=true;
â”Ś Info: threading threshold (from cost = 1)
â””   block = 262144
â”Ś Info: left index ranges
â””   i = 4:29
``````

Be warned that explicitly provided ones arenâ€™t currently treated as being any more definite â€” it infers a range from every term containing `i`, and then (for indices with any shifts in them) takes the intersection of all of those.

1 Like

Alright, that is great to know. How does this work for multidimensional arrays? I tried `@tullio at[i, j] = a[i-1, j] - 2a[i, j] + a[i+1, j] i in 4:itot-3 verbose=true;` for a 2D example, but I would like to extend this to 4:jtot-3 for j as well.