Tullio boundary error for shifted stencil

Chiil · July 28, 2021, 2:36pm

I implemented three versions of a 1D diffusion kernel, two with Tullio. The third version makes errors near the boundary. Is this illegal syntax and if so, can I fix this? I would like to avoid the second method in my actual use case, because it would require me to write out an enormous amount of views.

## Packages.
using BenchmarkTools
using LoopVectorization
using Tullio

## Data generation.
itot = 32

a = rand(itot)
at1 = zeros(itot)
at2 = zeros(itot)
at3 = zeros(itot)

## Test 1: loopvec
function diff1!(at, a, itot)
    @tturbo for i in 2:itot-1
        at[i] = a[i-1] - 2a[i] + a[i+1]
    end
end
@btime diff1!($at1, $a, $itot)

## Test 2: Tullio
at_c = @view at2[2:itot-1]
a_w = @view a[1:itot-2]
a_c = @view a[2:itot-1]
a_e = @view a[3:itot  ]
@btime @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]

## Test 3: Tullio
at_c = @view at3[2:itot-1]
a_c = @view a[2:itot-1]
@btime @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]

## Plot results
using Plots
plot(at1[2:itot-1])
plot!(at2[2:itot-1])
plot!(at3[2:itot-1])

mcabbott · July 28, 2021, 4:24pm

I think that to match diff1! you simply want to use the same indices, with no views. Using shifted views is an alternative to shifted indices, but using both together is pretty confusing. If you write verbose=true, the macro will print what ranges it infers, which can be a useful check:

julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]  verbose=true;
┌ Info: threading threshold (from cost = 1)
└   block = 262144
┌ Info: left index ranges
└   i = 2:31

julia> 2:itot-1
2:31

# compare options 2 & 3:

julia> @tullio at_c[i] = a_w[i] - 2a_c[i] + a_e[i]  verbose=true;
┌ Info: left index ranges
└   i = Base.OneTo(30)

julia> @tullio at_c[i] = a_c[i-1] - 2a_c[i] + a_c[i+1]  verbose=true;
┌ Info: left index ranges
└   i = 2:29

# times

julia> @btime diff1!($at1, $a, $itot)
  10.219 ns (0 allocations: 0 bytes)

julia> @btime @tullio $at1[i] = $a[i-1] - 2($a[i]) + $a[i+1];  # dollars are removed by BenchmarkTools
  18.849 ns (0 allocations: 0 bytes)

julia> @btime @tullio $at1[i] = $a[i-1] - 2($a[i]) + $a[i+1]  threads=false;
  10.291 ns (0 allocations: 0 bytes)

Chiil · July 28, 2021, 6:13pm

Ah, great, that works indeed. So how would I solve the problem if I had more ghost cells than the width of the stencil? So imagine I wanted to run the kernel on 4:itot-3.

rafael.guerra · July 28, 2021, 7:00pm

Sorry for the dumb question, why is the threads=false option much faster than threads=true?

dlakelan · July 28, 2021, 7:41pm

it takes a nontrivial amount of time to fork off a thread, not sure exactly but it could be 100us easily, you could maybe benchmark that, but for something that takes 10ns to run, the overhead of threading is enormous.

I did:

julia> Threads.nthreads()
5

julia> function forkthreads()
       Threads.@threads for j in 1:1000
       end
       end
       end
@btime forkthreads()
  4.519 ms (26010 allocations: 2.56 MiB)

If this worked right, then 5000 threads took 4.5ms so it’s like 1us for forking a thread. compared to 10ns that’s 100x longer than doing your calc.

rafael.guerra · July 28, 2021, 7:57pm

@dlakelan, thank you for the insight.
In Win10 laptop, the result of your code is quite different:

julia> @btime forkthreads()
  4.450 μs (40 allocations: 4.85 KiB)

NB: the Julia number of threads used was 8:

julia> Threads.nthreads()
8

dlakelan · July 28, 2021, 8:04pm

I think I miscopied my code, hold on:

julia> function forkthreads()
       for i in 1:1000
       Threads.@threads for j in 1:1000
       end
       end
       end
forkthreads (generic function with 1 method)

julia> forkthreads()

julia> @btime forkthreads()
  4.036 ms (26003 allocations: 2.56 MiB)

It’s the outer i loop that makes it fork 1000 times.

So it looks like forking a thread takes on the order of 1us

rafael.guerra · July 28, 2021, 8:08pm

@dlakelan, so this means that the @btime macro does not (fully) reflect that massive overhead?

dlakelan · July 28, 2021, 8:09pm

no it means that the code you ran didn’t include the outer loop because I didn’t copy it correctly

Elrod · July 28, 2021, 8:40pm

These arrays are small enough (length 32) so that neither Tullio or LoopVectorization will thread them, even when set to true. Setting to false removes the check and reduces the amount of code generated.

mcabbott · July 28, 2021, 10:37pm

Yes. It’s somehow spending 8ns thinking about whether it ought to spawn other threads; actually doing so would take much longer. (It uses Threads.@spawn for this, unlike @tturbo.)

The “threading threshold” it prints is its estimate of how many iterations would be needed before it’s worth multi-threading; if the expression contains say log this number will be lower. It’s not very sophisticated but many problems are very far below or above the boundary.

You can provide explicit ranges for indices, like so:

julia> @tullio at1[i] = a[i-1] - 2a[i] + a[i+1]   i in 4:itot-3  verbose=true;
┌ Info: threading threshold (from cost = 1)
└   block = 262144
┌ Info: left index ranges
└   i = 4:29

Be warned that explicitly provided ones aren’t currently treated as being any more definite — it infers a range from every term containing i, and then (for indices with any shifts in them) takes the intersection of all of those.

Chiil · July 29, 2021, 7:05am

Alright, that is great to know. How does this work for multidimensional arrays? I tried @tullio at[i, j] = a[i-1, j] - 2a[i, j] + a[i+1, j] i in 4:itot-3 verbose=true; for a 2D example, but I would like to extend this to 4:jtot-3 for j as well.

Topic		Replies	Views
Tullio seems two times slower than basic LoopVectorization Performance question , tullio , loopvectorization	3	1061	April 10, 2022
Oddities with loop with dummy variable in Tullio General Usage question , bug , tullio	2	364	December 9, 2022
Tullio with pre-computed index broadcasting General Usage question , indexing , tullio	2	361	April 11, 2023
LoopVectorization multithreading for multidimensional arrays Numerics loopvectorization	24	1233	March 17, 2022
Unexpected Tullio behavior Performance macros , tullio	6	582	October 31, 2020

Tullio boundary error for shifted stencil

Related topics