LoopVectorization multithreading for multidimensional arrays

Gregstrq · February 25, 2022, 1:04pm

What is the correct way to use @tturbo to parallelize iteration over multidimensional array?
I can think of several variants:

Parallelize the iteration over cartesian indices: @tturbo for I in eachindex(a)
Parallelize the outermost loop
Parallelize the innermost loop

Can one tell what is the best approach without doing the benchmarks?

goerch · February 25, 2022, 1:23pm

Maybe Tullio is better suited for your task then?

Gregstrq · February 25, 2022, 1:31pm

I don’t know.
I looked at the webpage, and it says that it uses Threads.@spawn. I remember there were posts on discourse which discuss that Threads.@spawn has the problems with thread load imbalance. But may be those were old posts and this problem has been solved by now.

goerch · February 25, 2022, 1:34pm

My idea was that @mcabbott did the benchmarks for you already…

More generally speaking:

your tasks should have a size which justifies the overhead of parallelization
the tasks should be sufficiently data independent to not require extensive locking

So, difficult to know without more details…

Gregstrq · February 25, 2022, 1:59pm

What I am doing looks like this

for ci in eachindex(lattice)
    ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci,2]
    ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
    ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
end

The arrays s and h store local spins and magnetic fields – three-dimensional vectors that live at the nodes of a lattice. So last dimension of s and h has three components and the rest is the cartesian index into the lattice.
What I am doing is computing cross-product at each node of the lattice.

goerch · February 25, 2022, 3:17pm

A baseline could be

using Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[S[ci], cj]*h[H[ci], cj] - s[S[ci], cj]*h[H[ci], cj]
end

for n in [1_000_000, 10_000_000]
    @btime serial1!(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    @btime parallel1!(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))

    @btime serial2!(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    @btime parallel2!(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
end

yielding

  6.259 ms (0 allocations: 0 bytes)
  1.833 ms (46 allocations: 2.91 KiB)
  6.611 ms (0 allocations: 0 bytes)
  1.792 ms (46 allocations: 2.91 KiB)
  63.227 ms (0 allocations: 0 bytes)
  19.339 ms (46 allocations: 2.91 KiB)
  71.845 ms (0 allocations: 0 bytes)
  19.561 ms (46 allocations: 2.91 KiB)

a 3x speedup on a 6 core machine. Let us see if someone has a better idea.

Edit: just checked that @tturbo - annotated loops result in 2x speedup only.

Gregstrq · February 25, 2022, 5:05pm

I just understood that I can always flatten over the lattice dimensions using reshape and then it boils down to the examples that you considered.

Is there a way to estimate this overhead of parallelization?

By the way why do you think the speed up is only 3x and not close to 6x?

goerch · February 25, 2022, 5:13pm

I just made a synthetic example from what you gave us and don’t know anything about the complexity of operations involved.

Good question! In this form the task involved resembles BLAS level 1, which is AFAIU more difficult to parallelize (more memory bound than CPU bound). But exactly this is why I’m interested in others chiming in.

Gregstrq · February 25, 2022, 5:18pm

So, you say BLAS level 2 operations are easier to parallelize than BLAS level 1? Interesting.

I am asking about the overhead estimation because I actually think about rather small arrays of size 63x63x3.

LaurentPlagne · February 25, 2022, 5:33pm

My attempt with basic Julia and StaticArrays

using Tullio, BenchmarkTools
using StaticArrays
using LoopVectorization

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[S[ci], cj]*h[H[ci], cj] - s[S[ci], cj]*h[H[ci], cj]
end

function serialSA!(ds, s, h)
    @. ds = cross(s,h)
end




function parallelSA!(ds, s, h)
    Threads.@threads for i ∈ 1:length(s)
         @inbounds ds[i] = cross(s[i],h[i])
    end
    # @turbo @. ds = cross(s,h)
end

for n in [1_000_000, 10_000_000]

    println("n=$n")

    for f ∈ (serial1!,parallel1!)
        print(string(f)," ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    end

    for f ∈ (serial2!,parallel2!)
        print(string(f)," ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    end

    for f ∈ (serialSA!,parallelSA!)
        print(string(f))
        @btime $f(ds, s, h) setup=(ds =  rand(SVector{3, Float64},$n);  s = rand(SVector{3, Float64},$n);  h = rand(SVector{3, Float64},$n))
    end

end

gives:

n=1000000
serial1!   2.881 ms (0 allocations: 0 bytes)
parallel1!   512.292 μs (103 allocations: 5.69 KiB)
serial2!   2.536 ms (0 allocations: 0 bytes)
parallel2!   767.583 μs (103 allocations: 5.69 KiB)
serialSA!  853.750 μs (0 allocations: 0 bytes)
parallelSA!  486.208 μs (41 allocations: 3.88 KiB)
n=10000000
serial1!   38.810 ms (0 allocations: 0 bytes)
parallel1!   4.770 ms (104 allocations: 5.72 KiB)
serial2!   35.345 ms (0 allocations: 0 bytes)
parallel2!   7.290 ms (104 allocations: 5.72 KiB)
serialSA!  8.907 ms (0 allocations: 0 bytes)
parallelSA!  4.693 ms (41 allocations: 3.88 KiB)

Gregstrq · February 25, 2022, 6:30pm

Strange. I have tried to combine tullio with reshaping. However, the resulting array ds has zero values:

using Tullio, BenchmarkTools

function parallel1!(ds, s, h)
   n = prod(size(ds)[1:end-1])
   dsr = reshape(ds, n,3)
   hr = reshape(h, n,3)
   sr = reshape(s, n,3)
   S = [2, 3, 1]
   H = [3, 1, 2]
   @tullio dsr[ci, cj] = sr[ci, S[cj]]*hr[ci, H[cj]] - sr[ci, S[cj]]*hr[ci, H[cj]]
end

h = randn(63,63,3)
s = randn(63,63,3)
ds = similar(s)

@btime parallel1!(ds, s, h)

As I said, resulting ds has purely zero entries.

Gregstrq · February 25, 2022, 6:44pm

Forget about reshapes. There is simply a problem with Tullio version

using Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

h = randn(63^2,3)
s = randn(63^2,3)
ds = similar(s)

@btime serial1!(ds, s, h)

@btime parallel1!(ds, s, h)

For some reason, parallel1! returns purely zero ds while serial1! does not.

mcabbott · February 25, 2022, 6:47pm

The two terms are the same here, so the result is zero. The serial version has S & H the other way around in one term.

Changing to tuples like S = (2, 3, 1) is worth a factor of 2 in speed.

Gregstrq · February 25, 2022, 7:07pm

Thanks to everyone. Especially to @mcabbott for the nice package.

goerch · February 25, 2022, 7:41pm

Fixing an embarrassing indexing bug, adding validation, incorporating the improvements from @LaurentPlagne and @mcabbott (big thanks!) and changing the bench marking range downwards

using LinearAlgebra, StaticArrays, LoopVectorization, Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    idx1 = (2, 3, 1)
    idx2 = (3, 1, 2)
    @tullio ds[ci, cj] = s[ci, idx1[cj]]*h[ci, idx2[cj]] - s[ci, idx2[cj]]*h[ci, idx1[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    idx1 = (2, 3, 1)
    idx2 = (3, 1, 2)
    @tullio ds[ci, cj] = s[idx1[ci], cj]*h[idx2[ci], cj] - s[idx2[ci], cj]*h[idx1[ci], cj]
end

function serialSA!(ds, s, h)
    @. ds = cross(s,h)
end

function parallelSA!(ds, s, h)
    Threads.@threads for i ∈ 1:length(s)
         @inbounds ds[i] = cross(s[i],h[i])
    end
end

s = rand(10, 3)
h = rand(10, 3)
ds11 = Matrix{Float64}(undef, 10, 3)
serial1!(ds11, s, h)
ds12 = Matrix{Float64}(undef, 10, 3)
parallel1!(ds12, s, h)
@assert ds12 ≈ ds11

# s = rand(3, 10)
# h = rand(3, 10)
ds21 = Matrix{Float64}(undef, 3, 10)
serial2!(ds21, s', h')
@assert ds21 ≈ ds12'
ds22 = Matrix{Float64}(undef, 3, 10)
parallel2!(ds22, s', h')
@assert ds22 ≈ ds21

 for n in [100, 10_000, 1_000_000]
    println("n=$n")

    for f ∈ (serial1!,parallel1!)
        print(string(f)," ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    end

    for f ∈ (serial2!,parallel2!)
        print(string(f)," ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    end

    for f ∈ (serialSA!,parallelSA!)
        print(string(f))
        @btime $f(ds, s, h) setup=(ds = rand(SVector{3, Float64},$n);  s = rand(SVector{3, Float64},$n);  h = rand(SVector{3, Float64},$n))
    end
end

I see now

n=100
serial1!   279.452 ns (0 allocations: 0 bytes)
parallel1!   148.317 ns (0 allocations: 0 bytes)
serial2!   257.307 ns (0 allocations: 0 bytes)
parallel2!   481.026 ns (0 allocations: 0 bytes)
serialSA!  147.768 ns (0 allocations: 0 bytes)
parallelSA!  2.712 μs (31 allocations: 3.38 KiB)
n=10000
serial1!   27.200 μs (0 allocations: 0 bytes)
parallel1!   22.500 μs (0 allocations: 0 bytes)
serial2!   26.900 μs (0 allocations: 0 bytes)
parallel2!   54.600 μs (0 allocations: 0 bytes)
serialSA!  16.600 μs (0 allocations: 0 bytes)
parallelSA!  16.800 μs (30 allocations: 3.34 KiB)
n=1000000
serial1!   6.023 ms (0 allocations: 0 bytes)
parallel1!   3.329 ms (44 allocations: 3.03 KiB)
serial2!   6.669 ms (0 allocations: 0 bytes)
parallel2!   3.521 ms (44 allocations: 3.03 KiB)
serialSA!  4.624 ms (0 allocations: 0 bytes)
parallelSA!  3.197 ms (31 allocations: 3.38 KiB)

so the speedup is more like 2x for 6 cores on my machine.

Gregstrq · February 25, 2022, 7:43pm

I observe a 2x speedup on my 8 core machine as well.

Elrod · February 25, 2022, 8:21pm

n=100
serial1!   138.134 ns (0 allocations: 0 bytes)
tturbo1!   30.236 ns (0 allocations: 0 bytes)
parallel1!   126.870 ns (0 allocations: 0 bytes)
serial2!   140.211 ns (0 allocations: 0 bytes)
tturbo2!   288.726 ns (0 allocations: 0 bytes)
parallel2!   353.702 ns (0 allocations: 0 bytes)
serialSA!  57.311 ns (0 allocations: 0 bytes)
parallelSA!  5.393 μs (49 allocations: 4.37 KiB)
n=10000
serial1!   17.153 μs (0 allocations: 0 bytes)
tturbo1!   2.922 μs (0 allocations: 0 bytes)
parallel1!   18.699 μs (0 allocations: 0 bytes)
serial2!   20.470 μs (0 allocations: 0 bytes)
tturbo2!   4.430 μs (0 allocations: 0 bytes)
parallel2!   42.604 μs (0 allocations: 0 bytes)
serialSA!  5.888 μs (0 allocations: 0 bytes)
parallelSA!  26.970 μs (49 allocations: 4.36 KiB)
n=1000000
serial1!   2.561 ms (0 allocations: 0 bytes)
tturbo1!   1.786 ms (0 allocations: 0 bytes)
parallel1!   1.735 ms (101 allocations: 6.19 KiB)
serial2!   3.352 ms (0 allocations: 0 bytes)
tturbo2!   1.764 ms (0 allocations: 0 bytes)
parallel2!   1.857 ms (102 allocations: 6.22 KiB)
serialSA!  3.064 ms (0 allocations: 0 bytes)
parallelSA!  1.635 ms (50 allocations: 4.39 KiB)

I added @inbounds @fastmath to the serial versions. @tturbo is the serial versions but with @tturbo.
This is on a laptop with 4 cores/8 threads.
I’ll try on an 18 core desktop in a few minutes, when it is done running something else.

Gregstrq · February 25, 2022, 8:28pm

@Elrod What is the difference between tturbo1! and tturbo2!?

I see, so @tturbo is much better at small array sizes, while @tullio catches up with that for larger arrays.

I defined (in addition to reshape versions of serial1! and parallel1!)

function tturbo!(ds, s, h)
   n = prod(size(ds)[1:end-1])
   dsr = reshape(ds, n,3)
   hr = reshape(h, n,3)
   sr = reshape(s, n,3)
   @tturbo for ci in 1:size(dsr, 1)
       dsr[ci, 1] = sr[ci, 2]*hr[ci, 3] - sr[ci, 3]*hr[ci, 2]
       dsr[ci, 2] = sr[ci, 3]*hr[ci, 1] - sr[ci, 1]*hr[ci, 3]
       dsr[ci, 3] = sr[ci, 1]*hr[ci, 2] - sr[ci, 2]*hr[ci, 1]
   end 
end

Now, if I do

h = randn(63,63,3); s = randn(63,63,3); ds = similar(s)

@btime serial1!(ds, s, h)
@btime parallel1!(ds, s, h)
@btime tturbol1!(ds, s, h)

leads to the timings

12.082 μs
5.238 μs
865.490 ns

But for h = randn(1000,1000,3); s = randn(1000,1000,3); ds = similar(s) I get

3.866 ms
1.951 ms
2.896 ms

So, on my machine with 8 cores @tturbo schreds everything for small arrays but loses to @tullio for large ones

goerch · February 25, 2022, 8:31pm

Probably memory layout?

Elrod · February 25, 2022, 8:44pm

Yes, the memory layout is different and will make performance worse, but apparently by not nearly as much as LoopVectorization’s threading heuristics think.
Try removing the t (just @turbo), and you’ll notice it performs much better. Apparently LV has decided to start threading too soon.

julia> ds = Matrix{Float64}(undef, 3, n); s = rand(3, n); h = rand(3, n);

julia> @benchmark turbo2!($ds, $s, $h)
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  125.242 ns … 320.422 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.955 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   130.111 ns ±   4.505 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▃ █▁          █
  ▂▂▃▂▁▁▁▁▁▁▁█▆██▂▂▃▂▁▁▁▄█▄█▂▂▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  125 ns           Histogram: frequency by time          141 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark tturbo2!($ds, $s, $h)
BenchmarkTools.Trial: 10000 samples with 276 evaluations.
 Range (min … max):  289.217 ns … 734.725 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     301.228 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.903 ns ±   7.861 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▃▆▇██▆▄▃▂▁
  ▂▁▂▂▂▂▂▂▂▂▃▃▄▆███████████▇▆▅▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  289 ns           Histogram: frequency by time          326 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I’m not sure while Tullio catches up for larger sizes.
LV will limit itself to at most 1 thread per core, and also ramps thread usage up fairly slowly.
May be a good idea to monitor htop while running the benchmarks.