# LoopVectorization multithreading for multidimensional arrays

What is the correct way to use `@tturbo` to parallelize iteration over a multidimensional array?
I can think of several variants:

• Parallelize the iteration over cartesian indices: `@tturbo for I in eachindex(a)`
• Parallelize the outermost loop
• Parallelize the innermost loop
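For concreteness, here is a sketch of the first two variants on a hypothetical elementwise kernel (`2a`, not the actual spin problem). Note that `@tturbo` takes the whole loop nest and decides the threading itself, so "parallelize only the outermost loop" has to be spelled with `Threads.@threads` around a `@turbo` inner loop:

```julia
using LoopVectorization

# Variant 1: one flat loop over all elements
# (eachindex gives linear indices for a plain Array)
function flat!(b, a)
    @tturbo for I in eachindex(a)
        b[I] = 2 * a[I]
    end
end

# Variant 2: thread the outermost loop by hand, vectorize the innermost
function outer!(b, a)
    Threads.@threads for j in axes(a, 2)
        @turbo for i in axes(a, 1)
            b[i, j] = 2 * a[i, j]
        end
    end
end
```

Threading the innermost loop the same way is usually the worst of the three options, since it competes with SIMD and pays the task-spawn cost once per outer iteration.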

Can one tell what is the best approach without doing the benchmarks?

I don't know.
I looked at the webpage, and it says that it uses Threads.@spawn. I remember there were posts on Discourse discussing that Threads.@spawn has problems with thread load imbalance. But maybe those were old posts and this problem has been solved by now.

My idea was that @mcabbott did the benchmarks for you already…

More generally speaking:

• the tasks should be sufficiently data independent to not require extensive locking

So, difficult to know without more details…

What I am doing looks like this

``````
for ci in eachindex(lattice)
    ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
    ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
    ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
end
``````

The arrays `s` and `h` store local spins and magnetic fields: three-dimensional vectors that live at the nodes of a lattice. So the last dimension of `s` and `h` has three components, and the remaining dimensions form the Cartesian index into the lattice.
What I am doing is computing the cross product at each node of the lattice.

A baseline could be

``````
using Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[S[ci], cj]*h[H[ci], cj] - s[S[ci], cj]*h[H[ci], cj]
end

for n in [1_000_000, 10_000_000]
    @btime serial1!(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    @btime parallel1!(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))

    @btime serial2!(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    @btime parallel2!(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
end
``````

yielding

``````
  6.259 ms (0 allocations: 0 bytes)
  1.833 ms (46 allocations: 2.91 KiB)
  6.611 ms (0 allocations: 0 bytes)
  1.792 ms (46 allocations: 2.91 KiB)
 63.227 ms (0 allocations: 0 bytes)
 19.339 ms (46 allocations: 2.91 KiB)
 71.845 ms (0 allocations: 0 bytes)
 19.561 ms (46 allocations: 2.91 KiB)
``````

a 3x speedup on a 6-core machine. Let us see if someone has a better idea.

Edit: just checked that `@tturbo`-annotated loops result in only a 2x speedup.


I just realized that I can always flatten the lattice dimensions using `reshape`, and then it boils down to the examples that you considered.
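For a 63×63 lattice, for example, the flattened array is a view that shares memory with the original, so writing through it is enough:

```julia
h = randn(63, 63, 3)
hr = reshape(h, :, 3)       # 3969×3, shares memory with h
hr[1, 2] = 0.0
@assert h[1, 1, 2] == 0.0   # same underlying storage
```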

Is there a way to estimate the overhead of parallelization?

By the way, why do you think the speedup is only 3x and not closer to 6x?

I just made a synthetic example from what you gave us and don't know anything about the complexity of the operations involved.

Good question! In this form the task resembles BLAS level 1, which is AFAIU more difficult to parallelize (more memory-bound than CPU-bound). But exactly this is why I'm interested in others chiming in.
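A rough traffic estimate from the `n = 10_000_000` timings quoted earlier is consistent with the memory-bound picture (a back-of-the-envelope sketch, not a measurement):

```julia
n = 10_000_000
# per call: read s, read h, write ds — three 3n-element Float64 arrays
traffic_gb = 3 * 3n * 8 / 1e9      # ≈ 0.72 GB of memory traffic

# timings from the benchmark above: serial1! ≈ 63 ms, parallel1! ≈ 19 ms
serial_bw   = traffic_gb / 63e-3   # ≈ 11 GB/s
parallel_bw = traffic_gb / 19e-3   # ≈ 38 GB/s, approaching DRAM bandwidth
```

Once the parallel version is near the machine's memory bandwidth, adding more cores cannot help much, which would explain a 3x rather than 6x speedup.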

So, you say BLAS level 2 operations are easier to parallelize than BLAS level 1? Interesting.

My attempt with basic Julia and StaticArrays

``````
using Tullio, BenchmarkTools
using LinearAlgebra   # provides cross
using StaticArrays
using LoopVectorization

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[S[ci], cj]*h[H[ci], cj] - s[S[ci], cj]*h[H[ci], cj]
end

function serialSA!(ds, s, h)
    @. ds = cross(s, h)
end

function parallelSA!(ds, s, h)
    Threads.@threads for i in eachindex(ds)   # thread over lattice sites
        @inbounds ds[i] = cross(s[i], h[i])
    end
    # @turbo @. ds = cross(s,h)
end

for n in [1_000_000, 10_000_000]

    println("n=$n")

    for f ∈ (serial1!, parallel1!)
        print(string(f), " ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    end

    for f ∈ (serial2!, parallel2!)
        print(string(f), " ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    end

    for f ∈ (serialSA!, parallelSA!)
        print(string(f))
        @btime $f(ds, s, h) setup=(ds = rand(SVector{3, Float64}, $n); s = rand(SVector{3, Float64}, $n); h = rand(SVector{3, Float64}, $n))
    end

end
``````

gives:

``````
n=1000000
serial1!   2.881 ms (0 allocations: 0 bytes)
parallel1!   512.292 µs (103 allocations: 5.69 KiB)
serial2!   2.536 ms (0 allocations: 0 bytes)
parallel2!   767.583 µs (103 allocations: 5.69 KiB)
serialSA!  853.750 µs (0 allocations: 0 bytes)
parallelSA!  486.208 µs (41 allocations: 3.88 KiB)
n=10000000
serial1!   38.810 ms (0 allocations: 0 bytes)
parallel1!   4.770 ms (104 allocations: 5.72 KiB)
serial2!   35.345 ms (0 allocations: 0 bytes)
parallel2!   7.290 ms (104 allocations: 5.72 KiB)
serialSA!  8.907 ms (0 allocations: 0 bytes)
parallelSA!  4.693 ms (41 allocations: 3.88 KiB)
``````

Strange. I have tried to combine tullio with reshaping. However, the resulting array `ds` has zero values:

``````
using Tullio, BenchmarkTools

function parallel1!(ds, s, h)
    n = prod(size(ds)[1:end-1])
    dsr = reshape(ds, n, 3)
    hr = reshape(h, n, 3)
    sr = reshape(s, n, 3)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio dsr[ci, cj] = sr[ci, S[cj]]*hr[ci, H[cj]] - sr[ci, S[cj]]*hr[ci, H[cj]]
end

h = randn(63,63,3)
s = randn(63,63,3)
ds = similar(s)

@btime parallel1!(ds, s, h)
``````

As I said, the resulting `ds` has all-zero entries.

Forget about the reshapes. There is simply a problem with the Tullio version:

``````
using Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    S = [2, 3, 1]
    H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, S[cj]]*h[ci, H[cj]]
end

h = randn(63^2,3)
s = randn(63^2,3)
ds = similar(s)

@btime serial1!(ds, s, h)

@btime parallel1!(ds, s, h)
``````

For some reason, `parallel1!` returns an all-zero `ds` while `serial1!` does not.

The two terms are the same here, so the result is zero. The serial version has S & H the other way around in one term.

Changing to tuples like `S = (2, 3, 1)` is worth a factor of 2 in speed.
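A sketch of both fixes together, as a hypothetical `cross_tuple!` (the tuple's contents are compile-time constants, so the indirect lookups can be unrolled, whereas a `Vector` lookup stays a runtime load; one term's `S`/`H` are also swapped so the two terms no longer cancel):

```julia
using Tullio

function cross_tuple!(ds, s, h)
    S = (2, 3, 1)   # was S = [2, 3, 1]
    H = (3, 1, 2)   # was H = [3, 1, 2]
    @tullio ds[ci, cj] = s[ci, S[cj]]*h[ci, H[cj]] - s[ci, H[cj]]*h[ci, S[cj]]
end
```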


Thanks to everyone. Especially to @mcabbott for the nice package.

Fixing an embarrassing indexing bug, adding validation, incorporating the improvements from @LaurentPlagne and @mcabbott (big thanks!), and shifting the benchmarking range downwards:

``````
using LinearAlgebra, StaticArrays, LoopVectorization, Tullio, BenchmarkTools

function serial1!(ds, s, h)
    for ci in 1:size(ds, 1)
        ds[ci, 1] = s[ci, 2]*h[ci, 3] - s[ci, 3]*h[ci, 2]
        ds[ci, 2] = s[ci, 3]*h[ci, 1] - s[ci, 1]*h[ci, 3]
        ds[ci, 3] = s[ci, 1]*h[ci, 2] - s[ci, 2]*h[ci, 1]
    end
end

function parallel1!(ds, s, h)
    idx1 = (2, 3, 1)
    idx2 = (3, 1, 2)
    @tullio ds[ci, cj] = s[ci, idx1[cj]]*h[ci, idx2[cj]] - s[ci, idx2[cj]]*h[ci, idx1[cj]]
end

function serial2!(ds, s, h)
    for cj in 1:size(ds, 2)
        ds[1, cj] = s[2, cj]*h[3, cj] - s[3, cj]*h[2, cj]
        ds[2, cj] = s[3, cj]*h[1, cj] - s[1, cj]*h[3, cj]
        ds[3, cj] = s[1, cj]*h[2, cj] - s[2, cj]*h[1, cj]
    end
end

function parallel2!(ds, s, h)
    idx1 = (2, 3, 1)
    idx2 = (3, 1, 2)
    @tullio ds[ci, cj] = s[idx1[ci], cj]*h[idx2[ci], cj] - s[idx2[ci], cj]*h[idx1[ci], cj]
end

function serialSA!(ds, s, h)
    @. ds = cross(s, h)
end

function parallelSA!(ds, s, h)
    Threads.@threads for i in eachindex(ds)   # thread over lattice sites
        @inbounds ds[i] = cross(s[i], h[i])
    end
end

s = rand(10, 3)
h = rand(10, 3)
ds11 = Matrix{Float64}(undef, 10, 3)
serial1!(ds11, s, h)
ds12 = Matrix{Float64}(undef, 10, 3)
parallel1!(ds12, s, h)
@assert ds12 ≈ ds11

# s = rand(3, 10)
# h = rand(3, 10)
ds21 = Matrix{Float64}(undef, 3, 10)
serial2!(ds21, s', h')
@assert ds21 ≈ ds12'
ds22 = Matrix{Float64}(undef, 3, 10)
parallel2!(ds22, s', h')
@assert ds22 ≈ ds21

for n in [100, 10_000, 1_000_000]
    println("n=$n")

    for f ∈ (serial1!, parallel1!)
        print(string(f), " ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, $n, 3); s = rand($n, 3); h = rand($n, 3))
    end

    for f ∈ (serial2!, parallel2!)
        print(string(f), " ")
        @btime $f(ds, s, h) setup=(ds = Matrix{Float64}(undef, 3, $n); s = rand(3, $n); h = rand(3, $n))
    end

    for f ∈ (serialSA!, parallelSA!)
        print(string(f))
        @btime $f(ds, s, h) setup=(ds = rand(SVector{3, Float64}, $n); s = rand(SVector{3, Float64}, $n); h = rand(SVector{3, Float64}, $n))
    end
end
``````

I see now

``````
n=100
serial1!   279.452 ns (0 allocations: 0 bytes)
parallel1!   148.317 ns (0 allocations: 0 bytes)
serial2!   257.307 ns (0 allocations: 0 bytes)
parallel2!   481.026 ns (0 allocations: 0 bytes)
serialSA!  147.768 ns (0 allocations: 0 bytes)
parallelSA!  2.712 µs (31 allocations: 3.38 KiB)
n=10000
serial1!   27.200 µs (0 allocations: 0 bytes)
parallel1!   22.500 µs (0 allocations: 0 bytes)
serial2!   26.900 µs (0 allocations: 0 bytes)
parallel2!   54.600 µs (0 allocations: 0 bytes)
serialSA!  16.600 µs (0 allocations: 0 bytes)
parallelSA!  16.800 µs (30 allocations: 3.34 KiB)
n=1000000
serial1!   6.023 ms (0 allocations: 0 bytes)
parallel1!   3.329 ms (44 allocations: 3.03 KiB)
serial2!   6.669 ms (0 allocations: 0 bytes)
parallel2!   3.521 ms (44 allocations: 3.03 KiB)
serialSA!  4.624 ms (0 allocations: 0 bytes)
parallelSA!  3.197 ms (31 allocations: 3.38 KiB)
``````

so the speedup is more like 2x for 6 cores on my machine.

I observe a 2x speedup on my 8-core machine as well.

``````
n=100
serial1!   138.134 ns (0 allocations: 0 bytes)
tturbo1!   30.236 ns (0 allocations: 0 bytes)
parallel1!   126.870 ns (0 allocations: 0 bytes)
serial2!   140.211 ns (0 allocations: 0 bytes)
tturbo2!   288.726 ns (0 allocations: 0 bytes)
parallel2!   353.702 ns (0 allocations: 0 bytes)
serialSA!  57.311 ns (0 allocations: 0 bytes)
parallelSA!  5.393 µs (49 allocations: 4.37 KiB)
n=10000
serial1!   17.153 µs (0 allocations: 0 bytes)
tturbo1!   2.922 µs (0 allocations: 0 bytes)
parallel1!   18.699 µs (0 allocations: 0 bytes)
serial2!   20.470 µs (0 allocations: 0 bytes)
tturbo2!   4.430 µs (0 allocations: 0 bytes)
parallel2!   42.604 µs (0 allocations: 0 bytes)
serialSA!  5.888 µs (0 allocations: 0 bytes)
parallelSA!  26.970 µs (49 allocations: 4.36 KiB)
n=1000000
serial1!   2.561 ms (0 allocations: 0 bytes)
tturbo1!   1.786 ms (0 allocations: 0 bytes)
parallel1!   1.735 ms (101 allocations: 6.19 KiB)
serial2!   3.352 ms (0 allocations: 0 bytes)
tturbo2!   1.764 ms (0 allocations: 0 bytes)
parallel2!   1.857 ms (102 allocations: 6.22 KiB)
serialSA!  3.064 ms (0 allocations: 0 bytes)
parallelSA!  1.635 ms (50 allocations: 4.39 KiB)
``````

I added `@inbounds @fastmath` to the serial versions. `tturbo1!` and `tturbo2!` are the serial versions, but with `@tturbo`.
This is on a laptop with 4 cores/8 threads.
I'll try on an 18-core desktop in a few minutes, when it is done running something else.


@Elrod What is the difference between `tturbo1!` and `tturbo2!`?

I see, so `@tturbo` is much better at small array sizes, while `@tullio` catches up for larger arrays.

I defined (in addition to reshape versions of `serial1!` and `parallel1!`)

``````
function tturbo!(ds, s, h)
    n = prod(size(ds)[1:end-1])
    dsr = reshape(ds, n, 3)
    hr = reshape(h, n, 3)
    sr = reshape(s, n, 3)
    @tturbo for ci in 1:size(dsr, 1)
        dsr[ci, 1] = sr[ci, 2]*hr[ci, 3] - sr[ci, 3]*hr[ci, 2]
        dsr[ci, 2] = sr[ci, 3]*hr[ci, 1] - sr[ci, 1]*hr[ci, 3]
        dsr[ci, 3] = sr[ci, 1]*hr[ci, 2] - sr[ci, 2]*hr[ci, 1]
    end
end
``````

Now, if I do

``````
h = randn(63,63,3); s = randn(63,63,3); ds = similar(s)

@btime serial1!(ds, s, h)
@btime parallel1!(ds, s, h)
@btime tturbo!(ds, s, h)
``````

``````
 12.082 µs
  5.238 µs
865.490 ns
``````

But for `h = randn(1000,1000,3); s = randn(1000,1000,3); ds = similar(s)` I get

``````
3.866 ms
1.951 ms
2.896 ms
``````

So, on my machine with 8 cores, `@tturbo` shreds everything for small arrays but loses to `@tullio` for large ones.

Probably memory layout?

Yes, the memory layout is different and will make performance worse, but apparently by not nearly as much as LoopVectorization's threading heuristics think.
Try removing the `t` (just `@turbo`), and you'll notice it performs much better. Apparently LV has decided to start threading too soon.

``````
julia> ds = Matrix{Float64}(undef, 3, n); s = rand(3, n); h = rand(3, n);

julia> @benchmark turbo2!($ds, $s, $h)
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  125.242 ns … 320.422 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.955 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   130.111 ns ±   4.505 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  125 ns           Histogram: frequency by time          141 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark tturbo2!($ds, $s, $h)
BenchmarkTools.Trial: 10000 samples with 276 evaluations.
 Range (min … max):  289.217 ns … 734.725 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     301.228 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.903 ns ±   7.861 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  289 ns           Histogram: frequency by time          326 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

I'm not sure why Tullio catches up for larger sizes.
LV will limit itself to at most 1 thread per core, and it also ramps thread usage up fairly slowly.
It may be a good idea to monitor `htop` while running the benchmarks.