Improving the performance of a sum over a vector of vectors

I want to compute a sum over a vector of vectors. Consider the following benchmarks for a problem involving a sum over some data (computing a Gaussian log-likelihood):

using BenchmarkTools, Random, LoopVectorization
Random.seed!(292991)
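# Gaussian log-likelihood with the data in flat vectors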
function vector_sum(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(x))
    @turbo for i ∈ eachindex(x, μ)
        s += (x[i] - μ[i])^2
    end
    ℓ -= 0.5s / σ^2
end
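# same likelihood, but with the data split across a vector of vectors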
function vector_of_vector_sum(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(eltype(x)))
    for i ∈ eachindex(x, μ)
        _x, _μ = x[i], μ[i]
        @turbo for j ∈ eachindex(_x, _μ)
            s += (_x[j] - _μ[j])^2
        end
    end
    ℓ -= 0.5s / σ^2
end
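# flatten first (allocating on every call), then reuse the flat method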
function conv_to_vector_sum(x, μ, σ, n)
    vector_sum(reduce(vcat, x), reduce(vcat, μ), σ, n)
end
n₁, n₂ = 130, 9 
n = n₁ * n₂
x = [rand(n₁) for _ in 1:n₂]
flat_x = reduce(vcat, x)
μ = [rand(n₁) for _ in 1:n₂]
flat_μ = reduce(vcat, μ)
σ = 0.1 
res1 = @benchmark vector_sum($flat_x, $flat_μ, $σ, $n)
res2 = @benchmark vector_of_vector_sum($x, $μ, $σ, $n)
res3 = @benchmark conv_to_vector_sum($x, $μ, $σ, $n)
julia> res1
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  94.555 ns … 321.885 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     95.812 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   99.662 ns ±  14.266 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▆▅▄▂                                                       ▁
  ███████▇▆▆▅▆▇▆▅▅▅▄▅▄▆▄▅▄▆▆▆▆▆▄▅▅▆▆▆▅▄▅▅▅▄▃▅▄▆▅▆▆▅▆▅▄▄▄▃▄▅▅▄▄ █
  94.6 ns       Histogram: log(frequency) by time       171 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> res2
BenchmarkTools.Trial: 10000 samples with 796 evaluations.
 Range (min … max):  154.648 ns … 486.432 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     162.186 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   171.037 ns ±  27.694 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆▆█▇▅▄▂▁▃▂▁                                      ▁▂▁         ▂
  █████████████▇▆▇▆▆▆▅▄▆▆▅▆▆▆▇▅▆▅▅▅▅▄▂▃▅▅▂▅▅▄▅▄▆▇▆▇██████▇██▇▆▅ █
  155 ns        Histogram: log(frequency) by time        272 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> res3
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.530 μs … 251.670 μs  ┊ GC (min … max):  0.00% … 95.73%
 Time  (median):     2.120 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.861 μs ±  10.145 μs  ┊ GC (mean ± σ):  18.29% ±  5.15%

  ▂▇█▇██▇▅▃▂▁                                                 ▂
  ████████████▇▇█▇▇▇▆▆▆▇▇▆▇▅▇▇▇▇▇▅▆▅▅▅▅▅▄▄▅▄▂▅▅▂▄▃▃▄▃▄▅▆▇█▇▇▇ █
  1.53 μs      Histogram: log(frequency) by time       9.1 μs <

 Memory estimate: 18.88 KiB, allocs estimate: 4.

The method which operates on the flattened vector (res1) runs in about 60% of the time of the one which operates on one vector at a time (res2). The method which flattens on every call and then uses the first method is, unsurprisingly, much slower (res3). How can I get the best performance out of this?

Some extra notes: my data comes in as vectors of vectors, so the first method isn't directly useful for my case, but I figure there must be some way to get closer to its performance without already having the flattened vectors. The inputs x and μ I'm using will always have inner vectors of equal length (i.e. length(x[i]) == length(μ[i]) == length(x[1]) for all i in 1:length(x)), though I can't use a matrix since μ is given to me as a vector of vectors to start with.
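
One idea for getting closer to the first method without paying res3's per-call allocations is to flatten into preallocated buffers and reuse them. This is just a sketch (flatten! and the flat_buf_* names are made up here, not an established API), and whether the extra copies beat the direct vector-of-vectors loop would need benchmarking:

# copy each inner vector into a preallocated flat buffer; no allocation per call
function flatten!(dest, src)
    k = firstindex(dest)
    for v in src
        copyto!(dest, k, v, firstindex(v), length(v))
        k += length(v)
    end
    return dest
end
flat_buf_x = similar(first(x), n)   # allocate once, reuse for every call
flat_buf_μ = similar(first(μ), n)
vector_sum(flatten!(flat_buf_x, x), flatten!(flat_buf_μ, μ), σ, n)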

For the second version I would add @inbounds on the first loop.

Does @turbo work on the outer loop?
You could also try to use MT on the outer loop.


Thanks for your comment. @turbo does not work on the outer loop; it gives an _x not defined error, and I don't know how to get around that. @inbounds didn't change anything here, I think the compiler has already sorted that out.

What is MT? Multithreading? I tried that initially, but I don't think I understand enough about multithreading to actually get it as fast. I think it also interferes with @turbo, but maybe that's my inexperience. Using @tturbo made it slower, too. This is the multithreading version I tried:

function vector_of_vector_sum(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = Threads.Atomic{Float64}(0)
    Threads.@threads for i ∈ eachindex(x, μ)
        _x, _μ = x[i], μ[i]
        for j ∈ eachindex(_x, _μ)
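            # every element update goes through an atomic operation, serializing the loop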
            Threads.atomic_add!(s, (_x[j] - _μ[j])^2)
        end
    end
    ℓ -= 0.5s[] / σ^2
end
julia> res2 = @benchmark vector_of_vector_sum($x, $μ, $σ, $n)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  30.200 μs … 208.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.500 μs               ┊ GC (median):    0.00%        
 Time  (mean ± σ):   70.736 μs ±  15.647 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▂▃▆▆▇█▇▇▅▆▄▃▃▁
  ▁▁▁▁▁▁▁▂▂▃▃▅▆████████████████▇▅▄▄▄▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  30.2 μs         Histogram: frequency by time          139 μs <

 Memory estimate: 5.19 KiB, allocs estimate: 48.
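
For reference, the standard way to avoid the per-element atomic is to give each task its own accumulator and sum the partials at the end. A minimal sketch (the name and chunking are hypothetical, not from the thread); at these sizes the task overhead will still dominate, as noted below:

function vector_of_vector_sum_chunked(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    T = eltype(eltype(x))
    # split the outer range into roughly one chunk per thread
    chunks = Iterators.partition(eachindex(x, μ), cld(length(x), Threads.nthreads()))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            s = zero(T)  # task-local accumulator, so no atomics are needed
            for i in idxs
                _x, _μ = x[i], μ[i]
                @simd for j in eachindex(_x, _μ)
                    s += (_x[j] - _μ[j])^2
                end
            end
            s
        end
    end
    ℓ - 0.5sum(fetch, tasks) / σ^2
end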

OK, I did not look at the actual vector sizes, which are way too small to allow efficient MT.

Also, you have to be careful about timing with BenchmarkTools in this case, because the data fits in cache (hot cache), which may or may not correspond to your use case. If you want to test cold-cache behaviour you will have to use the setup variant.

An essential difference between the vector and vector-of-vectors cases is that the memory layout is not contiguous in the latter case, which probably involves more cache misses.

If it is possible, you may consider transposing the storage from x[i][j] to y[j][i], where y is a vector of StaticVectors of size 9. Doing so, the memory layout will be contiguous.
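
A sketch of that transposed layout (y, μy, and svector_sum are illustrative names, not from the thread) using StaticArrays.jl; a Vector of SVectors stores all its elements contiguously:

using StaticArrays
# transpose x[i][j] (9 vectors of length 130) into 130 SVectors of length 9
y  = [SVector{9}(ntuple(i -> x[i][j], 9)) for j in eachindex(x[1])]
μy = [SVector{9}(ntuple(i -> μ[i][j], 9)) for j in eachindex(μ[1])]
function svector_sum(y, μy, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(eltype(y)))
    @inbounds for j in eachindex(y, μy)
        s += sum(abs2, y[j] - μy[j])  # SVector arithmetic unrolls, no allocation
    end
    ℓ - 0.5s / σ^2
end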


I’d never heard of that BenchmarkTools issue before. Is this what you mean?

res1 = @benchmark vector_sum(flat_x, flat_μ, σ, n) setup=(flat_x = copy($flat_x); flat_μ = copy($flat_μ); σ = $σ; n = $n)
res2 = @benchmark vector_of_vector_sum(x, μ, σ, n) setup=(x = copy($x); μ = copy($μ); σ = $σ; n = $n)
julia> res1
BenchmarkTools.Trial: 10000 samples with 914 evaluations.
 Range (min … max):  117.834 ns …  2.996 μs  ┊ GC (min … max): 0.00% … 95.02%
 Time  (median):     121.882 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   133.761 ns ± 50.992 ns  ┊ GC (mean ± σ):  0.42% ±  1.34%

  ██▆▄▅▅▁▃▂                                         ▁▃ ▂▁▁     ▂
  ██████████▇▇▇▆▆▆▆▇▇▇▇▇▇▆▆▇▆▅▆▆▄▆▆▆▅▆▆▅▅▅▆▅▄▄▄▆▅▄▅▅██████▇█▇▅ █
  118 ns        Histogram: log(frequency) by time       241 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> res2
BenchmarkTools.Trial: 10000 samples with 681 evaluations.
 Range (min … max):  185.903 ns …  6.893 μs  ┊ GC (min … max): 0.00% … 96.55%
 Time  (median):     194.860 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   210.890 ns ± 80.587 ns  ┊ GC (mean ± σ):  0.32% ±  0.97%

  ▄█▇█▅▃                                      ▂▁▁▂             ▂
  ███████▇██▇▇▇▆▇▇█▇███▇▇▇▆▇▆▆▆▇▅▆▆▆▆▆▄▅▆▆▅▅▆▇█████████▇▆▅▅▅▃▅ █
  186 ns        Histogram: log(frequency) by time       379 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

I also tried wrapping the variables in Ref; the first method is still the fastest.

Unfortunately I can’t switch the memory layout like that. (The actual data is going to be coming from a solution to a differential equation indexed at multiple times.) Maybe this is all just cache misses, then.

You should pass evals=1 in order to evaluate the cold-cache performance (i.e. including the time to read the data from RAM). Note that it does not necessarily correspond to your use case (temporal locality).
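
For example, combined with the setup pattern from above (a sketch; fresh copies per sample, one evaluation each):

res1_cold = @benchmark vector_sum(fx, fμ, $σ, $n) setup=(fx = copy($flat_x); fμ = copy($flat_μ)) evals=1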

I've run out of ideas to improve your code… maybe @Elrod has extra suggestions :wink:


Thanks for the benchmark tip: the cold performance of the second case is 3x as slow as the first case! Thanks for the help :slight_smile: Hopefully there are some other good ways to improve.


BTW, do you have any reason to believe that your implementation is not yet optimal (e.g. an implementation in another language)?
On an Intel machine, I would suggest using VTune (from the free oneAPI suite) to instrument the code and compare against hardware limits.

No particular reason - I just assumed that there could be a way to make the vector of vectors case equivalent to the vector case because I thought that the memory layout was contiguous in my case. But that’s wrong, so maybe there’s not as much reason to think that there’s room for improvement now.


These help a little, but not much. (Both versions unroll the outer loop by 4; v3 additionally keeps the accumulator as a SIMD vector across the whole outer loop, using vzero, pick_vector_width, and vsum from VectorizationBase.jl, and only reduces it to a scalar once at the end.)

julia> function vector_of_vector_sum_v2(x, μ, σ, n)
           ℓ = -0.5n * log(2π * σ^2)
           s = zero(eltype(eltype(x)))
           fi = firstindex(x)
           li = lastindex(x)
           @assert (firstindex(μ) == fi) & (lastindex(μ) == li)
           len = li - (fi-1)
           for i = 0:(len>>2)-1
               _x0 = x[4i+fi]
               _μ0 = μ[4i+fi]
               _x1 = x[4i+fi+1]
               _μ1 = μ[4i+fi+1]
               _x2 = x[4i+fi+2]
               _μ2 = μ[4i+fi+2]
               _x3 = x[4i+fi+3]
               _μ3 = μ[4i+fi+3]
               @turbo for j ∈ eachindex(_x0)
                   s += (_x0[j] - _μ0[j])^2 + (_x1[j] - _μ1[j])^2 + (_x2[j] - _μ2[j])^2 + (_x3[j] - _μ3[j])^2
               end
           end
           for i = fi+(len&-4):li
               _x, _μ = x[i], μ[i]
               @turbo for j ∈ eachindex(_x, _μ)
                   s += (_x[j] - _μ[j])^2
               end
           end
           ℓ -= 0.5s / σ^2
       end
vector_of_vector_sum_v2 (generic function with 1 method)

julia> function vector_of_vector_sum_v3(x, μ, σ, n)
           ℓ = -0.5n * log(2π * σ^2)
           T = eltype(eltype(x))
           s = vzero(pick_vector_width(T), T)
           fi = firstindex(x)
           li = lastindex(x)
           @assert (firstindex(μ) == fi) & (lastindex(μ) == li)
           len = li - (fi-1)
           for i = 0:(len>>2)-1
               _x0 = x[4i+fi]
               _μ0 = μ[4i+fi]
               _x1 = x[4i+fi+1]
               _μ1 = μ[4i+fi+1]
               _x2 = x[4i+fi+2]
               _μ2 = μ[4i+fi+2]
               _x3 = x[4i+fi+3]
               _μ3 = μ[4i+fi+3]
               @turbo for j ∈ eachindex(_x0)
                   s += (_x0[j] - _μ0[j])^2 + (_x1[j] - _μ1[j])^2 + (_x2[j] - _μ2[j])^2 + (_x3[j] - _μ3[j])^2
               end
           end
           for i = fi+(len&-4):li
               _x, _μ = x[i], μ[i]
               @turbo for j ∈ eachindex(_x, _μ)
                   s += (_x[j] - _μ[j])^2
               end
           end
           ℓ -= 0.5vsum(s) / σ^2
       end
vector_of_vector_sum_v3 (generic function with 1 method)

julia> @btime vector_of_vector_sum($x, $μ, $σ, $n)
  164.905 ns (0 allocations: 0 bytes)
-8672.369366928524

julia> @btime vector_of_vector_sum_v2($x, $μ, $σ, $n)
  147.849 ns (0 allocations: 0 bytes)
-8672.369366928524

julia> @btime vector_of_vector_sum_v3($x, $μ, $σ, $n)
  138.980 ns (0 allocations: 0 bytes)
-8672.36936692852

julia> @btime vector_sum($flat_x, $flat_μ, $σ, $n)
  83.857 ns (0 allocations: 0 bytes)
-8672.369366928524

vector_sum requires many fewer synchronization steps: each separate @turbo loop has to reduce its SIMD accumulators back down before continuing (v3 sidesteps part of that by carrying a vector accumulator between loops), whereas the single flat loop pays that reduction cost only once.


Amazing! Impressive, thank you.