I want to compute a sum over a vector of vectors. Consider the following benchmarks for a problem involving summing over some data (computing a log-likelihood):

```
using BenchmarkTools, Random, LoopVectorization
Random.seed!(292991)

# Sum of squared differences over a single flat vector.
function vector_sum(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(x))
    @turbo for i ∈ eachindex(x, μ)
        s += (x[i] - μ[i])^2
    end
    return ℓ - 0.5s / σ^2
end

# Same computation, iterating over the inner vectors one at a time.
function vector_of_vector_sum(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(eltype(x)))
    for i ∈ eachindex(x, μ)
        _x, _μ = x[i], μ[i]
        @turbo for j ∈ eachindex(_x, _μ)
            s += (_x[j] - _μ[j])^2
        end
    end
    return ℓ - 0.5s / σ^2
end

# Flatten on every call, then use the flat method.
function conv_to_vector_sum(x, μ, σ, n)
    vector_sum(reduce(vcat, x), reduce(vcat, μ), σ, n)
end

n₁, n₂ = 130, 9
n = n₁ * n₂
x = [rand(n₁) for _ in 1:n₂]
flat_x = reduce(vcat, x)
μ = [rand(n₁) for _ in 1:n₂]
flat_μ = reduce(vcat, μ)
σ = 0.1
res1 = @benchmark vector_sum($flat_x, $flat_μ, $σ, $n)
res2 = @benchmark vector_of_vector_sum($x, $μ, $σ, $n)
res3 = @benchmark conv_to_vector_sum($x, $μ, $σ, $n)
```

```
julia> res1
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
Range (min … max): 94.555 ns … 321.885 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 95.812 ns ┊ GC (median): 0.00%
Time (mean ± σ): 99.662 ns ± 14.266 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▆▅▄▂ ▁
███████▇▆▆▅▆▇▆▅▅▅▄▅▄▆▄▅▄▆▆▆▆▆▄▅▅▆▆▆▅▄▅▅▅▄▃▅▄▆▅▆▆▅▆▅▄▄▄▃▄▅▅▄▄ █
94.6 ns Histogram: log(frequency) by time 171 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

julia> res2
BenchmarkTools.Trial: 10000 samples with 796 evaluations.
Range (min … max): 154.648 ns … 486.432 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 162.186 ns ┊ GC (median): 0.00%
Time (mean ± σ): 171.037 ns ± 27.694 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▆█▇▅▄▂▁▃▂▁ ▁▂▁ ▂
█████████████▇▆▇▆▆▆▅▄▆▆▅▆▆▆▇▅▆▅▅▅▅▄▂▃▅▅▂▅▅▄▅▄▆▇▆▇██████▇██▇▆▅ █
155 ns Histogram: log(frequency) by time 272 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

julia> res3
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.530 μs … 251.670 μs ┊ GC (min … max): 0.00% … 95.73%
Time (median): 2.120 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.861 μs ± 10.145 μs ┊ GC (mean ± σ): 18.29% ± 5.15%
▂▇█▇██▇▅▃▂▁ ▂
████████████▇▇█▇▇▇▆▆▆▇▇▆▇▅▇▇▇▇▇▅▆▅▅▅▅▅▄▄▅▄▂▅▅▂▄▃▃▄▃▄▅▆▇█▇▇▇ █
1.53 μs Histogram: log(frequency) by time 9.1 μs <
Memory estimate: 18.88 KiB, allocs estimate: 4.
```

The method which operates on the flattened vector (`res1`) takes roughly 60% of the time of the one which operates on each inner vector in turn (`res2`), i.e. it is about 1.7× faster. The method which flattens on every call and then uses the first method is, unsurprisingly, much slower (`res3`). **How can I get the best performance out of this?**
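For completeness, one variant that might be worth benchmarking alongside the above puts the inner `@turbo` loop behind a function barrier, so the kernel always receives a concrete, contiguous vector. This is just a sketch (`sq_diff_sum` is a name I made up), and I have not confirmed it beats `vector_of_vector_sum`:

```
# Sketch: hoist the inner reduction into its own function, so @turbo
# specializes on a concrete Vector{Float64} argument.
function sq_diff_sum(x, μ)
    s = zero(promote_type(eltype(x), eltype(μ)))
    @turbo for j ∈ eachindex(x, μ)
        s += (x[j] - μ[j])^2
    end
    return s
end

function vector_of_vector_sum_barrier(x, μ, σ, n)
    ℓ = -0.5n * log(2π * σ^2)
    s = zero(eltype(eltype(x)))
    for i ∈ eachindex(x, μ)
        s += sq_diff_sum(x[i], μ[i])
    end
    return ℓ - 0.5s / σ^2
end
```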

Some extra notes: my data comes in as vectors of vectors, so the first method isn't too useful for my case, but I figure there must be some way to get closer to its performance without already having flattened the vectors. The inputs `x` and `μ` I'm using will always have the same length in the individual vectors (i.e. `length(x[i]) == length(μ[i]) == length(x[1])` for all `i in 1:length(x)`), though I can't use a matrix since `μ` is given to me as a vector of vectors to start with.
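Since the inner lengths never change, another option might be to pay the flattening cost once when the data first arrives and reuse the flat copies for every subsequent likelihood evaluation. A sketch (`FlatData` is a hypothetical wrapper of my own, and it is only valid as long as the inner vectors aren't mutated afterwards):

```
# Hypothetical wrapper storing a flattened copy alongside the original
# vector of vectors, so the fast flat method can be reused.
struct FlatData{T}
    nested::Vector{Vector{T}}
    flat::Vector{T}
end
FlatData(v::Vector{Vector{T}}) where {T} = FlatData(v, reduce(vcat, v))

# Usage: build once, then every evaluation is as cheap as vector_sum.
# fx, fμ = FlatData(x), FlatData(μ)
# vector_sum(fx.flat, fμ.flat, σ, n)
```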