Unless the array-of-arrays input (`vv`

) is the hard-requirement, why not write it with more high-level indexing notation like GitHub - mcabbott/Tullio.jl: ⅀?

Having said that, of course, you can use Folds.jl etc. from JuliaFolds for this type of calculation.

On my machine, `K = 100`

takes about 14 μs which is very close to task spawn+wait overhead (a few μs). So, I’ll use `K = 1000`

for clearer demonstration. Here’s the single-thread run-time for me:

```
julia> N,M,K = 16,10,1000;
julia> A = rand(N,M,K);
julia> vv = [rand(M) for k in 1:K];
julia> @btime let A = A, vv = vv, K = K
mapreduce(k->@view(A[:,:,k])*vv[k], +, 1:K)
end;
139.949 μs (2002 allocations: 406.14 KiB)
```

With Folds, I get:

```
julia> Threads.nthreads()
4
julia> @btime let A = A, vv = vv, N = N
Folds.mapreduce(k-> @view(A[:,:,k]) * vv[k], +, 1:K)
end;
46.030 μs (2022 allocations: 408.50 KiB)
```

We get 3.1x speedup which is not perfect for 4 threads but not so bad since it’s pretty close to task spawn/wait overhead.

But, we can do better. Actually, the first thing to do before going multi-threading is to make single-thread code allocate less (and fast).

For example, `+`

on vectors always allocate new one. We can avoid this by doing `a .+= b`

instead:

```
julia> add!(a, b) = a .+= b
add! (generic function with 1 method)
julia> @btime let A = A, vv = vv
mapfoldl(k-> @view(A[:,:,k]) * vv[k], add!, 1:K; init = zeros(N))
end;
117.569 μs (1007 allocations: 203.47 KiB)
```

It’s slightly faster. Note that I switched to `mapfoldl`

since `Base`

's `mapreduce`

does not define behavior for `op`

doing in-place mutation.

However, there are still allocations due to `*`

. We can avoid this by lazily evaluating the second argument:

```
ulia> using LinearAlgebra
julia> add2!(a, b::AbstractArray) = a .+= b
add2! (generic function with 1 method)
julia> add2!(a, (M, b)) = mul!(a, M, b, 1.0, 1.0)
add2! (generic function with 2 methods)
julia> @btime let A = A, vv = vv
mapfoldl(k-> (@view(A[:,:,k]), vv[k]), add2!, 1:K; init = zeros(N))
end;
68.940 μs (7 allocations: 352 bytes)
```

Here, we define `add2!`

which is almost like `add!`

but `add2(a, (M, b))`

is equivalent to `add2(a, M * b)`

. This fuses the addition and the matrix multiplication so that there’s no more big allocations.

Now, finally, let’s try using `Folds.mapreduce`

:

```
julia> using Folds, Transducers # Transducers for OnInit
julia> @btime let A = A, vv = vv, N = N
Folds.mapreduce(add2!, 1:K; init = OnInit(() -> zeros(N))) do k
(@view(A[:, :, k]), vv[k])
end
end;
26.130 μs (32 allocations: 3.41 KiB)
```

We get 2.5x speedup, compared to the equivalent single-thread code. If I bump up `K`

once more:

```
julia> N,M,K = 16,10,10000;
julia> A = rand(N,M,K);
julia> vv = [rand(M) for k in 1:K];
julia> @btime let A = A, vv = vv
mapfoldl(k-> (@view(A[:,:,k]), vv[k]), add2!, 1:K; init = zeros(N))
end;
815.991 μs (7 allocations: 352 bytes)
julia> @btime let A = A, vv = vv, N = N
Folds.mapreduce(add2!, 1:K; init = OnInit(() -> zeros(N))) do k
(@view(A[:, :, k]), vv[k])
end
end;
187.358 μs (31 allocations: 3.38 KiB)
```

We get 4.3x speedup! (which is actually very strange… I’m not sure why it’s larger than `Threads.nthreads()`

).

Note that I can use `add2!`

(or `add!`

) for `Folds.mapreduce`

since `mapreduce`

/`reduce`

in JuliaFolds have clear ownership convention for `op`

; it is allowed to mutate the first argument. See: Folds.jl · Folds

Note also that `OnInit`

is essential for using in-place mutation in multi-threaded reduction. Otherwise, multiple tasks will be mutating the same array concurrently.