# How to best parallelize matrix operations?

I need to do some operations on 3-dimensional matrices and would like to explore getting some speed ups with parallelization. I have a version working using `map` and `@spawn` as suggestetd here, and I get the right values, but the wrong output type. Here is a MWE:

``````using Threads: nthreads, @spawn

@views some_func(x; a=0.3, b=0.4, c=0.3, d =0.5) = @. a*x[:, :, 1] + b*x[:, : 2] + c*x[:, :,3] + d

chunks = Iterators.partition(axes(x, 2), nchunks)
@spawn some_func(x[:, chunk, :])
end
return chunk_results
end

input = rand(1000, 1000, 3)

some_func(x)  # returns Matrix
some_func_spawn(x)  # returns a Vector of Matrix
``````

On my computer, the parallel version is about 2x faster, which is great, but I’m not sure what to about about the `Vector{Matrix}` result. So two questions: Is there a better way to parallelize this operation? If not, what’s the best way to collect the results into a single 1000x1000 matrix, instead of a `Vector{Matrix}`?

After giving this some thought, and incorporating some suggestions from this suggestion on my other thread, I’ve come up with this:

``````using BenchmarkTools

@views function imsum!(x, y, N::Val)
t = ntuple(N) do i
x[:,:,i]
end
@. y = (+)(t...)
return y
end

function imsum(x::Array{T, 3}) where T
y = Array{T, 2}(undef, size(x)[1:2])
return imsum!(x, y, Val(size(x, 3)))
end

@views function imsum_spawn(x::Array{T, 3}; chunksize=256) where T
chunks = Iterators.partition(axes(x, 2), chunksize)
output = Array{T, 2}(undef, size(x)[1:2])
ch = Val(size(x, 3))
@sync for chunk in chunks
@spawn imsum!(x[:, chunk, :], output[:, chunk], ch)
end
return output
end

a = rand(1000, 4000, 3);
sum(imsum(a) .!= imsum_spawn(a))  # should be 0
@benchmark imsum(\$a)  # 23.7 ms
@benchmark imsum_spawn(\$a)  #6.4 ms
``````

The function `imsum_spawn` is the fastest I’ve been able to get with this code, but I’m still not sure if I’m doing the parallelization in the optimum way. In my original post, changing the return value to `reduce(hcat, chunk_results)` solved the problem, but was also slower because of additional memory allocations.

So my question now is, can I do anything else to speed up `imsum_spawn` or more efficiently parralize it?