# Multithreading in many nested loops

I am trying to optimize a function that performs a convolution operation as known from neural networks.
I have already optimized the function as much as possible on a single thread (e.g. checked type stability with `@code_warntype`), and now I would like to give multithreading a try. However, because of the many nested loops, I am not sure what the best way to parallelize it is. Do you have any ideas on how to further optimize this function?

A simplified version of my code:

```julia
# not really important
function get_input_position(output_position::Tuple, stride::Tuple)
    m = output_position[1] + (stride[1] - 1) * (output_position[1] - 1)
    n = output_position[2] + (stride[2] - 1) * (output_position[2] - 1)

    return m, n
end

# not really important
function calculate_output_shape(input_height::Int, input_width::Int, kernel_height::Int, kernel_width::Int; stride::Tuple=(1, 1), padding::Tuple=(0, 0))
    output_height = div(input_height + 2 * padding[1] - kernel_height, stride[1]) + 1
    output_width = div(input_width + 2 * padding[2] - kernel_width, stride[2]) + 1

    return output_height, output_width
end

# the function to optimize
function multichannel_conv(inputs::Array{Float64, 4}, kernels::Array{Float64, 4}; stride::Tuple{Int, Int}=(1, 1), padding::Tuple{Int, Int}=(0, 0))
    # storing all the necessary shapes
    current_batch_size, in_channels, input_height, input_width = size(inputs)
    out_channels, in_channels, kernel_height, kernel_width = size(kernels)
    # calculating the shape of the output
    output_height, output_width = calculate_output_shape(input_height, input_width, kernel_height, kernel_width, stride=stride, padding=padding)

    output = Array{Float64, 4}(undef, current_batch_size, out_channels, output_height, output_width)

    # actual computation
    for index_batch in 1:current_batch_size
        for out_channel in 1:out_channels
            for y_out in 1:output_height, x_out in 1:output_width
                m, n = get_input_position((y_out, x_out), stride)
                value = 0.0
                for in_channel in 1:in_channels
                    for y_w in 1:kernel_height, x_w in 1:kernel_width
                        y_in = m + y_w - 1
                        x_in = n + x_w - 1
                        value += inputs[index_batch, in_channel, y_in, x_in] * kernels[out_channel, in_channel, y_w, x_w]
                    end
                end
                output[index_batch, out_channel, y_out, x_out] = value
            end
        end
    end

    return output
end
``````

Just for measuring runtime:

```julia
using BenchmarkTools
output = @btime multichannel_conv(rand(64, 20, 150, 150), rand(40, 20, 3, 3))
# 37.964 s (7 allocations: 647.59 MiB)
```

For multithreading, I would try to group as much work as possible into the largest blocks you can, while still making sure there is enough work for all your threads. So parallelize over as few loops as possible (and the outer ones) to take advantage of cache locality instead of jumping around all over memory. Basically, keeping each thread accessing the array's memory in order should help performance.
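As a sketch of that idea: you could thread only the outermost batch loop, so each thread writes to a disjoint slice of `output` and no locking is needed. The function name `multichannel_conv_threaded` is my own, and this simplified variant assumes no padding and the same array layout as in your question:

```julia
# Threaded variant (sketch): only the outermost (batch) loop is parallelized.
# Each thread works on its own batch indices, so writes to `output` never
# overlap and no synchronization is needed. Padding is omitted for brevity.
function multichannel_conv_threaded(inputs, kernels; stride=(1, 1))
    current_batch_size, in_channels, input_height, input_width = size(inputs)
    out_channels, _, kernel_height, kernel_width = size(kernels)
    output_height = div(input_height - kernel_height, stride[1]) + 1
    output_width = div(input_width - kernel_width, stride[2]) + 1
    output = zeros(current_batch_size, out_channels, output_height, output_width)

    Threads.@threads for index_batch in 1:current_batch_size
        for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
            # top-left corner of the receptive field in the input
            m = (y_out - 1) * stride[1] + 1
            n = (x_out - 1) * stride[2] + 1
            value = 0.0
            for in_channel in 1:in_channels, y_w in 1:kernel_height, x_w in 1:kernel_width
                value += inputs[index_batch, in_channel, m + y_w - 1, n + x_w - 1] *
                         kernels[out_channel, in_channel, y_w, x_w]
            end
            output[index_batch, out_channel, y_out, x_out] = value
        end
    end
    return output
end
```

With a batch size of 64, this gives the scheduler plenty of equally sized chunks to distribute, while each thread still walks its part of the arrays in order.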

You can also combine your nested `for` loops into a single one using `Iterators.product`:

```julia
# `product` lives in Base.Iterators, not in a package called `Iterators`
using Base.Iterators: product

# note the parentheses: the loop variable is a tuple
for (a, b, c) in product(iterA, iterB, iterC)
    # ...
end

# this visits the same index combinations as the nested loops below
# (`product` varies its first argument fastest, hence `a` innermost)
for c in iterC
    for b in iterB
        for a in iterA
            # ...
        end
    end
end
```

You can then stick a `Threads.@threads` in front of that single combined loop and parallelize over it (note that `Threads.@threads` needs an indexable iteration space, so you may need to wrap the product in `collect`). I often find this very helpful in these situations.
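A minimal, self-contained sketch of that pattern (the `results` array and the ranges are just placeholders):

```julia
using Base.Iterators: product

results = zeros(4, 3)

# vec(collect(...)) materializes the index tuples into a Vector,
# which @threads can index into and split across threads
Threads.@threads for idx in vec(collect(product(1:4, 1:3)))
    i, j = idx
    results[i, j] = i * j
end
```

Since every `(i, j)` pair appears exactly once, each iteration writes to a distinct cell of `results` and no synchronization is needed.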

The most important thing is to benchmark threaded vs. serial performance relative to the size of your problem. Ideally, you want to be as close as possible to an n-fold speedup, where n is the number of threads.
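For example, assuming you end up with a threaded variant (here called `multichannel_conv_threaded`, a hypothetical name), you could measure the speedup like this; remember to start Julia with several threads, e.g. `julia -t 4`:

```julia
using BenchmarkTools

# smaller inputs than in the question, so the comparison runs quickly
inputs = rand(8, 4, 32, 32)
kernels = rand(6, 4, 3, 3)

# `multichannel_conv_threaded` is assumed to be your threaded version
t_serial   = @belapsed multichannel_conv($inputs, $kernels)
t_threaded = @belapsed multichannel_conv_threaded($inputs, $kernels)

println("threads: ", Threads.nthreads(),
        ", speedup: ", round(t_serial / t_threaded, digits=2))
```

Sweep the batch size and image size and watch how the speedup behaves; if it collapses for small problems, the per-iteration work is too small relative to the threading overhead.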