Multithreading in many nested loops

I am trying to optimize a function that performs the convolution operation known from neural networks.
I have already optimized it as much as possible on a single thread (checked type stability with @code_warntype etc.), so now I would like to give multithreading a try. However, because of the many nested loops, I am not sure what the best method for parallelization would be here. Do you have any ideas on how to further optimize the function?

A simplified version of my code:

# not really important: maps an output position to the top-left corner of its receptive field in the input
function get_input_position(output_position::Tuple, stride::Tuple)
    m = output_position[1] + (stride[1] - 1) * (output_position[1] - 1)
    n = output_position[2] + (stride[2] - 1) * (output_position[2] - 1)

    return m, n
end

# not really important: standard output-shape formula for a strided, padded convolution
function calculate_output_shape(input_height::Int, input_width::Int, kernel_height::Int, kernel_width::Int; stride::Tuple=(1, 1), padding::Tuple=(0, 0))
    output_height = div(input_height + 2 * padding[1] - kernel_height, stride[1]) + 1
    output_width = div(input_width + 2 * padding[2] - kernel_width, stride[2]) + 1

    return output_height, output_width
end

# the function to optimize
function multichannel_conv(inputs::Array{Float64, 4}, kernels::Array{Float64, 4}; stride::Tuple{Int, Int}=(1, 1), padding::Tuple{Int, Int}=(0, 0))
    # storing all the necessary shapes
    current_batch_size::Int, in_channels::Int, input_height::Int, input_width::Int = size(inputs)
    out_channels::Int, in_channels, kernel_height::Int, kernel_width::Int = size(kernels)
    # calculating shape of output
    output_height::Int, output_width::Int = calculate_output_shape(input_height, input_width, kernel_height, kernel_width, stride=stride, padding=padding)

    output::Array{Float64, 4} = Array{Float64, 4}(undef, current_batch_size, out_channels, output_height, output_width)
    
    # actual computation
    for index_batch in 1:current_batch_size
        for out_channel in 1:out_channels # Threads.@threads
            for y_out in 1:output_height, x_out in 1:output_width
            # Threads.@threads for y_out in 1:output_height
            # Threads.@threads for x_out in 1:output_width
                m, n = get_input_position((y_out, x_out), stride)
                value::Float64 = 0.0
                for in_channel in 1:in_channels

                    for y_w in 1:kernel_height, x_w in 1:kernel_width
                    # Threads.@threads for y_w in 1:kernel_height
                    # Threads.@threads for x_w in 1:kernel_width
                        y_in::Int = m + y_w - 1
                        x_in::Int = n + x_w - 1
                        value += inputs[index_batch, in_channel, y_in, x_in] * kernels[out_channel, in_channel, y_w, x_w]
                    # end
                    end
                end
                output[index_batch, out_channel, y_out, x_out] = value
            end
            # end
        end
    end

    return output
end

Just for measuring runtime:

using BenchmarkTools
output = @btime multichannel_conv(rand(64, 20, 150, 150), rand(40, 20, 3, 3))
>>> 37.964 s (7 allocations: 647.59 MiB)

For multithreading I would try to group as much work together as possible, into the largest blocks possible, while still making sure there is enough work for all of your threads. That means parallelising over as few loops as possible, and over the outer ones, so that each thread walks through a contiguous region of the array instead of jumping around all over the place. Keeping each thread's memory accesses in order takes advantage of cache locality and should help performance.
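
For your code, that points at threading only the outermost (batch) loop. Here is a minimal sketch of that idea, reusing your helpers from above (multichannel_conv_threaded is just a name I picked):

# same algorithm as multichannel_conv, but with Threads.@threads on the
# outermost (batch) loop; each thread writes to disjoint slices of output,
# so no synchronization is needed
function multichannel_conv_threaded(inputs::Array{Float64, 4}, kernels::Array{Float64, 4}; stride::Tuple{Int, Int}=(1, 1), padding::Tuple{Int, Int}=(0, 0))
    current_batch_size, in_channels, input_height, input_width = size(inputs)
    out_channels, _, kernel_height, kernel_width = size(kernels)
    output_height, output_width = calculate_output_shape(input_height, input_width, kernel_height, kernel_width, stride=stride, padding=padding)
    output = Array{Float64, 4}(undef, current_batch_size, out_channels, output_height, output_width)

    Threads.@threads for index_batch in 1:current_batch_size
        for out_channel in 1:out_channels, y_out in 1:output_height, x_out in 1:output_width
            m, n = get_input_position((y_out, x_out), stride)
            value = 0.0
            for in_channel in 1:in_channels, y_w in 1:kernel_height, x_w in 1:kernel_width
                value += inputs[index_batch, in_channel, m + y_w - 1, n + x_w - 1] * kernels[out_channel, in_channel, y_w, x_w]
            end
            output[index_batch, out_channel, y_out, x_out] = value
        end
    end

    return output
end

Remember that Threads.@threads only does something if Julia was started with multiple threads, e.g. julia --threads 8.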

You can also combine your nested for loops into a single one using Iterators.product:

using Base.Iterators: product

for (a, b, c) in product(iterA, iterB, iterC)
    ...
end

# which visits the same (a, b, c) combinations as
# (note: product varies its first argument fastest, so iterA ends up innermost)
for c in iterC
    for b in iterB
        for a in iterA
            ... etc.

You can then stick a Threads.@threads on that single combined loop and parallelise over it. One caveat: Threads.@threads needs an indexable iteration space, so you have to collect the lazy product first (or iterate over CartesianIndices instead). I often find the above very helpful in these situations.
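
For example, a self-contained toy sketch (not your conv code; the names H, W and out are made up):

using Base.Iterators: product

H, W = 4, 3
out = zeros(H, W)
# collect turns the lazy product into an indexable Matrix of tuples,
# which is what Threads.@threads requires
Threads.@threads for pos in collect(product(1:H, 1:W))
    y, x = pos
    out[y, x] = y * 10 + x  # each (y, x) is written exactly once, so no data race
end

An allocation-free alternative is to loop over CartesianIndices((H, W)) instead of collecting the product.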

The most important thing is to benchmark threaded vs. serial performance relative to the size of your problem. Ideally, you want to be as close as possible to an n-fold speedup, where n is the number of threads.
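
For example (assuming Julia was started with several threads, e.g. julia --threads 8, and using the multichannel_conv_threaded sketch from above):

using BenchmarkTools

inputs = rand(64, 20, 150, 150)
kernels = rand(40, 20, 3, 3)

println("running on ", Threads.nthreads(), " threads")
@btime multichannel_conv($inputs, $kernels)           # serial baseline
@btime multichannel_conv_threaded($inputs, $kernels)  # ideally ~nthreads() times faster

The $ interpolation in @btime makes BenchmarkTools treat the arrays as local arguments rather than non-constant globals, so you measure only the convolution itself.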