I am trying to optimize a function which mainly operates on arrays that are (nearly all) fields of a given struct, but its performance is not great. My first idea was to check for type stability with the @code_warntype macro. It shows a lot of red-marked warnings, but I don't really know how to avoid them or how to annotate the types of the variables correctly.
Could anyone please help me optimize the performance of the following function, e.g. how to get rid of the type instabilities?
The original function (I already tried multithreading at some points):
conv_layer is an instance of a struct (see below).
# Computes the derivative of the kernels/weights on the given layer; the results are used to optimize the kernels/weights
function multichannel_conv_gradients(conv_layer)
    # storing all the necessary shapes
    current_batch_size::Int, in_channels::Int, input_height::Int, input_width::Int = size(conv_layer.inputs)
    current_batch_size, out_channels::Int, output_height::Int, output_width::Int = size(conv_layer.outputs)
    kernel_height::Int, kernel_width::Int = size(conv_layer.kernels)[3:4]

    # storing often used data which will be modified
    inputs = conv_layer.inputs_padded
    losses = conv_layer.losses

    # calculating the derivative of the activation function
    if conv_layer.df != 1
        df = conv_layer.df(conv_layer.outputs_no_activation)
    else
        df = ones(current_batch_size, out_channels, output_height, output_width)
    end

    gradients = conv_layer.gradients

    # going through all data in the batch
    for index_batch in 1:current_batch_size
        # going through all out_channels (because each out_channel can be treated separately)
        # for out_channel in 1:out_channels
        Threads.@threads for out_channel in 1:out_channels
            # going through each gradient (i.e. every weight, same shape as kernels)
            for in_channel in 1:in_channels, y_w in 1:kernel_height, x_w in 1:kernel_width
                value = 0.00
                # going through each output (because each weight has an influence on every output)
                # for y_out in 1:output_height, x_out in 1:output_width
                Threads.@threads for y_out in 1:output_height # @inbounds
                    Threads.@threads for x_out in 1:output_width # @inbounds
                        m, n = get_input_position((y_out, x_out), conv_layer.stride)
                        value += inputs[index_batch, in_channel, m + y_w - 1, n + x_w - 1] * losses[index_batch, out_channel, y_out, x_out] * df[index_batch, out_channel, y_out, x_out] # .*
                    end
                end
                gradients[out_channel, in_channel, y_w, x_w] += value
            end
        end
    end

    if current_batch_size != 1
        gradients /= current_batch_size
    end

    return gradients
end
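Just to clarify what the loops are computing: assuming conv_layer.gradients starts at zero, one call accumulates, for each kernel weight,

$$
\text{gradients}[o, i, y_w, x_w] = \frac{1}{B} \sum_{b=1}^{B} \sum_{y_{out}} \sum_{x_{out}} \text{inputs}[b, i, m + y_w - 1, n + x_w - 1] \cdot \text{losses}[b, o, y_{out}, x_{out}] \cdot df[b, o, y_{out}, x_{out}]
$$

where $(m, n) = \texttt{get\_input\_position}((y_{out}, x_{out}), \text{stride})$ and $B$ is the current batch size (the division only happens when $B \neq 1$).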
I think it is also necessary to see the struct (conv_layer), because some of the fields can be nothing or an Array.
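For context, my current guess is that these fields would need concrete type annotations, maybe something like the sketch below (just what I suspect might be needed, I have not verified that it actually removes the red warnings):

inputs::Union{Nothing, Array{Float64, 4}}
losses::Union{Nothing, Array{Float64, 4}}
kernels::Array{Float64, 4}

but I don't know whether that is the right approach.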
The helper function get_input_position and the struct (mutable):
# returns the position in an input matrix given by the position in the output (e.g. useful for conv, pool and different backward passes)
# output_position and stride must be tuples
function get_input_position(output_position::Tuple, stride::Tuple)
    m = output_position[1] + (stride[1] - 1) * (output_position[1] - 1)
    n = output_position[2] + (stride[2] - 1) * (output_position[2] - 1)
    return m, n
end
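Note that this simplifies to m = 1 + stride[1] * (output_position[1] - 1) (and likewise for n), so for example with stride (2, 2):

julia> get_input_position((3, 4), (2, 2))
(5, 7)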
mutable struct Conv
    # characteristics of the layer
    in_channels::Int
    out_channels::Int
    kernel_size::Tuple
    stride::Tuple
    padding::Tuple
    activation_function # can be nothing
    df # derivative of activation function
    # data
    inputs # can be nothing
    inputs_padded # saved for performance optimization
    kernels::Array # weights
    outputs_no_activation # can be nothing
    outputs # can be nothing
    losses # can be nothing
    previous_losses # losses for the previous layer, can be nothing
    gradients::Array # gradients of the kernels/weights
    derivative_cache::Array
    # custom constructor
    function Conv(in_channels::Int, out_channels::Int, kernel_size::Tuple; stride::Tuple=(1, 1), padding::Tuple=(0, 0), activation_function=nothing)
        # setting up the activation function
        if isnothing(activation_function)
            new_activation_function = nothing
            df = 1
            gain = 1
        #=
        else
            new_activation_function = sigmoid
            df = d_sigmoid
            gain = 1
        =#
        end
        # initialize kernels/weights
        kernels_shape = (out_channels, in_channels, kernel_size[1], kernel_size[2])
        kernels = randn(kernels_shape)
        # initialize gradients of kernels/weights
        gradients = zeros(size(kernels))
        # placeholders
        inputs = nothing
        inputs_padded = nothing
        outputs_no_activation = nothing
        outputs = nothing
        losses = nothing
        previous_losses = nothing
        derivative_cache = Array{Any}(undef) ##
        # create new instance/object
        new(in_channels,
            out_channels,
            kernel_size,
            stride,
            padding,
            new_activation_function,
            df,
            inputs,
            inputs_padded,
            kernels,
            outputs_no_activation,
            outputs,
            losses,
            previous_losses,
            gradients,
            derivative_cache ##
        )
    end
end
And finally, the function for benchmarking (run with multiple threads):
If you want, you can run @code_warntype multichannel_conv_gradients(layer) to check for type-related problems.
using BenchmarkTools, InteractiveUtils

function benchmark()
    input = rand(1, 6, 14, 14)
    layer.inputs = input
    layer.inputs_padded = input
    output = rand(1, 16, 10, 10) # normally computed beforehand, but for simplicity just randomly initialized
    layer.outputs_no_activation = output
    layer.outputs = output
    layer.losses = rand(size(output)...) # normally computed beforehand, but for simplicity just randomly initialized
    # println(calculate_output_shape(14, 14, 5, 5, stride=(1, 1), padding=(0, 0)))
    # exit()

    # interesting part
    # @code_warntype multichannel_conv_gradients(layer)
    gradients = @btime multichannel_conv_gradients(layer)
end

layer = Conv(6, 16, (5, 5))
benchmark()
>>> 6.292 ms (1960939 allocations: 39.80 MiB)
I am very sorry that it is so much code in total; please let me know if you have any questions about the code.