# How to improve the performance of this multiplication -> sum

I am trying hard to improve the performance of this function, which just sums a series of multiplications, but where all the data is given in terms of indices:

This is my data:

```julia
using BenchmarkTools, LoopVectorization
x          = [1.0 2.5; 3 4]
y          = zeros(2,3)
w          = [-0.55248;-0.229514;;0.292818;-0.792014;;; -0.107403;-0.0289423;;0.457216;0.735226;;;0.0260589;-0.008588;;0.348731;0.335272;;;]
y_to_x_ids = [ [[(1, 1), (1, 2)]] [[(1, 1), (1, 2)]] [[(1, 1), (1, 2)]]; [[(1, 1), (2, 1), (1, 2), (2, 2)]]  [[(1, 1), (2, 1), (1, 2), (2, 2)]]  [[(1, 1), (2, 1), (1, 2), (2, 2)]] ]
y_to_w_ids = [ [[(2, 1, 1), (2, 2, 1)]] [[(2, 1, 2), (2, 2, 2)]] [[(2, 1, 3), (2, 2, 3)]]; [[(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1)]] [[(1, 1, 2), (2, 1, 2), (1, 2, 2), (2, 2, 2)]] [[(1, 1, 3), (2, 1, 3), (1, 2, 3), (2, 2, 3)]]]
```

And these are the functions I have already attempted. Note that I get an error if I try to apply `@turbo` or `@simd` to the first 3 functions…

```julia
function _zComp1!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        yi = 0.0
        n = length(y_to_x_ids[y_idx_tuple...])
        @inbounds for idx in 1:n
            yi += x[y_to_x_ids[y_idx_tuple...][idx]...] * w[y_to_w_ids[y_idx_tuple...][idx]...]
        end
        y[y_idx_tuple...] = yi
    end
end

function _zComp2!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx_tuple...],y_to_w_ids[y_idx_tuple...])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx_tuple...] = yi
    end
end

function _zComp3!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        y[y_idx_tuple...] = @inbounds sum(x[x_id...] * w[w_id...] for (x_id,w_id) in zip(y_to_x_ids[y_idx_tuple...],y_to_w_ids[y_idx_tuple...]))
    end
end

function _zComp4!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        xv = [x[x_id...] for x_id in y_to_x_ids[y_idx_tuple...]]
        wv = [w[w_id...] for w_id in y_to_w_ids[y_idx_tuple...]]
        N  = length(xv)
        yi = 0.0
        @turbo for n in 1:N
            yi += xv[n] * wv[n]
        end
        y[y_idx_tuple...] = yi
    end
end

# From @abraemer's comment:
function _zComp5!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in eachindex(y)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx], y_to_w_ids[y_idx])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx] = yi # this was not guaranteed to be inbounds before
    end
end
```

And here are their benchmarks… Is it possible to do better?

```julia
_zComp1!(y,y_to_x_ids,y_to_w_ids,w,x) # [-2.20955  1.80912  0.829592; -3.67703  3.88971  2.21321]
@btime _zComp1!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  74.180 ns ( 0 allocations: 0 bytes)
@btime _zComp2!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  73.848 ns ( 0 allocations: 0 bytes)
@btime _zComp3!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) # 113.619 ns ( 0 allocations: 0 bytes)
@btime _zComp4!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) # 504.446 ns (12 allocations: 1.03 KiB)
@btime _zComp5!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  60.256 ns ( 0 allocations: 0 bytes)
```

The implementation looks pretty fast already (tens of nanoseconds, no allocations). I assume this is a hot function called many times. Do some of the parameters stay constant across many invocations? Specifically, do `y_to_x_ids` or `y_to_w_ids` stay constant? Do they have constant sizes? Do the tuple lists have constant lengths? Are `w` or `x` constant across many calls?
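If the index maps really are constant, one option worth trying (a sketch under that assumption; `precompute_linear` and `_zComp_lin!` are hypothetical names, not from the thread) is to convert the tuple indices into plain linear indices once, so the hot loop does integer indexing with no splatting:

```julia
# One-time setup: map each tuple index into a linear index of array A.
function precompute_linear(ids, A)
    lin = LinearIndices(A)
    return [[lin[t...] for t in ids[k]] for k in eachindex(ids)]
end

# Hot loop: only integer indexing, no tuple splatting.
function _zComp_lin!(y, x_lin_ids, w_lin_ids, w, x)
    @inbounds for y_idx in eachindex(y)
        yi = 0.0
        for (xi, wi) in zip(x_lin_ids[y_idx], w_lin_ids[y_idx])
            yi += x[xi] * w[wi]
        end
        y[y_idx] = yi
    end
    return y
end
```

The precomputation would run once (e.g. at layer construction), and only `_zComp_lin!` sits on the hot path.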


A comment more on style, I guess:
You can use a `CartesianIndex` directly with an array; there is no need to convert it to a `Tuple` and then splat. If you want to use the `CartesianIndex` like a tuple, use its `.I` property.
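To illustrate (a standalone snippet, not code from the thread):

```julia
A = zeros(2, 3)
for idx in CartesianIndices(A)
    A[idx] = sum(idx.I)   # idx indexes A directly; idx.I is the underlying tuple
end
A[CartesianIndex(2, 3)]   # same element as A[2, 3]
```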

Ah, I found something in the 4th function.

You could just use `dot` (from the `LinearAlgebra` standard library) instead:

```julia
y[y_idx] = dot(xv, wv)
```

Sorry I just realized that all of these functions achieve the same computation and are just variants you tried. Reading comprehension seems not to be my strong suit today.

Anyway, is there a reason you use `CartesianIndices`? You might as well use `eachindex` instead. This actually allows Julia to infer whether an array access is in bounds, and gives a bit of free, safe speedup in this case:

```julia
function _zComp2!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in eachindex(y)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx], y_to_w_ids[y_idx])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx] = yi # this was not guaranteed to be inbounds before
    end
end
```

This was also the fastest function I managed to get without further information about the problem.


Yes, `y_to_x_ids` and `y_to_w_ids` are constant; they are "precomputed", and they represent the indices of a convolution given specific input and kernel sizes, padding, stride…
In my application these ids are properties of the `layer` struct.
Each element of these arrays collects the ids of X and W required to compute that specific Y. They are arrays of tuples, as one Y could require a variable number of X and W values. The elements of these inner arrays are tuples that constitute the indices into X and into W.
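Since the maps live in the layer struct anyway, another option (a sketch; `to_cart` and `_zComp_cart!` are hypothetical names) would be to store them once as vectors of `CartesianIndex`, which keeps the tuple-shaped structure but indexes without splatting:

```julia
# One-time conversion: each inner vector of tuples becomes a vector of CartesianIndex.
to_cart(ids) = map.(CartesianIndex, ids)

# Hot loop: a CartesianIndex indexes the array directly, no splatting needed.
function _zComp_cart!(y, x_cids, w_cids, w, x)
    @inbounds for k in eachindex(y)
        yi = 0.0
        for (ci, wi) in zip(x_cids[k], w_cids[k])
            yi += x[ci] * w[wi]
        end
        y[k] = yi
    end
    return y
end
```

The conversion could happen in the layer constructor, so only `_zComp_cart!` runs per call.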


Thank you, I have added it to the benchmark…