# How to improve the performance of this multiplication -> sum

I am trying hard to improve the performance of this function, which just sums a series of multiplications, but where all the data is given in terms of indices:

This is my data:

```julia
using BenchmarkTools, LoopVectorization
x          = [1.0 2.5; 3 4]
y          = zeros(2,3)
w          = [-0.55248;-0.229514;;0.292818;-0.792014;;; -0.107403;-0.0289423;;0.457216;0.735226;;;0.0260589;-0.008588;;0.348731;0.335272;;;]
y_to_x_ids = [ [[(1, 1), (1, 2)]] [[(1, 1), (1, 2)]] [[(1, 1), (1, 2)]]; [[(1, 1), (2, 1), (1, 2), (2, 2)]]  [[(1, 1), (2, 1), (1, 2), (2, 2)]]  [[(1, 1), (2, 1), (1, 2), (2, 2)]] ]
y_to_w_ids = [ [[(2, 1, 1), (2, 2, 1)]] [[(2, 1, 2), (2, 2, 2)]] [[(2, 1, 3), (2, 2, 3)]]; [[(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1)]] [[(1, 1, 2), (2, 1, 2), (1, 2, 2), (2, 2, 2)]] [[(1, 1, 3), (2, 1, 3), (1, 2, 3), (2, 2, 3)]]]
```

And these are the functions I have already attempted. Note that I get an error if I try to apply `@turbo` or `@simd` to the first 3 functions…

```julia
function _zComp1!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        yi = 0.0
        n = length(y_to_x_ids[y_idx_tuple...])
        @inbounds for idx in 1:n
            yi += x[y_to_x_ids[y_idx_tuple...][idx]...] * w[y_to_w_ids[y_idx_tuple...][idx]...]
        end
        y[y_idx_tuple...] = yi
    end
end

function _zComp2!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx_tuple...],y_to_w_ids[y_idx_tuple...])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx_tuple...] = yi
    end
end

function _zComp3!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        y[y_idx_tuple...] = @inbounds sum(x[x_id...] * w[w_id...] for (x_id,w_id) in zip(y_to_x_ids[y_idx_tuple...],y_to_w_ids[y_idx_tuple...]))
    end
end

function _zComp4!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in CartesianIndices(y)
        y_idx_tuple = Tuple(y_idx)
        xv = [x[x_id...] for x_id in y_to_x_ids[y_idx_tuple...]]
        wv = [w[w_id...] for w_id in y_to_w_ids[y_idx_tuple...]]
        N  = length(xv)
        yi = 0.0
        @turbo for n in 1:N
            yi += xv[n] * wv[n]
        end
        y[y_idx_tuple...] = yi
    end
end

# From @abraemer's comment:
function _zComp5!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in eachindex(y)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx], y_to_w_ids[y_idx])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx] = yi # this was not guaranteed to be inbounds before
    end
end
```

And here are their benchmarks… Is it possible to do better?

```julia
_zComp1!(y,y_to_x_ids,y_to_w_ids,w,x) # [-2.20955  1.80912  0.829592; -3.67703  3.88971  2.21321]
@btime _zComp1!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  74.180 ns ( 0 allocations: 0 bytes)
@btime _zComp2!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  73.848 ns ( 0 allocations: 0 bytes)
@btime _zComp3!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) # 113.619 ns ( 0 allocations: 0 bytes)
@btime _zComp4!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) # 504.446 ns (12 allocations: 1.03 KiB)
@btime _zComp5!($y,$y_to_x_ids,$y_to_w_ids,$w,$x) #  60.256 ns ( 0 allocations: 0 bytes)
```

The implementation looks pretty fast already (tens of nanoseconds, no allocations). I assume this is a hot function called many times. Do some of the parameters stay constant across many invocations? Specifically, do `y_to_x_ids` or `y_to_w_ids` stay constant? Do they have constant sizes? Do the tuple lists have constant lengths? Are `w` or `x` constant across many calls?
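If the index maps really are constant, one option worth trying (a sketch under that assumption; `precompute_linear` and `_zComp_lin!` are hypothetical names, not from the thread) is to convert the tuple indices into plain linear indices once, so the hot loop does integer indexing with no splatting:

```julia
# One-time setup: map each tuple index into a linear index of array A.
function precompute_linear(ids, A)
    lin = LinearIndices(A)
    return [[lin[t...] for t in ids[k]] for k in eachindex(ids)]
end

# Hot loop: only integer indexing, no tuple splatting.
function _zComp_lin!(y, x_lin_ids, w_lin_ids, w, x)
    @inbounds for y_idx in eachindex(y)
        yi = 0.0
        for (xi, wi) in zip(x_lin_ids[y_idx], w_lin_ids[y_idx])
            yi += x[xi] * w[wi]
        end
        y[y_idx] = yi
    end
    return y
end
```

The precomputation would run once (e.g. at layer construction), and only `_zComp_lin!` sits on the hot path.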


A comment more on style, I guess:
You can use a `CartesianIndex` directly with an array; there is no need to convert it to a `Tuple` and then splat. If you want to use the `CartesianIndex` like a tuple, use its `.I` property.
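To illustrate (a standalone snippet, not code from the thread):

```julia
A = zeros(2, 3)
for idx in CartesianIndices(A)
    A[idx] = sum(idx.I)   # idx indexes A directly; idx.I is the underlying tuple
end
A[CartesianIndex(2, 3)]   # same element as A[2, 3]
```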

Ah, I found something in the 4th function.

You could just use `dot` (from the `LinearAlgebra` standard library) instead:

```julia
y[y_idx] = dot(xv, wv)
```

Sorry I just realized that all of these functions achieve the same computation and are just variants you tried. Reading comprehension seems not to be my strong suit today.

Anyway, is there a reason you use `CartesianIndices`? You might as well use `eachindex` instead. This actually allows Julia to infer whether an array access is in bounds, and gives a bit of free, safe speedup in this case:

```julia
function _zComp2!(y,y_to_x_ids,y_to_w_ids,w,x)
    for y_idx in eachindex(y)
        yi = 0.0
        @inbounds for (x_id,w_id) in zip(y_to_x_ids[y_idx], y_to_w_ids[y_idx])
            yi += x[x_id...] * w[w_id...]
        end
        y[y_idx] = yi # this was not guaranteed to be inbounds before
    end
end
```

This was also the fastest function I managed to get without further information about the problem.


Yes, `y_to_x_ids` and `y_to_w_ids` are constant; they are "precomputed", and they represent the indices of a convolution given specific input and kernel sizes, padding, stride…
In my application these ids are properties of the `layer` struct.
Each element of these arrays collects the ids of X and W required to compute that specific Y. They are arrays of tuples, as one Y could require a variable number of X and W values. The elements of these inner arrays are tuples that constitute the indices into X and into W.
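Since the maps live in the layer struct anyway, another option (a sketch; `to_cart` and `_zComp_cart!` are hypothetical names) would be to store them once as vectors of `CartesianIndex`, which keeps the tuple-shaped structure but indexes without splatting:

```julia
# One-time conversion: each inner vector of tuples becomes a vector of CartesianIndex.
to_cart(ids) = map.(CartesianIndex, ids)

# Hot loop: a CartesianIndex indexes the array directly, no splatting needed.
function _zComp_cart!(y, x_cids, w_cids, w, x)
    @inbounds for k in eachindex(y)
        yi = 0.0
        for (ci, wi) in zip(x_cids[k], w_cids[k])
            yi += x[ci] * w[wi]
        end
        y[k] = yi
    end
    return y
end
```

The conversion could happen in the layer constructor, so only `_zComp_cart!` runs per call.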


Thank you, I have added it to the benchmark…