# Fastest algorithm to compute `[x[i]*x[j] for i <= j where 1 <= i, j <= length(x)]`?

I have a vector `x` and I would like to create a new vector where

`new_vec = [x[i]*x[j] for 1 <= i <= j <= length(x)]`

I can do it like this

```julia
x = rand(Float32, 1000)

create_new_vec(x) = begin
    n = length(x)
    new_vec = typeof(x)(undef, Int((n*n-n)/2) + n)
    upto = 0
    @inbounds for i = 1:n
        for j = i:n
            upto += 1
            new_vec[upto] = x[i]*x[j]
        end
    end
    new_vec
end

@time new_vec = create_new_vec(x)

using CuArrays
xcu = cu(x)

@time create_new_vec(xcu) # SLOW!!! on GPU
```

But I am sure there are faster ways to do it, and the algorithm is very GPU-unfriendly at the moment, so I would appreciate GPU-friendly and fast methods too!

Something like `UpperTriangular(x .* x')` perhaps? `|> vec` leaves a lot of zero entries but maybe you can ignore them?
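To sketch that idea (my own illustration, shown on the CPU; with CuArrays the index vector would also have to live on the device): compute the full outer product in a single broadcast, which is one GPU-friendly fused kernel, then gather the upper-triangular entries in the same order as the original nested loop.

```julia
# Sketch: full n×n outer product (one fused broadcast), then gather the
# upper-triangular entries in the original loop order.
function create_new_vec_outer(x)
    n = length(x)
    P = x .* x'                                        # P[i, j] == x[i]*x[j]
    idx = [CartesianIndex(i, j) for i in 1:n for j in i:n]
    P[idx]                                             # row-major upper triangle
end

x = rand(Float32, 5)
create_new_vec_outer(x) == [x[i]*x[j] for i in 1:5 for j in i:5]  # true
```

This does `n²` multiplies instead of `n(n+1)/2`, but trades that for a kernel shape GPUs handle well.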


Seems wasteful to ship a single small operation like this out to a GPU, as opposed to combining it with other operations first so that the inner loop is more substantial.

What do you want to do with this array? Why do you need to construct it at all, as opposed to operating directly on the `x[i]*x[j]` values in a loop that does several things at once?


Actually this is the first layer in my neural network, so I want it to be efficient. But I could do it on the CPU and save the result before sending it to the GPU.

You are computing almost nothing. You are limited by memory bandwidth, so it is unlikely that there are simple things you can do to speed this up.

```julia
julia> using BenchmarkTools

julia> rs = create_new_vec(x);

julia> @btime copy($rs);
  250.591 μs (2 allocations: 1.91 MiB)

julia> @btime create_new_vec($x);
  208.151 μs (2 allocations: 1.91 MiB)
```

The best approach would be not to materialize the result at all: recomputing `x[i]*x[j]` on access is faster than reading `new_vec` from memory, so `new_vec` is pretty useless.

You could probably get this written faster, but who cares? Writing `new_vec` is already about as fast as reading it (and you presumably read `new_vec` later on; why write it otherwise?).
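As an illustration of that point (my sketch, not code from the thread): if the downstream consumer is, say, a reduction, a lazy generator can consume the products directly and `new_vec` never needs to exist.

```julia
# Sketch: feed x[i]*x[j] straight into the consumer (here a sum) via a
# lazy generator, so the n(n+1)/2-element buffer is never allocated.
pairsum(x) = sum(x[i] * x[j] for i in eachindex(x) for j in i:lastindex(x))

x = rand(1000)
n = length(x)
materialized = [x[i] * x[j] for i in 1:n for j in i:n]  # what new_vec holds
pairsum(x) ≈ sum(materialized)  # true
```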

PS:

```julia
julia> pf = peakflops();

julia> N = 2*10^7; a = rand(N); b = copy(a); t = @belapsed broadcast!(identity, b, a);

julia> pf / (N/t)
34.469652372761765
```

In order to saturate my CPU, I need to run 17-34 floating point operations on each value that is read from main memory. You perform a single operation.


This is very true. The only difficulty I see with this approach is the conversion between linear indices in `new_vec`, and cartesian indices in `x[i]*x[j]`.

A bit of code being clearer than a thousand words, here is the definition of a (very minimalistic) type which offers the same interface as `new_vec` while taking much less storage:

```julia
struct MyVec{T}
    x::Vector{T}
end

function getindex(v::MyVec, I)
    (i, j) = ind(I, length(x))
    v.x[i]*v.x[j]
end
```

The problem is: how do we compute the indices `(i,j)` in not too many cycles? (i.e., how do we define `ind` efficiently in the mock-up code above?)

I might have missed something trivial, but a first “back-of-the-envelope” calculation gives me something like the following, which involves lots of expensive floating-point computations (and I’m not even sure it is correct):

```julia
# I: linear index (as in new_vec)
# N: length of x
function ind(I, N)
    b = -(2N+3)
    c = 2(I+N)
    delta = b^2 - 4*c
    i = -0.5*(b + sqrt(delta)) |> floor |> Int
    j = I - N*(i-1) + div((i-1)*i, 2)
    (i, j)
end
```

Assuming this to be correct in all cases (it seems to be in the example below), I get:

```julia
julia> x = rand(Float32, 1000);

julia> y1 = create_new_vec(x);

julia> y2 = MyVec(x);

julia> all(y1[i] == y2[i] for i in 1:length(y1))
true

julia> @btime sum(y1[i] for i in 1:length(y1))
  35.545 ms (2001496 allocations: 30.54 MiB)
124158.62f0

julia> @btime sum(y2[i] for i in 1:length(y1))
  43.021 ms (2001496 allocations: 30.54 MiB)
124158.62f0
```

which is an overhead of 20-25%, similar to what you got above between `copy` and `create_new_vec`.


These allocations should be a warning sign: there shouldn’t be any. Remember to interpolate your variables when benchmarking. Here’s how to rewrite your test in a more efficient, allocation-free way (including the original code, for self-containment):

```julia
create_new_vec(x) = begin
    n = length(x)
    new_vec = typeof(x)(undef, Int((n*n-n)/2) + n)
    upto = 0
    @inbounds for i = 1:n
        for j = i:n
            upto += 1
            new_vec[upto] = x[i]*x[j]
        end
    end
    new_vec
end

struct MyVec{T}
    x::Vector{T}
end

@inline function Base.getindex(v::MyVec, I)
    (i, j) = ind(I, length(v.x))
    v.x[i]*v.x[j]
end

@inline function ind(I, N)
    b = -(2N+3)
    c = 2(I+N)
    delta = b^2 - 4*c
    i = -0.5*(b + sqrt(delta)) |> floor |> Int
    j = I - N*(i-1) + div((i-1)*i, 2)
    (i, j)
end
```

Testing it:

```julia
julia> using BenchmarkTools

julia> x = rand(Float32, 1000); y1 = create_new_vec(x); y2 = MyVec(x);

julia> test(v) = @btime (s = 0f0; @inbounds for i = 1:length($y1) s += $v[i] end; s);

julia> all(y1[i] == y2[i] for i in 1:length(y1))
true

julia> test(y1)
  557.438 μs (0 allocations: 0 bytes)
127086.86f0

julia> test(y2)
  2.883 ms (0 allocations: 0 bytes)
127086.86f0
```

So it’s a lot faster overall, but the difference from the lookup table is now huge: around 5 times slower in this test. One way to speed it up is to build an enormous hard-coded binary tree, basically to avoid the costly square root. Bear with me, this is a major hack:

```julia
function bin(a, b, n)
    idx(k) = 1+(k-1)*(2n+2-k)÷2
    if a == b
        "($a,k-$(idx(a)-a))"
    else
        m = 1 + (a + b) ÷ 2
        "k<$(idx(m)) ? $(bin(a, m-1, n)) : $(bin(m, b, n))"
    end
end

eval(Meta.parse("ind_bin(k) = $(bin(1, 1000, 1000))"))
```

This creates a method which looks like this:

```julia
ind_bin(k) = k<375251 ? k<218876 ? k<117251 ? k<61048 ? k<31505 ?
k<15881 ? k<7973 ? k<3995 ? k<2000 ? k<1001 ? (1,k-0) : (2,k-999) :
k<2998 ? (3,k-1997) : (4,k-2994) : k<5986 ? k<4991 ... (27500 chars omitted)
```

```julia
struct MyVecBin{T}
    x::Vector{T}
end

@inline function Base.getindex(v::MyVecBin, I)
    (i, j) = ind_bin(I)
    v.x[i]*v.x[j]
end
```

Testing it:

```julia
julia> y3 = MyVecBin(x);

julia> all(y1[i] == y3[i] for i in 1:length(y1))
true

julia> test(y3)
  2.178 ms (0 allocations: 0 bytes)
127086.86f0
```

So: faster than the square root, but still a lot slower than the lookup table. I’m sure you can do better. (Notably, if you’re actually iterating over all values, as opposed to doing random access, the indices can be computed easily.)
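To illustrate that last remark (a sketch of mine, not from the thread): when the traversal is sequential, `(i, j)` can simply be carried along and updated in O(1) per step, with no square root or binary tree at all.

```julia
# Sketch: sequential traversal of all pairs i <= j, carrying (i, j) along
# so no sqrt/div is needed to recover them from a linear index.
function pair_sum(x)
    n = length(x)
    s = zero(eltype(x))
    i, j = 1, 1
    for _ in 1:(n * (n + 1) ÷ 2)
        s += x[i] * x[j]
        j += 1
        if j > n        # row i exhausted: start the next row on the diagonal
            i += 1
            j = i
        end
    end
    s
end

x = rand(100)
pair_sum(x) ≈ sum(x[i] * x[j] for i in 1:100 for j in i:100)  # true
```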


Oops, my bad! Thanks for the correction.

That’s what I had in mind too, although I would probably have implemented it with a binary search in a vector of indices. But I like your “hacky” approach! Still, I’m disappointed; I would have thought the gains would be more substantial. All in all, although we do have access to plenty of “free” computation here, I’m not sure how to use it in a smart enough way to provide random access to this vector in reasonable time (i.e., remaining at least nearly memory-bound).

Always learn so much from your insightful code and explanation!