Fastest algorithm to compute `[x[i]*x[j] for i <= j where 1 <= i, j <= length(x)]`?


#1

I have a vector x and I would like to create a new vector where

new_vec = [x[i]*x[j] for 1 <= i <= j <= length(x)]

I can do it like this

x = rand(Float32, 1000)

create_new_vec(x) = begin
   n = length(x)
   new_vec = typeof(x)(undef, Int((n*n-n)/2) + n)
   upto = 0
   @inbounds for i=1:n
      for j=i:n
         upto += 1
         new_vec[upto] = x[i]*x[j]
      end
   end
   new_vec
end

@time new_vec = create_new_vec(x)

using CuArrays
xcu = cu(x)

@time create_new_vec(xcu) #SLOW!!! on GPU

But I am sure there are faster ways to do it and the algorithm is very GPU unfriendly at the moment so I would appreciate GPU friendly and fast methods too!


#2

Something like UpperTriangular(x .* x') perhaps? |> vec leaves a lot of zero entries but maybe you can ignore them?


#3

Seems wasteful to ship a single small operation like this out to a GPU, as opposed to combining it with other operations first so that the inner loop is more substantial.

What do you want to do with this array? Why do you need to construct it at all, as opposed to operating directly on the x[i]*x[j] values in a loop that does several things at once?


#4

Actually this is the first layer in my neural network. So i want it to be efficient. But i could do it on the cpu and save the result before sending to gpu


#5

You are computing almost nothing. You are limited by memory bandwidth, so it is unlikely that there are simple things you can do to speed this up.

julia> using BenchmarkTools
julia> rs = create_new_vec(x);
julia> @btime copy($rs);
  250.591 μs (2 allocations: 1.91 MiB)
julia> @btime create_new_vec($x);
  208.151 μs (2 allocations: 1.91 MiB)

Best approach would be to not materialize the result at all: Recomputing x[i]*x[j] on access is faster than accessing new_vec, so new_vec is pretty useless.

You probably could get this written faster, but who cares? Writing new_vec is already comparably fast as reading it (and you presumably are reading new_vec later on, because why write it otherwise?).

PS:

julia> pf = peakflops();
julia> N=2*10^7; a=rand(N); b=copy(a); t=@belapsed broadcast!(identity, b, a);
julia> pf / (N/t)
34.469652372761765

In order to saturate my CPU, I need to run 17-34 floating point operations on each value that is read from main memory. You perform a single operation.


#6

This is very true. The only difficulty I see with this approach is the conversion between linear indices in new_vec, and cartesian indices in x[i]*x[j].

A bit of code being clearer than a thousand words, here is the definition of a (very minimalistic) type which proposes the same interface as new_vec, while taking much less storage:

struct MyVec{T}
    x::Vector{T}
end

function getindex(v::MyVec, I)
    (i,j) = ind(I, length(x))
    v.x[i]*v.x[j]
end

The problem is, how to compute indices (i,j) in not too many cycles? (i.e. how to define ind efficiently in the mock-up code above ?


I might have missed something trivial, but first “back-of-the-enveloppe” calculations give me something like the following, which involves lots of expensive floating-point computations (and I’m not even sure this is correct :confused:)

# I: linear index (as in new_vec)
# N: length of x
function ind(I, N)
    b = -(2N+3)
    c = 2(I+N)
    delta = b^2-4*c
    i = -0.5*(b + sqrt(delta)) |> floor |> Int
    j = I - N*(i-1) + div((i-1)*i, 2)
    (i,j)
end

Assuming this to be correct in all cases (it seems to be in the example below), I get:

julia> x = rand(Float32, 1000);
julia> y1 = create_new_vec(x);
julia> y2 = MyVec(x);

julia> all(y1[i] == y2[i] for i in 1:length(y1))
true

julia> @btime sum(y1[i] for i in 1:length(y1))
  35.545 ms (2001496 allocations: 30.54 MiB)
124158.62f0

julia> @btime sum(y2[i] for i in 1:length(y1))
  43.021 ms (2001496 allocations: 30.54 MiB)
124158.62f0

which is an overhead of 20-25%, similar to what you got above between copy and create_new_vec.


#7

These allocations should be a warning sign! There shouldn’t be any. Remember to interpolate your variables when benchmarking. Here’s how to rewrite your test in a more efficient, allocation-free way (including the original code, for a self-containment):

create_new_vec(x) = begin
   n = length(x)
   new_vec = typeof(x)(undef, Int((n*n-n)/2) + n)
   upto = 0
   @inbounds for i=1:n
      for j=i:n
         upto += 1
         new_vec[upto] = x[i]*x[j]
      end
   end
   new_vec
end

struct MyVec{T} x::Vector{T} end

@inline function Base.getindex(v::MyVec, I)
    (i,j) = ind(I, length(v.x))
    v.x[i]*v.x[j]
end

@inline function ind(I, N)
    b = -(2N+3)
    c = 2(I+N)
    delta = b^2-4*c
    i = -0.5*(b + sqrt(delta)) |> floor |> Int
    j = I - N*(i-1) + div((i-1)*i, 2)
    (i,j)
end

Testing it:

julia> using BenchmarkTools

julia> x = rand(Float32, 1000); y1 = create_new_vec(x); y2 = MyVec(x);

julia> test(v) = @btime (s = 0f0; @inbounds for i=1:length($y1) s += $v[i] end; s);

julia> all(y1[i] == y2[i] for i in 1:length(y1))
true

julia> test(y1)
  557.438 μs (0 allocations: 0 bytes)
127086.86f0

julia> test(y2)
  2.883 ms (0 allocations: 0 bytes)
127086.86f0

So it’s a lot faster overall, but the difference from the lookup table is now huge – around 5 times slower in this test. One way to speed it up is to build an enormous hard-coded binary tree basically to avoid the costly square root. Bare with me, this is a major hack :slight_smile:

function bin(a, b, n)
   idx(k) = 1+(k-1)*(2n+2-k)÷2
   if a == b
      "($a,k-$(idx(a)-a))"
   else
      m = 1 + (a + b) ÷ 2
      "k<$(idx(m)) ? $(bin(a, m-1, n)) : $(bin(m, b, n))"
   end
end

eval(Meta.parse("ind_bin(k) = $(bin(1, 1000, 1000))"))

This creates a method which looks like this:

ind_bin(k) = k<375251 ? k<218876 ? k<117251 ? k<61048 ? k<31505 ?
k<15881 ? k<7973 ? k<3995 ? k<2000 ? k<1001 ? (1,k-0) : (2,k-999) :
k<2998 ? (3,k-1997) : 4,k-2994) : k<5986 ? k<4991 ... (27500 chars omitted)

Then add the boilerplate code:

struct MyVecBin{T} x::Vector{T} end

@inline function Base.getindex(v::MyVecBin, I)
    (i,j) = ind_bin(I)
    v.x[i]*v.x[j]
end

Testing it:

julia> y3 = MyVecBin(x);

julia> all(y1[i] == y3[i] for i in 1:length(y1))
true

julia> test(y3)
  2.178 ms (0 allocations: 0 bytes)
127086.86f0

So, faster than square root, but still a lot slower than the lookup table. I’m sure you can do better. (Notably, if you’re actually iterating over all values, as opposed to random access, the indexes can be computed easily.)


#8

Oops, my bad! Thanks for the correction.

That’s what I had in mind, too, although I would probably have implemented it with a binary search in a vector of indices. But I like your “hacky” approach :wink:

I’m disappointed, I would have thought the gains to be more substantial. All in all, although we do have access to plenty “free” computations here, I’m not sure how to use them in a smart enough way to be able to provide random access to this vector in reasonable times (i.e. remaining (at least nearly) memory-bound).


#9

Always learn so much from your insightful code and explanation!