These allocations should be a warning sign! There shouldn’t be any. Remember to interpolate your variables when benchmarking. Here’s how to rewrite your test in a more efficient, allocation-free way (including the original code, for self-containment):
create_new_vec(x) = begin
    n = length(x)
    new_vec = typeof(x)(undef, (n*n - n) ÷ 2 + n)
    upto = 0
    @inbounds for i = 1:n
        for j = i:n
            upto += 1
            new_vec[upto] = x[i] * x[j]
        end
    end
    new_vec
end
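As a quick sanity check on the layout (a small, hypothetical example, not part of the benchmark): for `x = [1, 2, 3]` the packed vector stores row `i`’s products `x[i]*x[j]`, `j ≥ i`, consecutively, and its length is `n*(n+1)÷2`, the same value as `(n*n - n) ÷ 2 + n`:

```julia
x = [1.0, 2.0, 3.0]

# row 1: 1*1, 1*2, 1*3 -> packed indices 1:3
# row 2: 2*2, 2*3      -> packed indices 4:5
# row 3: 3*3           -> packed index  6
packed = [x[i] * x[j] for i in 1:3 for j in i:3]

@assert packed == [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
@assert length(packed) == 3 * (3 + 1) ÷ 2
```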
struct MyVec{T}
    x::Vector{T}
end

@inline function Base.getindex(v::MyVec, I)
    (i, j) = ind(I, length(v.x))
    v.x[i] * v.x[j]
end
@inline function ind(I, N)
    b = -(2N + 3)
    c = 2(I + N)
    delta = b^2 - 4c
    i = -0.5 * (b + sqrt(delta)) |> floor |> Int
    j = I - N*(i - 1) + div((i - 1) * i, 2)
    (i, j)
end
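The quadratic comes from inverting the forward map `I = j + N*(i-1) - (i-1)*i÷2` (the linear index of element `(i, j)` in the packed vector). A brute-force verification sketch, reusing the definition above:

```julia
# Copied from above: recover (i, j) from packed linear index I.
@inline function ind(I, N)
    b = -(2N + 3)
    c = 2(I + N)
    delta = b^2 - 4c
    i = -0.5 * (b + sqrt(delta)) |> floor |> Int
    j = I - N*(i - 1) + div((i - 1) * i, 2)
    (i, j)
end

# Enumerate all pairs j >= i in packed order and check that
# ind inverts the enumeration exactly.
for N in (1, 2, 10, 100)
    I = 0
    for i in 1:N, j in i:N
        I += 1
        @assert ind(I, N) == (i, j)
    end
end
```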
Testing it:
julia> using BenchmarkTools
julia> x = rand(Float32, 1000); y1 = create_new_vec(x); y2 = MyVec(x);
julia> test(v) = @btime (s = 0f0; @inbounds for i=1:length($y1) s += $v[i] end; s);
julia> all(y1[i] == y2[i] for i in 1:length(y1))
true
julia> test(y1)
557.438 μs (0 allocations: 0 bytes)
127086.86f0
julia> test(y2)
2.883 ms (0 allocations: 0 bytes)
127086.86f0
So it’s a lot faster overall, but the gap to the lookup table is now huge: around 5 times slower in this test. One way to speed it up is to build an enormous hard-coded binary tree, essentially to avoid the costly square root. Bear with me, this is a major hack:
function bin(a, b, n)
    idx(k) = 1 + (k - 1) * (2n + 2 - k) ÷ 2
    if a == b
        "($a,k-$(idx(a)-a))"
    else
        m = 1 + (a + b) ÷ 2
        "k<$(idx(m)) ? $(bin(a, m-1, n)) : $(bin(m, b, n))"
    end
end
eval(Meta.parse("ind_bin(k) = $(bin(1, 1000, 1000))"))
This creates a method which looks like this:
ind_bin(k) = k<375251 ? k<218876 ? k<117251 ? k<61048 ? k<31505 ?
k<15881 ? k<7973 ? k<3995 ? k<2000 ? k<1001 ? (1,k-0) : (2,k-999) :
k<2998 ? (3,k-1997) : (4,k-2994) : k<5986 ? k<4991 ... (27500 chars omitted)
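To see the shape of the tree without 27 kB of output, here is the same generator run for a hypothetical small size n = 4 (the name `ind_bin4` is made up for this sketch):

```julia
# Copied from above: emit a nested-ternary source string that
# binary-searches for the row containing packed index k.
function bin(a, b, n)
    idx(k) = 1 + (k - 1) * (2n + 2 - k) ÷ 2
    if a == b
        "($a,k-$(idx(a)-a))"
    else
        m = 1 + (a + b) ÷ 2
        "k<$(idx(m)) ? $(bin(a, m-1, n)) : $(bin(m, b, n))"
    end
end

src = bin(1, 4, 4)
# src == "k<8 ? k<5 ? (1,k-0) : (2,k-3) : k<10 ? (3,k-5) : (4,k-6)"
eval(Meta.parse("ind_bin4(k) = $src"))

# Rows for n = 4 start at packed indices 1, 5, 8, 10.
@assert ind_bin4(1) == (1, 1)
@assert ind_bin4(5) == (2, 2)
@assert ind_bin4(8) == (3, 3)
@assert ind_bin4(10) == (4, 4)
```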
Then add the boilerplate code:
struct MyVecBin{T}
    x::Vector{T}
end

@inline function Base.getindex(v::MyVecBin, I)
    (i, j) = ind_bin(I)
    v.x[i] * v.x[j]
end
Testing it:
julia> y3 = MyVecBin(x);
julia> all(y1[i] == y3[i] for i in 1:length(y1))
true
julia> test(y3)
2.178 ms (0 allocations: 0 bytes)
127086.86f0
So: faster than the square-root version, but still a lot slower than the lookup table. I’m sure you can do better. (Notably, if you’re actually iterating over all values, as opposed to random access, the indices can be computed incrementally, with no square root at all.)
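For example, a full reduction can walk the triangle directly, tracking `(i, j)` with two plain loops (a sketch, not from the original benchmark; `sum_all` is a made-up name):

```julia
struct MyVec{T}  # as defined earlier
    x::Vector{T}
end

# Sequential traversal in packed order: the (i, j) pair advances
# incrementally, so no per-element index arithmetic is needed.
function sum_all(v::MyVec{T}) where T
    x = v.x
    n = length(x)
    s = zero(T)
    @inbounds for i in 1:n
        xi = x[i]
        for j in i:n
            s += xi * x[j]
        end
    end
    s
end

# 1 + 2 + 3 + 4 + 6 + 9 == 25
v = MyVec([1.0f0, 2.0f0, 3.0f0])
@assert sum_all(v) == 25.0f0
```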