Reducing getindex allocation [edit: when using an array as index]

I need to access a matrix millions of times, and I noticed that most of my code's allocations come from exactly this. My code is similar to the following example:

const A = rand(10,10)
function get_number(idx)
	@inbounds getindex(A, idx...)
end

Say I need to access index (4, 6) 100,000 times:

idx = [4,6]
@time begin
for i in 1:100000
    get_number(idx)
end
end
0.011408 seconds (200.78 k allocations: 3.099 MiB, 16.23% compilation time)

This, as you can see, allocates 3 MiB. Each call makes two 16-byte allocations:

 @time  get_number(idx)
0.000004 seconds (2 allocations: 32 bytes)

Is there a way to reduce/eliminate these allocations?

Splatting seems to be the culprit here. The following

using BenchmarkTools

const A = rand(10,10)
function get_number(idx1, idx2)
	@inbounds getindex(A, idx1, idx2)
end

function test()
    for i in 1:100000
        idx = (rand(1:10),rand(1:10))
        get_number(idx[1], idx[2])
    end
end

@btime test()

yields

  1.074 ms (0 allocations: 0 bytes)

Never mind, indexing with a vector instead of a tuple is the problem. The following works, too:

using BenchmarkTools

const A = rand(10,10)
function get_number(idx...)
	@inbounds getindex(A, idx...)
end

function test()
    for i in 1:100000
        idx = (rand(1:10),rand(1:10))
        get_number(idx...)
    end
end

@btime test()

yielding

  1.070 ms (0 allocations: 0 bytes)
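An aside on why this happens (not from the original posts): the length of a `Vector` is not part of its type, so splatting it into `getindex` has to build a temporary at runtime, whereas a tuple's length is known to the compiler. A minimal sketch along the thread's lines (`B`, `getnum`, and `splat_bytes` are made-up names for illustration):

```julia
const B = rand(10, 10)                # stands in for A above

getnum(idx) = @inbounds B[idx...]

function splat_bytes(idx)
    getnum(idx)                       # warm up (compile) before measuring
    return @allocated getnum(idx)
end

vec_bytes = splat_bytes([4, 6])       # Vector index: the splat allocates
tup_bytes = splat_bytes((4, 6))      # Tuple index: length is in the type, no allocation
@assert vec_bytes > 0 && tup_bytes == 0
```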

Unfortunately, I need idx to be a vector of size n rather than a tuple. At least now I know what the cause is, thanks! I will try to look into it.

Interesting, fixed or variable size n? In the second case I’d expect further problems…

In my case, n is a const. It can be a number between 2 and, say, 5 at most, but it is fixed at the beginning and never changes during the experiment. I tried the following:

const A = rand(10,10)
function get_number(idx)
	@inbounds getindex(A, (idx[1],idx[2])...)
end

And this largely solves the problem: allocations drop from 14.5 GiB to 500 MiB and the code is more than 2x faster. However, this does not work as soon as idx is of size 3 or more.

If you can generalize the MWE, we could try to use ntuple?

Can you write out the indexing expression? Like

A[idx[1], idx[2], idx[3]]

That should help. Otherwise, if n is a compile-time constant, you may be able to turn idx into a tuple in an efficient way.

BTW, you don’t have to use getindex, just write A[idx...]
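Bracket indexing is lowered to exactly that call, so the two spellings behave identically. A quick illustrative check (throwaway matrix, not the A from above):

```julia
M = [1 2; 3 4]
@assert M[1, 2] == getindex(M, 1, 2)   # A[i, j] is sugar for getindex(A, i, j)
@assert M[(1, 2)...] == M[1, 2]        # splatting a tuple of indices works the same
```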

What is an MWE?

I tried to use ntuple but it seems to allocate a lot of memory on its own.

M(inimal)W(orking)E(xample), i.e. your example code…

Yes, maybe something like

tup = ntuple(i->idx[i], Val(4))
A[tup...]

if 4 is the number.
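That suggestion can be checked in isolation. A sketch assuming a hypothetical 4-dimensional case (the names `C`, `get4`, and `get4_bytes` are made up for illustration):

```julia
const C = rand(10, 10, 10, 10)        # hypothetical 4-d array

function get4(idx)
    tup = ntuple(i -> idx[i], Val(4)) # tuple length fixed at compile time
    return @inbounds C[tup...]
end

function get4_bytes()
    idx = [1, 2, 3, 4]
    get4(idx)                         # warm up (compile) before measuring
    return @allocated get4(idx)
end
```

With the `Val` form the tuple length is encoded in the type, so the conversion unrolls and `get4_bytes()` should report 0 bytes per call.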

But, @user_231578, when you say n is constant, where and how is it set?

Transforming idx into a tuple in an efficient way is what I would like to do, but I am failing at it. And you are right, there is no need to write getindex.

ntuple seems to allocate too much memory to be useful here:

 0.006581 seconds (5.21 k allocations: 322.297 KiB, 98.60% compilation time)

n is set at the beginning of the code, take this as an example:

const n = 3
const dims = ntuple(i -> 10, Val(n))
const A = rand(dims...)

function get_number(idx)
	@inbounds A[idx...]
end

Are you sure your example doesn't work as intended?
Using your code and benchmarking properly, I get the following in a fresh REPL (Julia v1.7.1):

using BenchmarkTools
const n = 3;
const dims = ntuple(i -> 10, Val(n));
const A = rand(dims...);

function get_number(idx)
	@inbounds A[idx...]
end
@btime get_number((1,2,3))

  1.900 ns (0 allocations: 0 bytes)
0.9097530819846809  

It does work as intended: as @goerch pointed out, the problem was that I was indexing with a vector rather than a tuple. Thus, my problem has shifted: I need a non-allocating way to convert my idx array to a tuple. For the moment, this workaround seems to do just fine (but it is ugly):

function get_tuple(idx)
	if n == 2 
		tup = (idx[1], idx[2])
	elseif n == 3
		tup = (idx[1], idx[2], idx[3])
	elseif n == 4
		tup = (idx[1], idx[2], idx[3], idx[4])
	elseif n == 5
		tup = (idx[1], idx[2], idx[3], idx[4], idx[5])
	end
	return tup
end

Ah, OK, then I misunderstood you:

how about the following then?

get_tuple2(idx) = ntuple(i -> idx[i], n)

There is one allocation of 32 bytes, as in your example (possibly one could eliminate that as well if there were a need for it).


You are perfectly right! Thanks!

Without a generalized MWE I can only propose

using BenchmarkTools

function get_tuple(::Val{N}, idx) where {N}
    ntuple(i -> idx[i], Val(N))
end

@btime get_tuple(Val(2), vec) setup=(vec=rand(1:2, 2))
@btime get_tuple(Val(3), vec) setup=(vec=rand(1:3, 3))
@btime get_tuple(Val(4), vec) setup=(vec=rand(1:4, 4))

const N = 2
@btime get_tuple(Val(N), vec) setup=(vec=rand(1:N, N))

@btime get_tuple(Val(M), vec) setup=(M=2; vec=rand(1:M, M))

yielding

  2.200 ns (0 allocations: 0 bytes)
  2.700 ns (0 allocations: 0 bytes)
  2.700 ns (0 allocations: 0 bytes)
  2.400 ns (0 allocations: 0 bytes)
  188.406 ns (1 allocation: 32 bytes)

showing how careful we have to be with const vs non-const (I initially tried to benchmark with the last line…).

Edit: corrected ntuple to use the Val method (which could be helpful in other situations)…

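To summarize the thread, here is the whole solution in one self-contained sketch (n fixed as a const, as in the original question; `Val(n)` bakes the index length into the type so the tuple conversion compiles away):

```julia
const n = 3
const dims = ntuple(i -> 10, Val(n))
const A = rand(dims...)

# Convert the Vector index to a tuple whose length is known at
# compile time, then splat it into the indexing expression.
get_number(idx) = @inbounds A[ntuple(i -> idx[i], Val(n))...]

function demo_bytes()
    idx = [4, 6, 2]
    get_number(idx)                    # warm up (compile) before measuring
    return @allocated get_number(idx)
end
```

On a recent Julia this should report 0 bytes per call, matching the zero-allocation numbers measured above.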