Why is `getindex` slower the second time in a vector of vectors?

Why is in this function calling the getindex method the second time slower?

This function (the smallest working example I could make) is a snippet of a bigger project I’m making on agent based models where I choose a random element and his neighbors and then I operate on the data associated to both.

function testing(
) where {T<:Integer}
    for _ in 1:maxtime
        element_1 = rand(1:n_elements)
        element_2 = rand(neighs[element_1])
        data_1 = data[element_1]
        data_2 = data[element_2]

n_elements = 1000
my_data = [rand(1:10, 10) for _ = 1:n_elements]
my_neighs = [rand(1:n_elements, 5) for _ = 1:n_elements]

@profview testing(my_data, my_neighs, n_elements, 1_000_000_000)

When profiling in VS Code I get the following profiling:

  1. 38%: element_2 = rand(neighs[element_1])
  2. 31%: element_1 = rand(1:n_elements)
  3. 21%: data_2 = data[element_2]
  4. 5%: data_1 = data[element_1]

This is the flame graph:

I would like to understand why this second call of the data vector of vector is much slower. Is it a problem in the profiler? This especially seems to happen in vector of vectors, and when I have the second element taken from another vector.

Thanks in advance for anyone that will help me!

Output of @code_warntype
MethodInstance for testing(::Vector{Vector{Int64}}, ::Vector{Vector{Int64}}, ::Int64, ::Int64)
  from testing(data::Array{Vector{T}, 1}, neighs::Array{Vector{T}, 1}, n_elements::Integer, maxtime::Integer) where T<:Integer @ Main c:\Users\aless\My Drive\Dottorato\PhD Code\ABM\axelrod\julia\minimal_test.jl:1
Static Parameters
  T = Int64
  @_6::Union{Nothing, Tuple{Int64, Int64}}
1 ─ %1  = (1:maxtime)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])        
│         (@_6 = Base.iterate(%1))
│   %3  = (@_6 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_6::Tuple{Int64, Int64}
│         Core.getfield(%6, 1)
│   %8  = Core.getfield(%6, 2)::Int64
│   %9  = (1:n_elements)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])     
│         (element_1 = Main.rand(%9))
│   %11 = Base.getindex(neighs, element_1)::Vector{Int64}
│         (element_2 = Main.rand(%11))
│         (data_1 = Base.getindex(data, element_1))
│         (data_2 = Base.getindex(data, element_2))
│         (@_6 = Base.iterate(%1, %8))
│   %16 = (@_6 === nothing)::Bool
│   %17 = Base.not_int(%16)::Bool
└──       goto #4 if not %17
3 ─       goto #2
4 ┄       return nothing

Not a direct answer but if you just want something faster you may want to checkout ArraysOfArrays.jl

1 Like

This is speculation, but the way the indices are calculated, the delay between calculation of index and getindex is much longer for element_1, and shorter for element_2. This allows the CPU to schedule the memory read earlier.

To elaborate:

        element_1 = rand(...)  # 1) takes a while
        element_2 = rand(...)  # 2) takes a while
        data_1 = data[element_1] # 3) pretty quick
        data_2 = data[element_2] # 4) pretty quick

So the delay (1)-(3) is long, but delay (2)-(4) is short.

This might be tested by doing:

        element_1 = rand(...)  # 1) takes a while
        data_1 = data[element_1] # 3) pretty quick
        element_2 = rand(...)  # 2) takes a while
        data_2 = data[element_2] # 4) pretty quick

i.e. reordering to (1)-(3)-(2)-(4) and if the compiler isn’t smart enough to calculate dependencies (which it might be) then both getindexes will be slow.

1 Like
element_1 = rand(1:n_elements)
data_1 = data[element_1]
element_2 = rand(neighs[element_1])
data_2 = data[element_2]

By reordering this way I get this:

  1. 39%: element_2 = rand(neighs[element_1])
  2. 38%: data_2 = data[element_2]
  3. 17%: element_1 = rand(1:n_elements)
  4. 4%: data_1 = data[element_1]

Very weird result…

It’s about the same as the last result. And as I mentioned, reordering here might not be useful since the compiler can/does reordering itself.

Perhaps to test some hypotheses, it would be useful to change the dependence of the statements, I.e. replace the neighs index with another rand(1:n_elements) index.


Actually no! the original result had element_1 second place, here callking getindex in data_2 seems even slower.

Trying without the neighs gives a “normal” result where data_1 and data_2 take the same amount of time.

The benchmarked code is a version of https://en.wikichip.org/wiki/pointer_chasing .

One way to break the dependence is to use a different data structure, for example:

edges = [(v2, v1) for v1 in 1:n_elements for v2 in neighs[v1]]
element_2, element_1 = random(edges)
data_1 = data[element_1]
data_2 = data[element_2]

Some care needs to be taken with regard to the distribution of element_1 and element_2. It isn’t the same as the OP for non degree uniform graphs.

1 Like

My recommendation is to not worry too much about how much time is spent in which line in this case, Modern processors don’t actually execute code line-by-line. While the processor is loading memory for one variable, it might actually be executing something on another variable or perhaps loading another variable. How would you attribute the time consumption of each line? It’s not trivial at all.