# Why is `getindex` slower the second time in a vector of vectors?

Why is the second call to `getindex` in this function slower?

This function (the smallest working example I could make) is a snippet from a larger project on agent-based models, where I choose a random element and its neighbors and then operate on the data associated with both.

```julia
function testing(
    data::Vector{Vector{T}},
    neighs::Vector{Vector{T}},
    n_elements::Integer,
    maxtime::Integer
) where {T<:Integer}
    for _ in 1:maxtime
        element_1 = rand(1:n_elements)
        element_2 = rand(neighs[element_1])
        data_1 = data[element_1]
        data_2 = data[element_2]
    end
end

n_elements = 1000
my_data = [rand(1:10, 10) for _ = 1:n_elements]
my_neighs = [rand(1:n_elements, 5) for _ = 1:n_elements]

@profview testing(my_data, my_neighs, n_elements, 1_000_000_000)
```

When profiling in VS Code I get the following breakdown:

1. 38%: `element_2 = rand(neighs[element_1])`
2. 31%: `element_1 = rand(1:n_elements)`
3. 21%: `data_2 = data[element_2]`
4. 5%: `data_1 = data[element_1]`

This is the flame graph:

I would like to understand why this second lookup into the vector of vectors is so much slower. Is it a problem with the profiler? This seems to happen especially with vectors of vectors, and when the second index is taken from another vector.

Thanks in advance to anyone who can help!

Output of `@code_warntype`:
```
MethodInstance for testing(::Vector{Vector{Int64}}, ::Vector{Vector{Int64}}, ::Int64, ::Int64)
from testing(data::Array{Vector{T}, 1}, neighs::Array{Vector{T}, 1}, n_elements::Integer, maxtime::Integer) where T<:Integer @ Main c:\Users\aless\My Drive\Dottorato\PhD Code\ABM\axelrod\julia\minimal_test.jl:1
Static Parameters
T = Int64
Arguments
#self#::Core.Const(testing)
data::Vector{Vector{Int64}}
neighs::Vector{Vector{Int64}}
n_elements::Int64
maxtime::Int64
Locals
@_6::Union{Nothing, Tuple{Int64, Int64}}
data_2::Vector{Int64}
data_1::Vector{Int64}
element_2::Int64
element_1::Int64
Body::Nothing
1 ─ %1  = (1:maxtime)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
│         (@_6 = Base.iterate(%1))
│   %3  = (@_6 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_6::Tuple{Int64, Int64}
│         Core.getfield(%6, 1)
│   %8  = Core.getfield(%6, 2)::Int64
│   %9  = (1:n_elements)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
│         (element_1 = Main.rand(%9))
│   %11 = Base.getindex(neighs, element_1)::Vector{Int64}
│         (element_2 = Main.rand(%11))
│         (data_1 = Base.getindex(data, element_1))
│         (data_2 = Base.getindex(data, element_2))
│         (@_6 = Base.iterate(%1, %8))
│   %16 = (@_6 === nothing)::Bool
│   %17 = Base.not_int(%16)::Bool
└──       goto #4 if not %17
3 ─       goto #2
4 ┄       return nothing
```

Not a direct answer, but if you just want something faster you may want to check out ArraysOfArrays.jl.
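To illustrate the idea behind that package (flat, contiguous storage instead of a vector of separately allocated vectors), here is a hand-rolled sketch. The `FlatVecs` name and this code are my own illustration, not the ArraysOfArrays.jl API:

```julia
# Store all inner vectors back to back in one flat vector, plus offsets.
# Lookups then touch one contiguous allocation instead of chasing a
# pointer to a separately heap-allocated inner vector.
struct FlatVecs{T}
    data::Vector{T}      # all elements, concatenated
    offsets::Vector{Int} # offsets[i]:offsets[i+1]-1 spans the i-th slice
end

function FlatVecs(vv::Vector{Vector{T}}) where {T}
    offsets = Vector{Int}(undef, length(vv) + 1)
    offsets[1] = 1
    for (i, v) in enumerate(vv)
        offsets[i + 1] = offsets[i] + length(v)
    end
    data = Vector{T}(undef, offsets[end] - 1)
    for (i, v) in enumerate(vv)
        data[offsets[i]:offsets[i + 1] - 1] .= v
    end
    return FlatVecs(data, offsets)
end

# Return a view so indexing allocates no new vector.
Base.getindex(f::FlatVecs, i::Int) =
    @view f.data[f.offsets[i]:f.offsets[i + 1] - 1]
```

For example, `FlatVecs([[1, 2], [3, 4, 5]])[2]` gives the slice `[3, 4, 5]` as a view into the single flat buffer.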


This is speculation, but given the way the indices are calculated, the delay between computing an index and using it in `getindex` is much longer for `element_1` than for `element_2`. That longer gap lets the CPU schedule the memory read earlier.

To elaborate:

```julia
element_1 = rand(...)     # 1) takes a while
element_2 = rand(...)     # 2) takes a while
data_1 = data[element_1]  # 3) pretty quick
data_2 = data[element_2]  # 4) pretty quick
```

So the delay `(1)-(3)` is long, but delay `(2)-(4)` is short.

This might be tested by doing:

```julia
element_1 = rand(...)     # 1) takes a while
data_1 = data[element_1]  # 3) pretty quick
element_2 = rand(...)     # 2) takes a while
data_2 = data[element_2]  # 4) pretty quick
```

i.e. reordering to `(1)-(3)-(2)-(4)`; if the compiler isn't smart enough to track the dependencies (which it might be), then both `getindex` calls will be slow.

```julia
element_1 = rand(1:n_elements)
data_1 = data[element_1]
element_2 = rand(neighs[element_1])
data_2 = data[element_2]
```

By reordering this way I get this:

1. 39%: `element_2 = rand(neighs[element_1])`
2. 38%: `data_2 = data[element_2]`
3. 17%: `element_1 = rand(1:n_elements)`
4. 4%: `data_1 = data[element_1]`

Very weird result…

It’s about the same as the last result. And as I mentioned, reordering here might not be useful, since the compiler can and does reorder things itself.

Perhaps to test some hypotheses, it would be useful to change the dependence between the statements, i.e. replace the `neighs` index with another independent `rand(1:n_elements)` index.
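A sketch of that variant (same setup as the OP's example, with an accumulator added so the loads are not optimized away):

```julia
# Same shape as the original loop, but element_2 no longer depends on a
# load through neighs, so neither data[...] lookup sits at the end of a
# pointer-chasing dependency chain.
function testing_independent(
    data::Vector{Vector{T}},
    n_elements::Integer,
    maxtime::Integer
) where {T<:Integer}
    acc = 0
    for _ in 1:maxtime
        element_1 = rand(1:n_elements)
        element_2 = rand(1:n_elements)  # independent: no lookup through neighs
        data_1 = data[element_1]
        data_2 = data[element_2]
        acc += data_1[1] + data_2[1]    # use the results so the loads survive
    end
    return acc
end
```

Profiling this with `@profview` should show `data_1` and `data_2` costing roughly the same, which is indeed what the OP reports below.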


Actually no! The original result had `element_1` in second place; here, the `getindex` call for `data_2` seems even slower.

Trying without the `neighs` lookup gives a “normal” result where `data_1` and `data_2` take the same amount of time.

The benchmarked code is a version of https://en.wikichip.org/wiki/pointer_chasing .

One way to break the dependence is to use a different data structure, for example:

```julia
edges = [(v2, v1) for v1 in 1:n_elements for v2 in neighs[v1]]
element_2, element_1 = rand(edges)
data_1 = data[element_1]
data_2 = data[element_2]
```

Some care needs to be taken with the distribution of `element_1` and `element_2`: it isn’t the same as in the OP for graphs whose degrees are not uniform.


My recommendation is not to worry too much about how much time is spent on which line in this case. Modern processors don’t actually execute code line by line: while the processor is loading memory for one variable, it might be executing an operation on another variable, or loading yet another. How would you attribute the time consumed to each line? It’s not trivial at all.
