Why is `getindex` slower the second time in a vector of vectors?

alex180500 · May 3, 2024, 10:17pm

Why is in this function calling the getindex method the second time slower?

This function (the smallest working example I could make) is a snippet of a bigger project I’m making on agent based models where I choose a random element and his neighbors and then I operate on the data associated to both.

function testing(
    data::Vector{Vector{T}},
    neighs::Vector{Vector{T}},
    n_elements::Integer,
    maxtime::Integer
) where {T<:Integer}
    for _ in 1:maxtime
        element_1 = rand(1:n_elements)
        element_2 = rand(neighs[element_1])
        data_1 = data[element_1]
        data_2 = data[element_2]
    end
end

n_elements = 1000
my_data = [rand(1:10, 10) for _ = 1:n_elements]
my_neighs = [rand(1:n_elements, 5) for _ = 1:n_elements]

@profview testing(my_data, my_neighs, n_elements, 1_000_000_000)

When profiling in VS Code I get the following profiling:

38%: element_2 = rand(neighs[element_1])
31%: element_1 = rand(1:n_elements)
21%: data_2 = data[element_2]
5%: data_1 = data[element_1]

This is the flame graph:

I would like to understand why this second call of the data vector of vector is much slower. Is it a problem in the profiler? This especially seems to happen in vector of vectors, and when I have the second element taken from another vector.

Thanks in advance for anyone that will help me!

Output of @code_warntype

MethodInstance for testing(::Vector{Vector{Int64}}, ::Vector{Vector{Int64}}, ::Int64, ::Int64)
  from testing(data::Array{Vector{T}, 1}, neighs::Array{Vector{T}, 1}, n_elements::Integer, maxtime::Integer) where T<:Integer @ Main c:\Users\aless\My Drive\Dottorato\PhD Code\ABM\axelrod\julia\minimal_test.jl:1
Static Parameters
  T = Int64
Arguments
  #self#::Core.Const(testing)
  data::Vector{Vector{Int64}}
  neighs::Vector{Vector{Int64}}
  n_elements::Int64
  maxtime::Int64
Locals
  @_6::Union{Nothing, Tuple{Int64, Int64}}
  data_2::Vector{Int64}
  data_1::Vector{Int64}
  element_2::Int64
  element_1::Int64
Body::Nothing
1 ─ %1  = (1:maxtime)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])        
│         (@_6 = Base.iterate(%1))
│   %3  = (@_6 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_6::Tuple{Int64, Int64}
│         Core.getfield(%6, 1)
│   %8  = Core.getfield(%6, 2)::Int64
│   %9  = (1:n_elements)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])     
│         (element_1 = Main.rand(%9))
│   %11 = Base.getindex(neighs, element_1)::Vector{Int64}
│         (element_2 = Main.rand(%11))
│         (data_1 = Base.getindex(data, element_1))
│         (data_2 = Base.getindex(data, element_2))
│         (@_6 = Base.iterate(%1, %8))
│   %16 = (@_6 === nothing)::Bool
│   %17 = Base.not_int(%16)::Bool
└──       goto #4 if not %17
3 ─       goto #2
4 ┄       return nothing

jling · May 3, 2024, 10:31pm

Not a direct answer but if you just want something faster you may want to checkout ArraysOfArrays.jl

Dan · May 3, 2024, 10:46pm

This is speculation, but the way the indices are calculated, the delay between calculation of index and getindex is much longer for element_1, and shorter for element_2. This allows the CPU to schedule the memory read earlier.

To elaborate:

        element_1 = rand(...)  # 1) takes a while
        element_2 = rand(...)  # 2) takes a while
        data_1 = data[element_1] # 3) pretty quick
        data_2 = data[element_2] # 4) pretty quick

So the delay (1)-(3) is long, but delay (2)-(4) is short.

This might be tested by doing:

        element_1 = rand(...)  # 1) takes a while
        data_1 = data[element_1] # 3) pretty quick
        element_2 = rand(...)  # 2) takes a while
        data_2 = data[element_2] # 4) pretty quick

i.e. reordering to (1)-(3)-(2)-(4) and if the compiler isn’t smart enough to calculate dependencies (which it might be) then both getindexes will be slow.

alex180500 · May 4, 2024, 10:09am

element_1 = rand(1:n_elements)
data_1 = data[element_1]
element_2 = rand(neighs[element_1])
data_2 = data[element_2]

By reordering this way I get this:

39%: element_2 = rand(neighs[element_1])
38%: data_2 = data[element_2]
17%: element_1 = rand(1:n_elements)
4%: data_1 = data[element_1]

Very weird result…

Dan · May 4, 2024, 12:23pm

It’s about the same as the last result. And as I mentioned, reordering here might not be useful since the compiler can/does reordering itself.

Perhaps to test some hypotheses, it would be useful to change the dependence of the statements, I.e. replace the neighs index with another rand(1:n_elements) index.

alex180500 · May 4, 2024, 4:00pm

Actually no! the original result had element_1 second place, here callking getindex in data_2 seems even slower.

Trying without the neighs gives a “normal” result where data_1 and data_2 take the same amount of time.

Dan · May 4, 2024, 4:16pm

The benchmarked code is a version of https://en.wikichip.org/wiki/pointer_chasing .

One way to break the dependence is to use a different data structure, for example:

edges = [(v2, v1) for v1 in 1:n_elements for v2 in neighs[v1]]
element_2, element_1 = random(edges)
data_1 = data[element_1]
data_2 = data[element_2]

Some care needs to be taken with regard to the distribution of element_1 and element_2. It isn’t the same as the OP for non degree uniform graphs.

Tarny_GG_Channie · May 5, 2024, 1:56pm

My recommendation is to not worry too much about how much time is spent in which line in this case, Modern processors don’t actually execute code line-by-line. While the processor is loading memory for one variable, it might actually be executing something on another variable or perhaps loading another variable. How would you attribute the time consumption of each line? It’s not trivial at all.

Topic		Replies	Views
Strange behavior with array calls General Usage question , performance	8	493	January 2, 2019
How to speedup setindex! and getindex? General Usage speed-optimization	3	190	January 3, 2025
Performance issue: Axelrod model of dissemination of culture optimization Performance optimization , vector , agents , abm , vector-of-vectors	6	201	April 22, 2024
Summing matrix elements is >1000X slower than summing vector elements General Usage performance	8	1330	April 17, 2017
Improving performance: large amount of time on getindex and setindex! Performance	1	338	December 26, 2022

Why is `getindex` slower the second time in a vector of vectors?

Related topics