Debugging performance difference between identical arrays

I’m working with arrays of the type Vector{Vector{Tuple{Int32,Float32}}} to represent an adjacency matrix. They are stored in a struct:

mutable struct SomeStruct
  ...
  adj::Vector{Vector{Tuple{Int32,Float32}}}
  ...
end

SomeStruct(len,wid) = SomeStruct(..., fillAdjList(len,wid),...)

function fillAdjList(len,wid)::Vector{Vector{Tuple{Int32,Float32}}}
  adj = Vector{Vector{Tuple{Int32,Float32}}}(undef, len*wid)
  
  #some algorithm to fill the adjacency list, with each inner vector sorted by the Int32
  
  return adj
end

The adjacency matrix is used later in my program to calculate an energy function for a given vertex in a graph. The entry for a vertex gets accessed, and a loop then iterates over its tuples, using the Int32 in the first position to index into another array and multiplying that value by the Float32 in the second position.
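
Roughly, the access pattern looks like the sketch below; energy and values are placeholder names for illustration, not the real ones from my program:

function energy(adj::Vector{Vector{Tuple{Int32,Float32}}}, values::Vector{Float32}, v::Int)
  e = 0.0f0
  for (j, w) in adj[v]      # j::Int32 indexes into another array, w::Float32 is the edge weight
    e += w * values[j]
  end
  return e
end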

I was working on updating the algorithm that creates the adjacency matrix to make it more efficient, and noticed that somewhere down the line my program was running slower than it used to. I found out the cause is the adjacency matrix, even though entry by entry it is exactly the same after changing the algorithm. I tried adding proper type annotations everywhere, but to no avail.

What I found works is changing the initialization of SomeStruct to use a deepcopy of the adjacency matrix,

SomeStruct(len,wid) = SomeStruct(..., deepcopy(fillAdjList(len,wid)),...)

which returns the algorithm to its original speed. I’m a bit stumped about what could cause this performance difference. I suspect it is either some kind of type instability or a data locality problem, but I don’t really know how to find out.

I am not sure if this is related to your performance issues, but just as an FYI: Array{Array{Tuple{Int32, Float32}}} is not a concrete type (something you can check with the isconcretetype function), and that can potentially be very harmful to the performance of the struct. Maybe you should try making the struct concrete with, for example,

mutable struct SomeStruct{N1,N2}
  adj::Array{Array{Tuple{Int32,Float32},N1},N2}
end
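
For example, you can check this directly; the missing dimension parameters are what keep the Array version abstract, while the original Vector{Vector{...}} type is already concrete:

isconcretetype(Array{Array{Tuple{Int32,Float32}}})    # false: both dimension parameters are left free
isconcretetype(Vector{Vector{Tuple{Int32,Float32}}})  # true: Vector fixes the dimension to 1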

Sorry, the use of Array was a typo; it should’ve been Vector. I fixed it now.

Do you have an MWE? It’s hard to debug like this.

Not at the moment, since it’s not trivial to decouple everything from the rest of my program. Retrieving adj from the struct it’s stored in and benchmarking sub-parts of my algorithm separately gave the same performance either way, which is why it took me a long time to find out where the problem was.

I was hoping some of you might have a quick intuition as to why making a deepcopy before storing it in the struct could cause a performance difference in the first place. If not, I might try to make an MWE later, but for now the deepcopy trick works.

Here are some diagnostic tools you could try on the function with and without deepcopy. Posting screenshots of the results here (if confidentiality allows) may help us figure out what’s going on:

  1. Simple type analysis with @code_warntype
  2. Profiling, e.g. with @profview in VSCode
  3. More advanced type analysis with JET.jl or Cthulhu.jl

Are you familiar with any of those?
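
If not, a minimal first pass might look like the sketch below, where energy, s and values are placeholders for whatever your actual hot function and data are called:

using InteractiveUtils    # for @code_warntype outside the REPL
using BenchmarkTools

# Type analysis: abstract or Any results are highlighted in red.
@code_warntype energy(s.adj, values, 1)

# Timing: interpolate with $ so globals don’t skew the measurement.
@benchmark energy($(s.adj), $values, 1)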

In case anyone finds this at some point: I’m quite convinced the issue is data locality. The performance hit occurs when the inner arrays are filled with push!, which is how I was building them. When pushing, there is of course no guarantee the inner arrays end up located close to each other in memory.
Making a deepcopy presumably allocates all the inner arrays in one go, in a spot where they can be adjacent or close together, which gives better performance than when each access to an inner array jumps to an essentially random location.
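
A rough way to reproduce the effect, using the placeholder energy function from the sketch above (len, wid and values again stand in for the real sizes and data), is to benchmark the same traversal against the push!-built structure and a deepcopy of it:

using BenchmarkTools

function total_energy(adj, values)
  s = 0.0f0
  for v in eachindex(adj)
    s += energy(adj, values, v)   # placeholder per-vertex function from the sketch above
  end
  return s
end

adj_pushed = fillAdjList(len, wid)   # inner vectors grown with push!
adj_copied = deepcopy(adj_pushed)    # same contents, freshly allocated inner vectors

@btime total_energy($adj_pushed, $values)
@btime total_energy($adj_copied, $values)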

Using sizehint!, if you know the final size in advance, lets you avoid the reallocations.
Even better would be constructing the arrays at full size from the get-go and filling them in a loop. This avoids all the slow ccalls and should be much faster than push!.
Eventually, arrays will be implemented in Julia, and we won’t have that problem.
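
A sketch of the preallocation idea, assuming you can compute the number of neighbours per vertex up front; degree, neighbor and weight are hypothetical helpers standing in for whatever your algorithm uses:

function fillAdjListPrealloc(len, wid, degree, neighbor, weight)
  adj = Vector{Vector{Tuple{Int32,Float32}}}(undef, len*wid)
  for v in 1:len*wid
    d = degree(v)                                    # hypothetical: number of neighbours of vertex v
    inner = Vector{Tuple{Int32,Float32}}(undef, d)   # allocate at final size, no push!/reallocation
    for k in 1:d
      inner[k] = (neighbor(v, k), weight(v, k))      # hypothetical accessors for the k-th neighbour and its weight
    end
    adj[v] = inner
  end
  return adj
end

# If the exact sizes aren’t known, sizehint!(inner, estimate) before push!-ing
# at least avoids most of the intermediate reallocations.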

Thanks for the suggestion, I didn’t know about sizehint!. However, my biggest problem is not the construction of the arrays but using them in the algorithm afterwards, where I don’t understand how the construction affects the performance. I made a follow-up post here:

Understanding the performance and overhead of a vector of SOA vs a vector of AOS for SIMD and the effect of push!

That post also goes a bit further than just this topic, but overall I think it’s related.