How to speed up value updating in an array?

Hello!

I am running code I wrote to find the nearest neighbours between some points. Benchmarking my current version with @profview, I found that what now takes the longest time is not the computations but writing values into the array:

As seen in line 130, this takes the longest time. The array β€œListOfInteractions” is a preallocated vector of tuples.

Does anyone have some general pointers on how to improve this or can point me in the right direction?

To me it seems that my algorithm is working fine, and that updating the values is now the slowest part.

Kind regards

What’s the type of TheCLL.ListOfInteractions? Where is TheCLL defined? Is it a global?

Hey

No, it is a struct defined with concrete types, called inside of a function.

I am also using @views, @inbounds and @inline everywhere I can in the code.

Kind regards

I see that. It’s almost surely not effective in most (if not all) places there.

You really didn’t answer any of my questions, though. What is typeof(TheCLL.ListOfInteractions)?

My bad!

Here it is :slight_smile:

[screenshot of the typeof(TheCLL.ListOfInteractions) output]

Kind regards

I’ve never seen @inline sprinkled around like that. What is the purpose?

It’s better if you share code as text, not as images. It’s hard to copy and paste from images.

@inline is not a general performance annotation, nor does it even make sense in some of the places you’ve applied it. In most circumstances, it will provide something between negligible benefit and horrible regression in performance.

I’d recommend you remove it here. If we got better performance by simply putting it everywhere, we’d just have Julia put it everywhere for us. By default, the compiler makes inlining decisions automatically based on its best judgement. If you find a generalizable example where @inline does provide a big benefit, bring it up for discussion and we can see about updating the inlining algorithm.
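
For reference, a minimal sketch (with made-up helper names) of the kind of function where an explicit @inline hint is even plausible: a tiny, hot kernel. Small functions like this are usually inlined automatically anyway, which is why sprinkling the macro around rarely changes anything.

    # Tiny kernel; the compiler would almost certainly inline this on its own.
    @inline dist2(a::NTuple{2,Float64}, b::NTuple{2,Float64}) =
        (a[1] - b[1])^2 + (a[2] - b[2])^2

    # Forcing large, complex functions to inline is where regressions tend to appear,
    # so big loops like the neighbour search are better left to the compiler's heuristics.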

The best general pointers are all in here:

https://docs.julialang.org/en/v1/manual/performance-tips/

My questions were because I have a hunch you have a type instability on that line. Is this inside a function? Is TheCLL type-stable?
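
For example, something along these lines (the function and argument names here are placeholders for however the hot function is actually called):

    # Call the function the same way it is called in the benchmark and look for
    # red `Any` / `Union` annotations in the printed output.
    @code_warntype update_neighbours!(TheCLL, points)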

Thank you for all the interest @DNF, @mikmoore and @mbauman

I uploaded the code here; I hope it works directly for you - I just tested it and it seems to run from scratch for me :slight_smile:

It is a temporary repo, I hope to release it in a better format as I learn along the way.

EDIT: Any help / pointers you can provide me with would be wonderful! As you can see, I compare it against @lmiq’s wonderful CellListMap and see that for small random distributions my code wins out, but for bigger and more realistic systems it gets slower.

I am trying to learn this kind of data structure, since it is really important for the particle simulations I care about and needs to run on GPU later on. The function I mainly want to optimize is β€œCustomCLLβ€, but tips for other parts of the code would be wonderful too.

Kind regards

I have a suspicion that your problems are stemming from the overuse of @inline. You don’t want to inline big functions like that.

I made a test where I removed all @inline from every function and inner function, and saw basically no performance difference - of course it is cleaner to just remove the @inline :slight_smile:

Kind regards

Just to make sure: if the problem really is what you think it is, then in your profiler the majority of the time should be spent at the bottom of the stack in a function called Base.setindex!. If you are absolutely sure that there are no type instabilities (@code_warntype doesn’t indicate anything in red), you can probably only get minor improvements unless you find a better algorithm.
You could try to change your memory layout from a vector of tuples to a tuple of vectors, e.g. by using StructArrays.jl. This would also allow you to use LoopVectorization, which might make it faster.
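
For illustration, a minimal sketch of that layout change, assuming the list is a Vector{Tuple{Int,Int,Float64}} of (i, j, distance) as described above:

    using StructArrays

    n = 10_000

    # current layout: one array of (i, j, d) tuples
    list_aos = Vector{Tuple{Int,Int,Float64}}(undef, n)

    # struct-of-arrays layout: indexes like a vector of tuples,
    # but stores three contiguous arrays internally
    list_soa = StructArray{Tuple{Int,Int,Float64}}(undef, n)

    list_soa[1] = (1, 2, 0.5)    # same setindex! interface as before
    i, j, d = list_soa[1]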

Okay, I will double check this, but I am pretty sure, since the profiler has repeatedly shown the time to be in the indexing, and when I did a sanity check of commenting out the updating of the values, the code ran in 25% of the previous time.

I like the suggestion of using LoopVectorization; unfortunately, it did not work for the for loop I wrote and I have no idea why :slight_smile: I didn’t know I would need StructArrays.jl to make that work.

Kind regards

I think it is best if I quote the README of the package:

The key to the @turbo macro’s performance gains is leveraging knowledge of exactly how data like Float64s and Ints are handled by a CPU. As such, it is not straightforward to generalize the @turbo macro to work on arrays containing structs such as Matrix{Complex{Float64}}. Instead, it is currently recommended that users wishing to apply @turbo to arrays of structs use packages such as StructArrays.jl which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix multiply (gemm) kernel that works on matrices of Complex{Float64}s and Complex{Int}s:

I cannot guarantee that it will be faster, however. I think your problem is in essence the limited speed of access to memory, which is generally not super fast.
It might be that LoopVectorization uses better CPU instructions for that, but that does not necessarily have to be the case…

The if conditional inside your loop is quite bad for SIMD operations and thus for LoopVectorization.

I don’t think it’s entirely unreasonable that the write to memory is a bottleneck, but it’s possible that it’s a bigger one than it needs to be. Experimenting with different memory layouts is a good idea, but you don’t need a package for it: you could simply replace your ListOfInteractions field with three separate fields.
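
Something along these lines, for example (type and field names invented for illustration):

    # hand-rolled struct-of-arrays: no package needed
    struct InteractionList
        i::Vector{Int}        # first particle index
        j::Vector{Int}        # second particle index
        d::Vector{Float64}    # distance between the pair
    end

    InteractionList(n::Integer) =
        InteractionList(Vector{Int}(undef, n), Vector{Int}(undef, n), Vector{Float64}(undef, n))

    # storing one interaction writes to three small, homogeneous arrays
    # instead of one array of tuples
    function store!(list::InteractionList, k, i, j, d)
        @inbounds begin
            list.i[k] = i
            list.j[k] = j
            list.d[k] = d
        end
        return list
    end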

Okay, just to make sure I understand you correctly, your theory is that:

  1. Using 3 separate arrays to store the values in the tuple (i,j,d) (index 1, index 2, distance between them) would be more performant than storing everything in a vector of tuples?
  2. If I can get rid of the if-condition, the code should be even faster thanks to SIMD (which I do not know much about)? Perhaps I can do so by using some form of β€œd2 < CutOffSquaredβ€ without the explicit β€œifβ€.

Kind regards

  1. Using three separate arrays might be faster, since the individual arrays will be smaller and it might therefore be easier to write to them. My motivation was also that this would enable LoopVectorization, although it’s true that using StructArrays for this is overkill. There are probably a lot of thoughts one could have about this, but personally I would just try it out; in the end this is always something that needs to be tested and proven faster explicitly. @GunnarFarneback is right that I forgot the if condition will kill SIMD, so one would have to do other things about that.
    There are ways to circumvent explicit branches, such as ifelse, which actually evaluates both branches, so one would need to be a bit careful about that (a tiny illustration follows below).
    In hindsight I am not so sure whether it’s really worth the effort; if there is a way to improve the algorithm at other points, that would probably be much preferable.
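
A tiny illustration of that caveat (made-up functions):

    # `ifelse` is an ordinary function, so both argument expressions are
    # evaluated before the selection happens, unlike `cond ? a : b`.
    safe_inv(x) = ifelse(x == 0, 0.0, 1 / x)    # fine: 1/0 just gives Inf, which is discarded
    bad_sqrt(x) = ifelse(x < 0, 0.0, sqrt(x))   # bad_sqrt(-1.0) throws: sqrt(-1.0) runs anyway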

Loop vectorization is not really useful in this context because in cell list algorithms the particles are accessed in essentially random order from the array of coordinates. There are alternative algorithms that first cluster the particles to try to align them in memory before computing the distances, but whether that cost pays off or not is very dependent on the number of particles, density and cutoff of the specific application.

Hey, so this is what @profview returns:

And setindex! is present as you mentioned.

Then checking using @code_warntype

@code_warntype SingleIterationCLL(n=10000,d=2);
MethodInstance for (::var"#SingleIterationCLL##kw")(::NamedTuple{(:n, :d), Tuple{Int64, Int64}}, ::typeof(SingleIterationCLL))
  from (::var"#SingleIterationCLL##kw")(::Any, ::typeof(SingleIterationCLL)) in Main at c:\Users\Ahmed Salih\Documents\git\SPH-JULIA-DEVELOPMENT\src\MyOwnCellList\Step2_CellList_StaticArrays_GPU.jl:237
Arguments
  _::Core.Const(var"#SingleIterationCLL##kw"())
  @_2::NamedTuple{(:n, :d), Tuple{Int64, Int64}}
  @_3::Core.Const(SingleIterationCLL)
Locals
  n::Int64
  d::Int64
  r::Float64
  T::Type{Float64}
  @_8::Int64
  @_9::Int64
  @_10::Float64
  @_11::Type{Float64}
Body::CLL
1 ── %1  = Base.haskey(@_2, :n)::Core.Const(true)
β”‚          Core.typeassert(%1, Core.Bool)
β”‚          (@_8 = Base.getindex(@_2, :n))
└───       goto #3
2 ──       Core.Const(:(@_8 = 20))
3 ┄─ %6  = @_8::Int64
β”‚          (n = %6)
β”‚    %8  = Base.haskey(@_2, :d)::Core.Const(true)
β”‚          Core.typeassert(%8, Core.Bool)
β”‚          (@_9 = Base.getindex(@_2, :d))
└───       goto #5
4 ──       Core.Const(:(@_9 = 2))
5 ┄─ %13 = @_9::Int64
β”‚          (d = %13)
β”‚    %15 = Base.haskey(@_2, :r)::Core.Const(false)
└───       goto #7 if not %15
6 ──       Core.Const(:(@_10 = Base.getindex(@_2, :r)))
└───       Core.Const(:(goto %20))
7 ┄─       (@_10 = 0.1)
β”‚    %20 = @_10::Core.Const(0.1)
β”‚          (r = %20)
β”‚    %22 = Base.haskey(@_2, :T)::Core.Const(false)
└───       goto #9 if not %22
8 ──       Core.Const(:(@_11 = Base.getindex(@_2, :T)))
└───       Core.Const(:(goto %27))
9 ┄─       (@_11 = Main.Float64)
β”‚    %27 = @_11::Core.Const(Float64)
β”‚          (T = %27)
β”‚    %29 = (:n, :d, :r, :T)::Core.Const((:n, :d, :r, :T))
β”‚    %30 = Core.apply_type(Core.NamedTuple, %29)::Core.Const(NamedTuple{(:n, :d, :r, :T)})
β”‚    %31 = Base.structdiff(@_2, %30)::Core.Const(NamedTuple())
β”‚    %32 = Base.pairs(%31)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
β”‚    %33 = Base.isempty(%32)::Core.Const(true)
β”‚          Core.typeassert(%33, Core.Bool)
└───       goto #11
10 ─       Core.Const(:(Base.kwerr(@_2, @_3)))
11 β”„ %37 = Main.:(var"#SingleIterationCLL#31")(n, d, r::Core.Const(0.1), T::Core.Const(Float64), @_3)::CLL
└───       return %37

I don’t see any β€œAny”, but it painted CLL red:

Is this bad?

Thank you for your ifelse suggestion. I found a way to implement it as follows:

            n_idx_cells = length(indices_in_cell)
            @inbounds for ki = 1:n_idx_cells-1
                k_idx = indices_in_cell[ki]
                for kj = (ki+1):n_idx_cells
                    k_1up = indices_in_cell[kj]
                    d2 = distance_condition(p[k_idx], p[k_1up])

                    cond = d2 <= TheCLL.CutOffSquared

                    # If cond is true, use nl + 1 as the new index;
                    # otherwise write into the last slot of the preallocated list as a dummy target
                    ind = ifelse(cond, nl + 1, length(TheCLL.ListOfInteractions))
                    TheCLL.ListOfInteractions[ind] = (k_idx, k_1up, sqrt(d2))
                    # Advance nl only when cond is true
                    nl = ifelse(cond, ind, nl)
                end
            end

This avoids the if statement from the first version of the code. The runtime dropped by ~6 ms at N = 7000, which means that I am now ~2 ms faster than CellListMap instead of ~4 ms slower (at this N).

So that is pretty awesome :slight_smile: I still don’t understand exactly why, but I see that rewriting an if into ifelse in a hot loop can be of benefit, even though the code now calls setindex! and sqrt(d2) in every iteration.

If the solution is correct, then the sum of all values (i, j, d) should be very close to equal for the two implementations:
Success! All comparisons matched.
CellListMap
BenchmarkTools.Trial: 287 samples with 1 evaluation.
 Range (min … max):  16.594 ms … 33.671 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     17.026 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   17.404 ms Β±  2.112 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†β–ˆβ–†β–„
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–…β–…β–„β–„β–β–„β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„ β–…
  16.6 ms      Histogram: log(frequency) by time      31.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
CLL
BenchmarkTools.Trial: 309 samples with 1 evaluation.
 Range (min … max):  14.340 ms … 45.076 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     15.111 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   16.166 ms Β±  3.340 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–†β–ˆβ–‡β–ƒβ–‚β–     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–…β–ˆβ–…β–…β–„β–…β–„β–†β–…β–…β–…β–†β–„β–β–…β–β–„β–β–…β–β–β–β–β–β–„β–β–β–„β–β–„β–„β–β–„β–„β–β–β–β–β–β–β–β–„β–„β–β–β–β–„β–„ β–†
  14.3 ms      Histogram: log(frequency) by time      28.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

And here is a link to the code which produces this result:

Kind regards