How to speed up value updating in an array?

Hello!

I am running code I wrote to find the nearest neighbours between some points. Benchmarking my current version with @profview, I found that what now takes the longest time is not the computations but writing values into the array:

As seen in line 130, this takes the longest time. The array β€œListOfInteractions” is a preallocated vector of tuples.

Does anyone have some general pointers on how to improve this or can point me in the right direction?

To me it seems that my algorithm is working fine, and that updating the values is now the slowest part.

Kind regards

What’s the type of TheCLL.ListOfInteractions? Where is TheCLL defined? Is it a global?

Hey

No, it is a struct defined with concrete types, called inside of a function.

I am also using @views, @inbounds and @inline everywhere I can in the code.

Kind regards

I see that. It’s almost surely not effective in most (if not all) places there.

You really didn’t answer any of my questions, though. What is typeof(TheCLL.ListOfInteractions)?

My bad!

Here it is :slight_smile:

[screenshot of the typeof(TheCLL.ListOfInteractions) output]

Kind regards

I’ve never seen @inline sprinkled around like that. What is the purpose?

It’s better if you share code as text, not as images. It’s hard to copy and paste from images.

@inline is not a general performance annotation, nor does it even make sense in some of the places you’ve applied it. In most circumstances, it will provide something between negligible benefit and horrible regression in performance.

I’d recommend you remove it here. If we got better performance by simply putting it everywhere, we’d just have Julia put it everywhere for us. By default, the compiler makes inlining decisions automatically based on its best judgement. If you find a generalizable example where @inline does provide a big benefit, bring it up for discussion and we can see about updating the inlining algorithm.
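
For reference, a minimal sketch (with made-up helper names) of the kind of function where an explicit @inline hint is even plausible: a tiny, hot kernel. Small functions like this are usually inlined automatically anyway, which is why sprinkling the macro around rarely changes anything.

    # Tiny kernel; the compiler would almost certainly inline this on its own.
    @inline dist2(a::NTuple{2,Float64}, b::NTuple{2,Float64}) =
        (a[1] - b[1])^2 + (a[2] - b[2])^2

    # Forcing large, complex functions to inline is where regressions tend to appear,
    # so big loops like the neighbour search are better left to the compiler's heuristics.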

The best general pointers are all in here:

https://docs.julialang.org/en/v1/manual/performance-tips/

My questions were because I have a hunch you have a type instability on that line. Is this inside a function? Is TheCLL type-stable?
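
For example, something along these lines (the function and argument names here are placeholders for however the hot function is actually called):

    # Call the function the same way it is called in the benchmark and look for
    # red `Any` / `Union` annotations in the printed output.
    @code_warntype update_neighbours!(TheCLL, points)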

Thank you for all the interest @DNF, @mikmoore and @mbauman

I uploaded the code here; I hope it works directly for you - I just tested it and it seems to run from scratch for me :slight_smile:

It is a temporary repo, I hope to release it in a better format as I learn along the way.

EDIT: Any help / pointers you can provide me with would be wonderful! As you can see, I compare it against @lmiq’s wonderful CellListMap and see that for small random distributions my code wins out, but for bigger and more realistic systems it gets slower.

I am trying to learn this kind of data structure, since it is really important for the particle simulations I care about and needs to run on GPU later on. The function I mainly want to optimize is β€œCustomCLLβ€, but tips for other parts of the code would be wonderful too.

Kind regards

I have a suspicion that your problems are stemming from the overuse of @inline. You don’t want to inline big functions like that.

I made a test where I removed all @inline from every function and inner function, and saw basically no performance difference - of course it is cleaner to just remove the @inline :slight_smile:

Kind regards

Just to make sure: if the problem really is what you think it is, then in your profiler the majority of the time should be spent at the bottom of the stack in a function called Base.setindex!. If you are absolutely sure that there are no type instabilities (@code_warntype doesn’t indicate anything in red), you can probably only get minor improvements unless you find a better algorithm.
You could try to change your memory layout from a vector of tuples to a tuple of vectors, e.g. by using StructArrays.jl. This would also allow you to use LoopVectorization, which might make it faster.
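
For illustration, a minimal sketch of that layout change, assuming the list is a Vector{Tuple{Int,Int,Float64}} of (i, j, distance) as described above:

    using StructArrays

    n = 10_000

    # current layout: one array of (i, j, d) tuples
    list_aos = Vector{Tuple{Int,Int,Float64}}(undef, n)

    # struct-of-arrays layout: indexes like a vector of tuples,
    # but stores three contiguous arrays internally
    list_soa = StructArray{Tuple{Int,Int,Float64}}(undef, n)

    list_soa[1] = (1, 2, 0.5)    # same setindex! interface as before
    i, j, d = list_soa[1]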

Okay, I will double check this, but I am pretty sure, since the profiler has repeatedly shown the time to be in the indexing, and when I did a sanity check of commenting out the updating of the values, the code ran in 25% of the previous time.

I like the suggestion of using LoopVectorization; unfortunately, it did not work for the for loop I wrote and I have no idea why :slight_smile: I didn’t know I would need StructArrays.jl to make that work.

Kind regards

I think it is best if I quote the README of the package:

The key to the @turbo macro’s performance gains is leveraging knowledge of exactly how data like Float64s and Ints are handled by a CPU. As such, it is not straightforward to generalize the @turbo macro to work on arrays containing structs such as Matrix{Complex{Float64}}. Instead, it is currently recommended that users wishing to apply @turbo to arrays of structs use packages such as StructArrays.jl which transform an array where each element is a struct into a struct where each element is an array. Using StructArrays.jl, we can write a matrix multiply (gemm) kernel that works on matrices of Complex{Float64}s and Complex{Int}s:

I cannot guarantee that it will be faster, however. I think your problem is in essence the limited speed of access to memory, which is generally not super fast.
It might be that LoopVectorization uses better CPU instructions for that, but that does not necessarily have to be the case…

The if conditional inside your loop is quite bad for SIMD operations and thus for LoopVectorization.

I don’t think it’s entirely unreasonable that the write to memory is a bottleneck, but it’s possible that it’s a bigger one than it needs to be. Experimenting with different memory layouts is a good idea, but you don’t need a package for it: you could simply replace your ListOfInteractions field with three separate fields.
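
Something along these lines, for example (type and field names invented for illustration):

    # hand-rolled struct-of-arrays: no package needed
    struct InteractionList
        i::Vector{Int}        # first particle index
        j::Vector{Int}        # second particle index
        d::Vector{Float64}    # distance between the pair
    end

    InteractionList(n::Integer) =
        InteractionList(Vector{Int}(undef, n), Vector{Int}(undef, n), Vector{Float64}(undef, n))

    # storing one interaction writes to three small, homogeneous arrays
    # instead of one array of tuples
    function store!(list::InteractionList, k, i, j, d)
        @inbounds begin
            list.i[k] = i
            list.j[k] = j
            list.d[k] = d
        end
        return list
    end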

Okay, just to make sure I understand you correctly, your theory is that:

  1. Using 3 separate arrays to store the values in the tuple (i,j,d) (index 1, index 2, distance between them) would be more performant than storing everything in a vector of tuples?
  2. If I can get rid of the if-condition, the code should be even faster thanks to SIMD (which I do not know much about)? Perhaps I can do so by using some form of β€œd2 < CutOffSquaredβ€ without the explicit β€œifβ€.

Kind regards

  1. Using three separate arrays might be faster, since the individual arrays will be smaller and it might therefore be easier to write to them. My motivation was also that this would enable LoopVectorization, although it’s true that using StructArrays for this is overkill. There are probably a lot of thoughts one could have about this, but personally I would just try it out; in the end this is always something that needs to be tested and proven faster explicitly. @GunnarFarneback is right that I forgot the if condition will kill SIMD, so one would have to do other things about that.
    There are ways to circumvent explicit branches, such as ifelse, which actually evaluates both branches, so one would need to be a bit careful about that (a tiny illustration follows below).
    In hindsight I am not so sure whether it’s really worth the effort; if there is a way to improve the algorithm at other points, that would probably be much preferable.
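
A tiny illustration of that caveat (made-up functions):

    # `ifelse` is an ordinary function, so both argument expressions are
    # evaluated before the selection happens, unlike `cond ? a : b`.
    safe_inv(x) = ifelse(x == 0, 0.0, 1 / x)    # fine: 1/0 just gives Inf, which is discarded
    bad_sqrt(x) = ifelse(x < 0, 0.0, sqrt(x))   # bad_sqrt(-1.0) throws: sqrt(-1.0) runs anyway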

Loop vectorization is not really useful in this context because in cell list algorithms the particles are accessed in essentially random order from the array of coordinates. There are alternative algorithms that first cluster the particles to try to align them in memory before computing the distances, but whether that cost pays off or not is very dependent on the number of particles, density and cutoff of the specific application.

Hey, so this is what @profview returns:

And setindex! is present as you mentioned.

Then checking using @code_warntype

@code_warntype SingleIterationCLL(n=10000,d=2);
MethodInstance for (::var"#SingleIterationCLL##kw")(::NamedTuple{(:n, :d), Tuple{Int64, Int64}}, ::typeof(SingleIterationCLL))
  from (::var"#SingleIterationCLL##kw")(::Any, ::typeof(SingleIterationCLL)) in Main at c:\Users\Ahmed Salih\Documents\git\SPH-JULIA-DEVELOPMENT\src\MyOwnCellList\Step2_CellList_StaticArrays_GPU.jl:237
Arguments
  _::Core.Const(var"#SingleIterationCLL##kw"())
  @_2::NamedTuple{(:n, :d), Tuple{Int64, Int64}}
  @_3::Core.Const(SingleIterationCLL)
Locals
  n::Int64
  d::Int64
  r::Float64
  T::Type{Float64}
  @_8::Int64
  @_9::Int64
  @_10::Float64
  @_11::Type{Float64}
Body::CLL
1 ── %1  = Base.haskey(@_2, :n)::Core.Const(true)
β”‚          Core.typeassert(%1, Core.Bool)
β”‚          (@_8 = Base.getindex(@_2, :n))
└───       goto #3
2 ──       Core.Const(:(@_8 = 20))
3 ┄─ %6  = @_8::Int64
β”‚          (n = %6)
β”‚    %8  = Base.haskey(@_2, :d)::Core.Const(true)
β”‚          Core.typeassert(%8, Core.Bool)
β”‚          (@_9 = Base.getindex(@_2, :d))
└───       goto #5
4 ──       Core.Const(:(@_9 = 2))
5 ┄─ %13 = @_9::Int64
β”‚          (d = %13)
β”‚    %15 = Base.haskey(@_2, :r)::Core.Const(false)
└───       goto #7 if not %15
6 ──       Core.Const(:(@_10 = Base.getindex(@_2, :r)))
└───       Core.Const(:(goto %20))
7 ┄─       (@_10 = 0.1)
β”‚    %20 = @_10::Core.Const(0.1)
β”‚          (r = %20)
β”‚    %22 = Base.haskey(@_2, :T)::Core.Const(false)
└───       goto #9 if not %22
8 ──       Core.Const(:(@_11 = Base.getindex(@_2, :T)))
└───       Core.Const(:(goto %27))
9 ┄─       (@_11 = Main.Float64)
β”‚    %27 = @_11::Core.Const(Float64)
β”‚          (T = %27)
β”‚    %29 = (:n, :d, :r, :T)::Core.Const((:n, :d, :r, :T))
β”‚    %30 = Core.apply_type(Core.NamedTuple, %29)::Core.Const(NamedTuple{(:n, :d, :r, :T)})
β”‚    %31 = Base.structdiff(@_2, %30)::Core.Const(NamedTuple())
β”‚    %32 = Base.pairs(%31)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
β”‚    %33 = Base.isempty(%32)::Core.Const(true)
β”‚          Core.typeassert(%33, Core.Bool)
└───       goto #11
10 ─       Core.Const(:(Base.kwerr(@_2, @_3)))
11 β”„ %37 = Main.:(var"#SingleIterationCLL#31")(n, d, r::Core.Const(0.1), T::Core.Const(Float64), @_3)::CLL
└───       return %37

I don’t see any β€œAny”, but it painted CLL red:

Is this bad?

Thank you for your ifelse suggestion. I found a way to implement it as follows:

            n_idx_cells = length(indices_in_cell)
            @inbounds for ki = 1:n_idx_cells-1
                k_idx = indices_in_cell[ki]
                for kj = (ki+1):n_idx_cells
                    k_1up = indices_in_cell[kj]
                    d2 = distance_condition(p[k_idx], p[k_1up])

                    cond = d2 <= TheCLL.CutOffSquared

                    # If cond is true, use nl + 1 as the new index;
                    # otherwise write into the last slot of the preallocated list as a dummy target
                    ind = ifelse(cond, nl + 1, length(TheCLL.ListOfInteractions))
                    TheCLL.ListOfInteractions[ind] = (k_idx, k_1up, sqrt(d2))
                    # Advance nl only when cond is true
                    nl = ifelse(cond, ind, nl)
                end
            end

This avoids the if statement from the first version of the code. The runtime dropped by ~6 ms at N = 7000, which means that I am now ~2 ms faster than CellListMap instead of ~4 ms slower (at this N).

So that is pretty awesome :slight_smile: I still don’t understand exactly why, but I see that rewriting an if into ifelse in a hot loop can be of benefit, even though the code now calls setindex! and sqrt(d2) in every iteration.

If the solution is correct, then the sum of all values (i, j, d) should be very close to equal for the two implementations:
Success! All comparisons matched.
CellListMap
BenchmarkTools.Trial: 287 samples with 1 evaluation.
 Range (min … max):  16.594 ms … 33.671 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     17.026 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   17.404 ms Β±  2.112 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†β–ˆβ–†β–„
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–…β–…β–„β–„β–β–„β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„ β–…
  16.6 ms      Histogram: log(frequency) by time      31.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
CLL
BenchmarkTools.Trial: 309 samples with 1 evaluation.
 Range (min … max):  14.340 ms … 45.076 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     15.111 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   16.166 ms Β±  3.340 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–†β–ˆβ–‡β–ƒβ–‚β–     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–…β–ˆβ–…β–…β–„β–…β–„β–†β–…β–…β–…β–†β–„β–β–…β–β–„β–β–…β–β–β–β–β–β–„β–β–β–„β–β–„β–„β–β–„β–„β–β–β–β–β–β–β–β–„β–„β–β–β–β–„β–„ β–†
  14.3 ms      Histogram: log(frequency) by time      28.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

And here is a link to the code which produces this result:

Kind regards