# Possible to speed up this function for calculating cartesian indices?

Hello!

I have a small function which I call a lot:

``````function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end
``````
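For context, the loop maps each point to a grid cell: divide each coordinate by the cell size, floor it, and shift by 2 (an offset into a padded grid). A minimal single-point sketch with plain tuples (standing in for the `SVector` points used below):

```julia
# One point at (0.3, 0.7) with cell size 0.5:
p = (0.3, 0.7)
cutoff = 0.5
ci = CartesianIndex(Int.(fld.(p, cutoff))...)  # floored quotients: CartesianIndex(0, 1)
cell = ci + 2 * one(ci)                        # shift by (2, 2): CartesianIndex(2, 3)
```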

I’ve been trying to see if I can speed it up somehow. The current performance is:

``````using StaticArrays
using Polyester
using Chairmarks
using BenchmarkTools

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

# Number of points/cells to test
N         = 100_000
Dims      = 2
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells"); display(benchmark_result)

@benchmark BatchExtractCells!($Cells, $Points, $CutOff)
``````

``````BenchmarkTools.Trial: 769 samples with 1 evaluation.
 Range (min … max):  5.827 ms … 16.340 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.494 ms ±  1.193 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▃▃▂▁
  ████████████▇▅▇▆▅▅▅▅▆▆▇▆▅▄▁▅▅▆▁▅▄▄▆▇▄▅▄▁▆▄▁▁▁▄▄▄▄▆▅▄▁▁▄▄▁▅ ▇
  5.83 ms      Histogram: log(frequency) by time     11.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

I don’t think there is much more that can be done to speed it up, but I have been wrong in the past. I’d rather be wrong and get faster code than never know!

Kind regards

That’s some nice code, IMO. But I was able to get some significant improvement by hoisting the creation of the constant `CartesianIndex` offset out of the loop. On my machine:

``````julia> @benchmark BatchExtractCells!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 5510 samples with 1 evaluation.
 Range (min … max):  638.600 μs …   8.569 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     657.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   904.728 μs ± 497.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁▁  ▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁
  █████████████████████████████▇▇█▆▇▆▆▆▇▅▅▆▅▆▅▅▅▅▅▅▅▄▅▄▂▃▅▄▄▄▄▃ █
  639 μs        Histogram: log(frequency) by time       2.66 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark BatchExtractCells2!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 9784 samples with 1 evaluation.
 Range (min … max):  338.500 μs …   9.726 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     349.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   508.377 μs ± 409.123 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁     ▁▁  ▁ ▁▁                                             ▁
  ██████████████████▇██████▇▇▇▇▇▇██▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▄▅▅▅▄▅▅▅▅▄▅▄ █
  338 μs        Histogram: log(frequency) by time       1.98 ms <

 Memory estimate: 80 bytes, allocs estimate: 3.
``````

where

``````function BatchExtractCells2!(Cells, Points, CutOff)
    twoci = CartesianIndex(ntuple(_ -> 2, Dims))
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + twoci
    end
end
``````

You can get rid of the allocation if you make `Dims` a constant:

``````const Dims = 2
``````
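Alternatively, passing the dimension via `Val` lets `ntuple` unroll at compile time without relying on a global; a small sketch (the helper name `make_offset` is mine):

```julia
# With Val(N), ntuple is resolved at compile time, so building the
# constant offset index is free and allocation-free:
make_offset(::Val{N}) where N = CartesianIndex(ntuple(_ -> 2, Val(N)))

twoci = make_offset(Val(2))  # CartesianIndex(2, 2)
```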

Thank you, that is a good insight! I was under the impression that the Julia compiler would optimize that away, but I guess it can’t, since it is unsure of the size? Instead of building the tuple at all, I realized I could just fold the `+ 2` into the expression…

``````function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end
``````

Which gives me:

``````BatchExtractCells
3.240 ms
``````

And that is a 2x speed up as you showed!

Thanks


Does the precision of `fld` matter? If not, then floating-point division is much faster:

``````julia> x = rand(); y = 0.5;

julia> @btime Int(fld($x, $y));  # sometimes only 9 ns
  19.523 ns (0 allocations: 0 bytes)

julia> @btime floor(Int, $x / $y);
  4.090 ns (0 allocations: 0 bytes)
``````
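The precision really can matter: `fld` floors the mathematically exact quotient of the two stored floats, while `floor(x/y)` floors the already-rounded quotient, so the two can disagree. A known corner case:

```julia
# 0.1 is not exactly representable in Float64; the true quotient of the
# stored values is slightly below 60, so the two roundings disagree:
a = fld(6.0, 0.1)          # floor of the exact quotient: 59.0
b = floor(Int, 6.0 / 0.1)  # floor of the rounded quotient: 60
```

For a cell-binning use case, being off by one cell at exact boundaries may or may not be acceptable.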

I tried to integrate it into the loop inside `BatchExtractCells!`, but that slowed things down for some reason.
With

``````function BatchExtractCells3!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        t = map(Tuple(Points[i])) do x
            floor(Int, x / CutOff) + 2
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````

this gives:

``````julia> @btime BatchExtractCells!($Cells, $Points, $CutOff)
  5.172 ms (0 allocations: 0 bytes)

julia> @btime BatchExtractCells3!($Cells, $Points, $CutOff)
  318.036 μs (0 allocations: 0 bytes)
``````

Wow!

I just gave it a go. Here is the full script:

Script
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end

function BatchExtractCellsFloor!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

# Number of points/cells to test
N         = 100_000
Dims      = 2
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells!"); display(benchmark_result)

benchmark_result = @b BatchExtractCellsFloor!($Cells, $Points, $CutOff)
println("BatchExtractCellsFloor!"); display(benchmark_result)

# Fresh cell arrays for the correctness check
CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]

BatchExtractCells!(CellsA, Points, CutOff)
BatchExtractCellsFloor!(CellsB, Points, CutOff)

@test CellsA == CellsB
``````
``````BatchExtractCells!
677.500 μs
BatchExtractCellsFloor!
79.600 μs
Test Passed
``````

A 10x speedup on top of a solution that was already twice as fast!

Could you kindly explain to me why I would have to do `+4` instead of `+2` when using floor?

Kind regards

@matthias314 thank you for the last code bit!

I implemented it and it is even faster, even when not using `@batch`?

``````BatchExtractCells!
3.308 ms
BatchExtractCellsFloor!
326.400 μs
BatchExtractCells3!
313.000 μs
Test Passed
``````

When I try to enable `@batch`, it does not find the `CutOff` parameter inside the `map` closure for some reason. Do you know why?

Kind regards

Adding `4` after division by `0.5` is the same as adding `2` before the division. (I think you want to add `2` after the division.)
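Spelled out with `CutOff = 0.5`, where dividing by 0.5 doubles any pre-division shift:

```julia
x = 0.7
# +2 before the division becomes +4 after it (dividing by 0.5 doubles it):
@assert fld(x + 2, 0.5) == fld(x, 0.5) + 4
# the shift the original code intends is +2 *after* the division:
@assert Int(fld(x, 0.5)) + 2 == 3
```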


Indeed, thank you, makes total sense. Yes, I want to add 2 after the division as you mention.

With floating division:

With initial implementation:

I thought it would be fun for you to see what you have helped speed up, @PeterSimon and @matthias314. This is for my package `SPHExample` on GitHub.

Which is a nice speed-up to get for free, with no downside!

In simulations with more particles, this function is under more stress, and the gain is even more evident.

Kind regards and thank you


If this form gives a valid result, it appears to be faster still.

``````function BatchExtractCells4!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        # parentheses needed: without them, 2 + x > CutOff parses as (2 + x) > CutOff
        Cells[i] = CartesianIndex(2 + (Points[i][1] > CutOff), 2 + (Points[i][2] > CutOff))
    end
end
``````

This is faster to write:

``````@. CartesianIndex(2 + (first(Points) > 0.5), 2 + (last(Points) > 0.5))
``````
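The idea only holds in this benchmark's special case, where coordinates lie in [0, 1) and `CutOff = 0.5`: the floored quotient is then just 0 or 1, which the `>` comparison yields directly as a `Bool` that promotes to `Int` under addition:

```julia
# For x in [0, 1) with cutoff 0.5, the comparison replaces the division:
for x in (0.3, 0.7)
    @assert 2 + (x > 0.5) == floor(Int, x / 0.5) + 2  # Bool promotes to 0 or 1
end
```

(At exactly `x == 0.5` the two forms differ, though `rand()` essentially never produces that value.)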

Hi!

Unfortunately it does not, since no division by `CutOff` occurs

Kind regards

But your point about using a fixed value of `0.5` seems to matter. I did a test-script here:

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    benchmark_result = @b BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsC, Points, CutOff)

    @test CellsA == CellsB == CellsC
end
``````

With results:

``````BatchExtractCellsFloorOriginal!
79.700 μs
BatchExtractCellsFloorVal!
79.700 μs
BatchExtractCellsFloorFixed!
61.700 μs
Test Passed
``````

Quite interesting. I am not sure why a hard-coded value is better, when the compiler should know the same fixed value at compile time, especially when using `Val`.
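One verifiable piece of the puzzle (the rest is an assumption; I have not inspected the generated code): 0.5 is a power of two, so dividing by it is *exactly* a multiplication by 2.0, a strength reduction the compiler can safely apply to a hard-coded divisor. Whether it applies the same rewrite through the `Val` path is unclear:

```julia
# Division by a power of two is exact and bit-identical to multiplying
# by the reciprocal; for a general cutoff this rewrite would not be exact:
for x in (0.1, 0.7, 1.0e10, -3.25)
    @assert x / 0.5 === x * 2.0
end
```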

Kind regards

If even a small improvement is useful to you, test this way too.

``````function BatchExtractCells5!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        OffCut = 1 / CutOff
        t = map(Tuple(Points[i])) do x
            # body reconstructed from the discussion below: multiply by the
            # precomputed reciprocal and fold the +2 into a single muladd
            floor(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````
``````function BatchExtractCellsTrunc!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        OffCut = 1 / CutOff
        t = map(Tuple(Points[i])) do x
            # same as above, but with trunc instead of floor
            trunc(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````

However, I am curious to understand the real context in more detail, with respect to the domain of the `Points` vector and the `CutOff` value.
I am inclined to say that if `CutOff = mean(first.(Points))` (or `last.(Points)`), you could use the `<` comparison instead of the `floor` function.
Another observation concerns the shift of 2. If this is known/fixed, couldn’t we predefine/preload `Cells` with this value and avoid having to add it?
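The preloading idea could look like this sketch (hypothetical; `N` and the commented loop body are illustrative, and the array would need re-filling with the offset, not zeroing, before each pass):

```julia
N = 4
# Preload every cell with the constant (2, 2) offset...
Cells = fill(CartesianIndex(2, 2), N)
# ...so a later loop would only add the un-shifted index, e.g.
# Cells[i] += CartesianIndex(Int.(fld.(Tuple(Points[i]), CutOff)))
```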


I crave performance. If you can make it even faster, please do so! I tested your solution with some small modifications to enable multi-threading.

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorMap!(Cells, Points, ::Val{CutOff}) where CutOff
    function map_floor(x)
        floor(Int, x / CutOff) + 2
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        # multiply by the precomputed reciprocal and fold the +2 into the muladd
        floor(Int, muladd(x, InverseCutOff, 2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1 / CutOff

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorMap!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorMap!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $(Val(InverseCutOff)))
    println("BatchExtractCellsMuladd!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsD = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorMap!(CellsC, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsD, Points, CutOff)

    @test Cells0 == CellsA == CellsB == CellsC == CellsD
end
``````

The results as follows:

``````BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2750 samples with 1 evaluation.
Range (min … max):  1.121 ms … 28.540 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.394 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.813 ms ±  1.229 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▅▄▃▄▄▃▂▃▃▃▂▂▂▁▂▂▁▁                                        ▁
███████████████████████▇█▇▇▆▆▆▅▅▄▆▅▅▄▅▆▅▃▅▁▁▃▅▄▃▄▃▃▃▃▃▄▃▁▅ █
1.12 ms      Histogram: log(frequency) by time     6.78 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorOriginal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   79.600 μs …  27.497 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      81.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   157.044 μs ± 729.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▂▁▁                                                          ▁
█████▇▆▅▄▅▅▄▅▅▄▅▅▅▄▅▄▄▅▅▄▅▃▄▄▃▃▅▄▆▄▅▃▄▄▄▅▅▃▄▃▄▄▄▅▄▂▃▂▃▃▄▂▂▃▂▄ █
79.6 μs       Histogram: log(frequency) by time       1.55 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorVal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   79.700 μs …   9.897 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      81.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   127.487 μs ± 297.875 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁ ▁                                                          ▁
█████▇▇▆▆▆▄▄▅▄▅▅▄▅▅▅▄▄▄▄▄▄▄▃▅▄▅▄▂▄▄▄▄▄▄▄▄▄▅▄▄▄▄▅▄▄▄▄▄▃▃▃▄▃▂▃▄ █
79.7 μs       Histogram: log(frequency) by time       1.29 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorMap!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  55.400 μs …  13.771 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     57.200 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   91.524 μs ± 287.055 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁▁                                                          ▁
██████▇▇▆▇▆▆▅▅▄▄▄▅▄▄▅▄▅▅▆▅▅▆▄▄▄▄▄▄▅▁▄▅▅▄▁▅▄▄▄▄▄▅▄▅▄▄▄▄▄▄▄▄▅▄ █
55.4 μs       Histogram: log(frequency) by time       949 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  51.500 μs …  15.129 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     53.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   82.511 μs ± 277.106 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                                                            ▁
████▇█▇▇▆▅▅▅▅▅▅▄▅▃▄▄▄▅▃▄▄▄▄▄▄▄▃▃▂▄▄▄▃▄▃▃▃▃▄▄▃▄▄▄▄▄▃▃▂▅▃▃▃▃▄▄ █
51.5 μs       Histogram: log(frequency) by time       856 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorFixed!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   61.600 μs …  13.533 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      63.400 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   101.151 μs ± 348.971 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁▁▁                                                          ▁
███████▇▆▆▆▆▆▆▅▆▅▅▄▄▅▄▄▅▆▅▅▄▄▄▃▄▁▃▄▄▄▄▄▄▅▅▄▁▄▆▅▃▃▄▄▄▁▃▁▄▄▆▄▄▄ █
61.6 μs       Histogram: log(frequency) by time        913 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed
``````

Where, at `51.5 μs`, your solution is winning!

As I was writing this post, you added the `trunc` solution. Changing `floor` in `BatchExtractCellsMuladd!` to `trunc` now gives:

``````BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  39.400 μs …  15.730 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     40.800 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   67.832 μs ± 303.542 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁                                                           ▁
███▇▇█▇▇▆▆▆▅▁▃▄▄▃▄▃▃▃▄▁▄▄▁▁▃▄▃▄▃▄▄▃▁▁▄▃▄▁▁▃▃▁▄▃▄▃▄▁▅▃▄▁▄▁▃▃▄ █
39.4 μs       Histogram: log(frequency) by time       791 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

Which means you have taken the lead again (from yourself) with `39.4 μs`!

Amazing stuff!

By the way, I went back to `+2` since then it matches my original implementation. My mind is telling me it should be `+4` though…

The real context of this is to grid particles into cell blocks of a fixed size (CutOff). This is to prepare them for being looped through via a stencil operation to calculate particle properties.

Regarding the shift of 2: would initializing the array with 2s not have the same performance as adding 2 afterwards?

EDIT: I advise using `@benchmark` for this; `@b` fluctuates too much.

Kind regards

I think the addition of 2 inside `muladd` costs very little.
As a conceptual matter, I was wondering whether, given that `Cells` must(?) be predefined with values of 0 and then have 2 added, it would not be appropriate to define it in advance with initial values of 2.
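For reference, `muladd(x, a, b)` computes `x*a + b` and may fuse it into a single FMA instruction, which is why carrying the `+2` inside it is nearly free. A quick check that it matches the floor-then-add form (assuming the reciprocal trick with `CutOff = 0.5`):

```julia
x, inv_cutoff = 0.7, 2.0  # inv_cutoff = 1 / CutOff
fused = floor(Int, muladd(x, inv_cutoff, 2))  # one fused multiply-add
plain = floor(Int, x / 0.5) + 2               # divide, floor, then shift
```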

Got it. Yes, conceptually, it would be fine to allocate `2` and then add the calculation on top. I prefer having it as `0` initially, because I need to reset the vector each time step for my actual use case, and then it is cleaner in a “mental model” to reset it to `0` in my opinion.

Building on your solution, @rocco_sprmnt21, I found that `unsafe_trunc` produces even faster code, which is very nice.

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        # unsafe_trunc skips the range/NaN checks that trunc performs
        unsafe_trunc(Int, muladd(x, InverseCutOff, 2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1 / CutOff
    ValInverseCutOff = Val(InverseCutOff)

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $ValInverseCutOff)
    println("BatchExtractCellsMuladd!"); display(benchmark_result)

    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsMuladd!(CellsA, Points, ValInverseCutOff)

    @test Cells0 == CellsA
end
``````
``````BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2817 samples with 1 evaluation.
Range (min … max):  1.117 ms … 32.984 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.258 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.770 ms ±  1.531 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▅▃▂▃▂▂▂▃▃▂▂▂▂▁▁                                           ▁
███████████████████▇█▇▇▆▆▇▆▅▅▃▄▅▅▅▄▅▅▆▆▅▄▅▃▆▃▃▅▃▁▁▁▅▄▃▁▁▄▄ █
1.12 ms      Histogram: log(frequency) by time      7.3 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  28.800 μs …  34.056 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     29.000 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   85.681 μs ± 832.679 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                                                            ▁
███▇▇██▇▇▆▇▅▅▅▄▄▄▅▅▃▄▅▄▁▃▄▄▃▃▄▁▄▄▄▅▃▁▄▄▃▄▃▃▄▁▄▃▁▄▃▁▃▄▃▄▄▄▃▁▃ █
28.8 μs       Histogram: log(frequency) by time       955 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed
``````

The script above has the original implementation and the current fastest one. `unsafe_trunc` is fine to use, as long as I can guarantee that my `Points` are never `NaN` - which I can for the time being.

Any more ideas, anyone?

Kind regards

Note that `unsafe_trunc` will also cause problems for values too big to fit in an `Int`
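To make the failure mode concrete: `trunc(Int, x)` raises an `InexactError` when `x` does not fit in an `Int`, while `unsafe_trunc` silently returns an unspecified value:

```julia
big = 1.0e20  # well above typemax(Int64) ≈ 9.22e18
threw = try
    trunc(Int, big)
    false
catch err
    err isa InexactError  # trunc detects the overflow
end
# unsafe_trunc does not error; its result for out-of-range input is
# unspecified (platform-dependent), so no particular value is asserted:
garbage = unsafe_trunc(Int, big)
```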


You could try doing all the value calculations as StaticArray elements and then interpreting the tuples as Cartesian indices at the end
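One way to read that suggestion (a sketch under assumptions, not benchmarked; plain tuples stand in for `SVector` elements here, since both share the `NTuple{N, Int}` memory layout of `CartesianIndex{N}`):

```julia
pts = [(0.3, 0.7), (0.9, 0.1)]
cutoff = 0.5
# compute the shifted integer indices as plain tuples...
idx = [map(x -> floor(Int, x / cutoff) + 2, p) for p in pts]
# ...then view the same memory as CartesianIndex{2} without copying:
cells = reinterpret(CartesianIndex{2}, idx)
```

`reinterpret` works here because both element types are isbits with identical size, so no conversion loop is needed at all.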

@Oscar_Smith, thanks for highlighting that; I just checked, and I should realistically never get there.

@rocco_sprmnt21 gave it a go, but was not able to speed the code up further, either by using StaticArrays or tuples, etc.

I think this is the fastest possible for now, then?