Possible to speed up this function for calculating Cartesian indices?

Hello!

I have a small function which I call a lot:

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci) 
    end
end

I’ve been trying to see if I can speed it up somehow. The current performance is:

using StaticArrays
using Polyester
using Chairmarks
using BenchmarkTools

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci) 
    end
end

# Number of points/cells to test
N         = 100_000
Dims      = 2 
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]


benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells"); display(benchmark_result)

@benchmark BatchExtractCells!($Cells, $Points, $CutOff)

BenchmarkTools.Trial: 769 samples with 1 evaluation.
 Range (min … max):  5.827 ms … 16.340 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.494 ms ±  1.193 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▃▃▂▁
  ████████████▇▅▇▆▅▅▅▅▆▆▇▆▅▄▁▅▅▆▁▅▄▄▆▇▄▅▄▁▆▄▁▁▁▄▄▄▄▆▅▄▁▁▄▄▁▅ ▇
  5.83 ms      Histogram: log(frequency) by time     11.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I don’t think there is much more to do to speed it up, but I have been wrong in the past. I’d rather be wrong and get faster code than not know :slight_smile:

Kind regards

That’s some nice code, IMO. But I was able to get a significant improvement by hoisting the CartesianIndex creation out of the loop. On my machine:

julia> @benchmark BatchExtractCells!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 5510 samples with 1 evaluation.
 Range (min … max):  638.600 μs …   8.569 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     657.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   904.728 μs ± 497.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁▁  ▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁
  █████████████████████████████▇▇█▆▇▆▆▆▇▅▅▆▅▆▅▅▅▅▅▅▅▄▅▄▂▃▅▄▄▄▄▃ █
  639 μs        Histogram: log(frequency) by time       2.66 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark BatchExtractCells2!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 9784 samples with 1 evaluation.
 Range (min … max):  338.500 μs …   9.726 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     349.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   508.377 μs ± 409.123 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁     ▁▁  ▁ ▁▁                                             ▁
  ██████████████████▇██████▇▇▇▇▇▇██▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▄▅▅▅▄▅▅▅▅▄▅▄ █
  338 μs        Histogram: log(frequency) by time       1.98 ms <

 Memory estimate: 80 bytes, allocs estimate: 3.

where

function BatchExtractCells2!(Cells, Points, CutOff)
    twoci = CartesianIndex(ntuple(_ -> 2, Dims))
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + twoci
    end
end

You can get rid of the allocation if you make Dims a constant:

const Dims = 2
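
Alternatively (a hypothetical sketch, not from the thread), the dimension can be read off the point element type itself, so the `ntuple` length is a compile-time constant without any global:

```julia
# Hypothetical single-threaded variant: D comes from the element type of
# Points, so `ntuple(_ -> 2, Val(D))` is fully inferable and allocates
# nothing. Plain NTuple points keep the sketch self-contained; an SVector
# method would look the same.
function ExtractCellsTyped!(Cells, Points::AbstractVector{NTuple{D,T}}, CutOff) where {D,T}
    twoci = CartesianIndex(ntuple(_ -> 2, Val(D)))
    for i in eachindex(Cells)
        ci = CartesianIndex(map(x -> Int(fld(x, CutOff)), Points[i]))
        Cells[i] = ci + twoci
    end
    return Cells
end

# Usage:
# ExtractCellsTyped!([CartesianIndex(0, 0)], [(0.7, 0.3)], 0.5)  # -> [CartesianIndex(3, 2)]
```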

Thank you, that is a good insight! I was under the impression that the Julia compiler would optimize that away, but I guess it can’t since it is unsure of the size? Instead of making the tuple at all, I realized I could just do “+ 2” inside the fld call…

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end

Which gives me:

BatchExtractCells
3.240 ms

And that is a 2x speed up as you showed!

Thanks :slight_smile:


Does the precision of fld matter? If not, then floating-point division is much faster:

julia> x = rand(); y = 0.5;

julia> @btime Int(fld($x, $y));  # sometimes only 9ns
  19.523 ns (0 allocations: 0 bytes)

julia> @btime floor(Int, $x/$y);
  4.090 ns (0 allocations: 0 bytes)
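
As a sanity check (my addition): for this benchmark’s CutOff = 0.5, division by 0.5 is exact in Float64 (it only bumps the exponent), so the faster form gives identical results here:

```julia
# x / 0.5 is exact, so rounding x / y before flooring cannot cross an
# integer boundary, and floor(Int, x / y) matches the exactly rounded
# Int(fld(x, y)) for every nonnegative coordinate.
y = 0.5
all(Int(fld(x, y)) == floor(Int, x / y) for x in rand(10_000))  # true
```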

I tried to integrate it into the loop inside BatchExtractCells!, but that slowed things down for some reason.
With

function BatchExtractCells3!(Cells, Points, CutOff)
    # @batch per=thread
    for i ∈ eachindex(Cells)
        t = map(Tuple(Points[i])) do x
            floor(Int, x/CutOff)+2
        end
        Cells[i] = CartesianIndex(t)
    end
end

this gives:

julia> @btime BatchExtractCells!($Cells, $Points, $CutOff)
  5.172 ms (0 allocations: 0 bytes)

julia> @btime BatchExtractCells3!($Cells, $Points, $CutOff)
  318.036 μs (0 allocations: 0 bytes)

Wow!

I just gave it a go. Here is the full script:

Script
using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Base.Threads
using Test

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end

function BatchExtractCellsFloor!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i]   = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end


# Number of points/cells to test
N         = 100_000
Dims      = 2 
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells!"); display(benchmark_result)

benchmark_result = @b BatchExtractCellsFloor!($Cells, $Points, $CutOff)
println("BatchExtractCellsFloor!"); display(benchmark_result)

# Initialize Cells as arrays of CartesianIndex{Dims}, dummy initialization
CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]

BatchExtractCells!(CellsA, Points, CutOff)
BatchExtractCellsFloor!(CellsB, Points, CutOff)

@test CellsA == CellsB
BatchExtractCells!
677.500 μs
BatchExtractCellsFloor!
79.600 μs
Test Passed

10x faster than a solution which is already twice as fast!

Could you kindly explain to me why I would have to do +4 instead of +2 when using floor?

Kind regards

@matthias314 thank you for the last code bit!

I implemented it and it is even faster, even when not using @batch?

BatchExtractCells!
3.308 ms
BatchExtractCellsFloor!
326.400 μs
BatchExtractCells3!
313.000 μs
Test Passed

When I try to enable @batch it does not find the CutOff parameter inside the map closure for some reason. Do you know why?

Kind regards

Adding 4 after division by 0.5 is the same as adding 2 before the division. (I think you want to add 2 after the division.)
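
The equivalence can be checked in one line (my addition, assuming CutOff = 0.5 as in the benchmark, so the pre-division shift of 2 becomes 2 / CutOff = 4 afterwards):

```julia
# Shifting by 2 before dividing by 0.5 shifts the quotient by 4, not 2.
x, CutOff = 0.7, 0.5
Int(fld(x + 2, CutOff)) == floor(Int, x / CutOff) + 4  # true
# ...whereas the intended result is floor(Int, x / CutOff) + 2
```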


Indeed, thank you, makes total sense. Yes, I want to add 2 after the division as you mention.

With floating division: (screenshot of simulation timings omitted)

With initial implementation: (screenshot of simulation timings omitted)

I thought it would be fun for you to see what you have helped speed up, @PeterSimon and @matthias314. This is for my package SPHExample on Github.

Which is a nice speed-up to get for free, with no downside!

In simulations with more particles, this function is under more stress and the gain is even more evident :blush:

Kind regards and thank you


If this form gives a valid result, it appears to be faster still.

function BatchExtractCells4!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(2+Points[i][1]>CutOff, 2+Points[i][2]>CutOff)
    end
end

This is faster to write :grinning:

@. CartesianIndex(2 + (first(Points) >0.5), 2 + (last(Points)>0.5))

Hi!

It does not, unfortunately: `2+Points[i][1]>CutOff` parses as `(2 + Points[i][1]) > CutOff`, so no division by CutOff occurs and each component is just a Bool :slight_smile:

Kind regards

But your point about using a fixed value of 0.5 seems to matter. I did a test-script here:

Summary
using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Base.Threads
using Test


function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i]   = CartesianIndex(@. floor(Int, 4 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2 
    FloatType = Float64
    CutOff    = 0.5

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    benchmark_result = @b BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsC, Points, CutOff)

    @test CellsA == CellsB == CellsC
end

With results:

BatchExtractCellsFloorOriginal!
79.700 μs
BatchExtractCellsFloorVal!
79.700 μs
BatchExtractCellsFloorFixed!
61.700 μs
Test Passed

Quite interesting. I am not sure why a hard-coded value is better, when the compiler should know the same fixed value at compile time, especially when using Val.

Kind regards

If even a small improvement is useful to you, test this way too.


function BatchExtractCells5!(Cells, Points, CutOff)
    OffCut = 1 / CutOff
    for i ∈ eachindex(Cells)
        t = map(Tuple(Points[i])) do x
            floor(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end

function BatchExtractCellsTrunc!(Cells, Points, CutOff)
    OffCut = 1 / CutOff
    for i ∈ eachindex(Cells)
        t = map(Tuple(Points[i])) do x
            trunc(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end
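
A spot check (my addition, not from the thread) that the reciprocal-plus-muladd form matches the plain floor-of-division form on this benchmark’s domain (nonnegative coordinates, CutOff = 0.5); in principle a fused muladd could round differently exactly at a cell boundary, but random coordinates essentially never land there:

```julia
CutOff = 0.5
OffCut = 1 / CutOff
xs = rand(10_000)

# Same cell index from both forms...
all(floor(Int, muladd(x, OffCut, 2)) == floor(Int, x / CutOff) + 2 for x in xs)  # true

# ...and trunc coincides with floor here, because muladd(x, OffCut, 2) >= 2.
all(trunc(Int, muladd(x, OffCut, 2)) == floor(Int, muladd(x, OffCut, 2)) for x in xs)  # true
```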

However, I am curious to understand the real context in more detail, with respect to the domain of the Points vector and the CutOff value.
I am tempted to say that if CutOff = mean(first.(Points)) or mean(last.(Points)), you could use a < comparison instead of the floor function.
Another observation concerns the shift of 2: if this is known/fixed, couldn’t we preinitialize Cells with this value and then avoid having to add it?


I crave performance. If you can make it even faster, please do so! I tested your solution with some small modifications to enable multi-threading.

Summary
using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Base.Threads
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci) 
    end
end

function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorMap!(Cells, Points, ::Val{CutOff}) where CutOff
    function map_floor(x)
        floor(Int, x / CutOff) + 2
    end
    
    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        floor(Int, muladd(x,InverseCutOff,2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i]   = CartesianIndex(@. floor(Int, 2 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2 
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1/CutOff

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorMap!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorMap!"); display(benchmark_result)

    
    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $(Val(InverseCutOff)))
    println("BatchExtractCellsMuladd!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsD = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorMap!(CellsC, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsD, Points, CutOff)

    @test Cells0 == CellsA == CellsB == CellsC == CellsD
end

The results are as follows:

BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2750 samples with 1 evaluation.
 Range (min … max):  1.121 ms … 28.540 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.394 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.813 ms ±  1.229 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▄▃▄▄▃▂▃▃▃▂▂▂▁▂▂▁▁                                        ▁
  ███████████████████████▇█▇▇▆▆▆▅▅▄▆▅▅▄▅▆▅▃▅▁▁▃▅▄▃▄▃▃▃▃▃▄▃▁▅ █
  1.12 ms      Histogram: log(frequency) by time     6.78 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorOriginal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   79.600 μs …  27.497 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      81.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   157.044 μs ± 729.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▁▁                                                          ▁
  █████▇▆▅▄▅▅▄▅▅▄▅▅▅▄▅▄▄▅▅▄▅▃▄▄▃▃▅▄▆▄▅▃▄▄▄▅▅▃▄▃▄▄▄▅▄▂▃▂▃▃▄▂▂▃▂▄ █
  79.6 μs       Histogram: log(frequency) by time       1.55 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorVal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   79.700 μs …   9.897 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      81.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   127.487 μs ± 297.875 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁ ▁                                                          ▁
  █████▇▇▆▆▆▄▄▅▄▅▅▄▅▅▅▄▄▄▄▄▄▄▃▅▄▅▄▂▄▄▄▄▄▄▄▄▄▅▄▄▄▄▅▄▄▄▄▄▃▃▃▄▃▂▃▄ █
  79.7 μs       Histogram: log(frequency) by time       1.29 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorMap!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  55.400 μs …  13.771 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.524 μs ± 287.055 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁                                                          ▁
  ██████▇▇▆▇▆▆▅▅▄▄▄▅▄▄▅▄▅▅▆▅▅▆▄▄▄▄▄▄▅▁▄▅▅▄▁▅▄▄▄▄▄▅▄▅▄▄▄▄▄▄▄▄▅▄ █
  55.4 μs       Histogram: log(frequency) by time       949 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  51.500 μs …  15.129 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.511 μs ± 277.106 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                            ▁
  ████▇█▇▇▆▅▅▅▅▅▅▄▅▃▄▄▄▅▃▄▄▄▄▄▄▄▃▃▂▄▄▄▃▄▃▃▃▃▄▄▃▄▄▄▄▄▃▃▂▅▃▃▃▃▄▄ █
  51.5 μs       Histogram: log(frequency) by time       856 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorFixed!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   61.600 μs …  13.533 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      63.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.151 μs ± 348.971 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁▁                                                          ▁
  ███████▇▆▆▆▆▆▆▅▆▅▅▄▄▅▄▄▅▆▅▅▄▄▄▃▄▁▃▄▄▄▄▄▄▅▅▄▁▄▆▅▃▃▄▄▄▁▃▁▄▄▆▄▄▄ █
  61.6 μs       Histogram: log(frequency) by time        913 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed

So at 51.5 μs your solution is winning!

As I was writing this post, you added the trunc solution. Changing floor in BatchExtractCellsMuladd! to trunc now gives:

BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.400 μs …  15.730 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     40.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   67.832 μs ± 303.542 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁                                                           ▁
  ███▇▇█▇▇▆▆▆▅▁▃▄▄▃▄▃▃▃▄▁▄▄▁▁▃▄▃▄▃▄▄▃▁▁▄▃▄▁▁▃▃▁▄▃▄▃▄▁▅▃▄▁▄▁▃▃▄ █
  39.4 μs       Histogram: log(frequency) by time       791 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which means you have taken the lead again (from yourself) with 39.4 μs!

Amazing stuff!

By the way, I went back to +2 since then it matches my original implementation. My mind is telling me it should be +4 though…

The real context of this is to grid particles into cell blocks of a fixed size (CutOff). This is to prepare them for being looped through via a stencil operation to calculate particle properties.

Regarding the shift of 2: would initializing the array with 2s not have the same performance as adding 2 afterwards?

EDIT: I advise using @benchmark for this; @b fluctuates too much.

Kind regards

I think the addition of 2 within the muladd costs little.
As a conceptual matter, I was wondering: given that Cells must(?) be predefined with values of 0 and then have 2 added, would it not be appropriate to initialize it with values of 2 in the first place?

Got it. Yes, conceptually, it would be fine to initialize with 2 and then add the calculation on top. I prefer having it as 0 initially, because I need to reset the vector each time step for my actual use case, and then it is cleaner as a mental model to reset it to 0, in my opinion.

Building on your solution @rocco_sprmnt21 I found that unsafe_trunc produces an even faster code, which is very nice.

Summary
using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Base.Threads
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci) 
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        unsafe_trunc(Int, muladd(x,InverseCutOff,2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2 
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1/CutOff
    ValInverseCutOff = Val(InverseCutOff)

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)
    
    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $ValInverseCutOff)
    println("BatchExtractCellsMuladd!"); display(benchmark_result)


    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsMuladd!(CellsA, Points, ValInverseCutOff)

    @test Cells0 == CellsA
end

BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2817 samples with 1 evaluation.
 Range (min … max):  1.117 ms … 32.984 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.258 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.770 ms ±  1.531 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▃▂▃▂▂▂▃▃▂▂▂▂▁▁                                           ▁
  ███████████████████▇█▇▇▆▆▇▆▅▅▃▄▅▅▅▄▅▅▆▆▅▄▅▃▆▃▃▅▃▁▁▁▅▄▃▁▁▄▄ █
  1.12 ms      Histogram: log(frequency) by time      7.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.800 μs …  34.056 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   85.681 μs ± 832.679 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                            ▁
  ███▇▇██▇▇▆▇▅▅▅▄▄▄▅▅▃▄▅▄▁▃▄▄▃▃▄▁▄▄▄▅▃▁▄▄▃▄▃▃▄▁▄▃▁▄▃▁▃▄▃▄▄▄▃▁▃ █
  28.8 μs       Histogram: log(frequency) by time       955 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed

The script above has the original implementation and the current fastest one. unsafe_trunc is fine to use, as long as I can guarantee that my Points are never NaN - which I can for the time being.

Any more ideas, anyone?

Kind regards

Note that unsafe_trunc will also cause problems for values that don’t fit in an Int.
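
To illustrate (my addition): out-of-range inputs make trunc throw, while unsafe_trunc silently returns an unspecified value:

```julia
julia> trunc(Int, 1e300)
ERROR: InexactError: trunc(Int64, 1.0e300)

julia> unsafe_trunc(Int, 1e300);  # no error, but the result is unspecified

julia> unsafe_trunc(Int, NaN);    # likewise unspecified, hence the NaN guarantee above
```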


You could try doing all the value calculations as StaticArray elements and then interpreting the tuples as Cartesian indices at the end.

@Oscar_Smith, thanks for highlighting that; I just checked, and I should realistically never get there.

@rocco_sprmnt21 gave it a go, but was not able to speed the code up by using StaticArray, tuples, etc.

I think this is the fastest possible for now, then?
I think this is fastest possible for now then?