# Possible to speed up this function for calculating cartesian indices?

Hello!

I have a small function which I call a lot:

``````function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end
``````
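For context, the loop maps each point to a grid cell: divide each coordinate by the cell size, floor it, and shift by 2 (an offset into a padded grid). A minimal single-point sketch with plain tuples (standing in for the `SVector` points used below):

```julia
# One point at (0.3, 0.7) with cell size 0.5:
p = (0.3, 0.7)
cutoff = 0.5
ci = CartesianIndex(Int.(fld.(p, cutoff))...)  # floored quotients: CartesianIndex(0, 1)
cell = ci + 2 * one(ci)                        # shift by (2, 2): CartesianIndex(2, 3)
```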

I’ve been trying to see if I can speed it up somehow. The current performance is:

``````using StaticArrays
using Polyester
using Chairmarks
using BenchmarkTools

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

# Number of points/cells to test
N         = 100_000
Dims      = 2
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells"); display(benchmark_result)

@benchmark BatchExtractCells!($Cells, $Points, $CutOff)
``````

``````BenchmarkTools.Trial: 769 samples with 1 evaluation.
 Range (min … max):  5.827 ms … 16.340 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.494 ms ±  1.193 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▅▃▃▂▁
  ████████████▇▅▇▆▅▅▅▅▆▆▇▆▅▄▁▅▅▆▁▅▄▄▆▇▄▅▄▁▆▄▁▁▁▄▄▄▄▆▅▄▁▁▄▄▁▅ ▇
  5.83 ms      Histogram: log(frequency) by time     11.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

I don’t think there is much more that can be done to speed it up, but I have been wrong in the past. I’d rather be wrong and get faster code than never know!

Kind regards

That’s some nice code, IMO. But I was able to get some significant improvement by hoisting the creation of the constant `CartesianIndex` offset out of the loop. On my machine:

``````julia> @benchmark BatchExtractCells!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 5510 samples with 1 evaluation.
 Range (min … max):  638.600 μs …   8.569 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     657.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   904.728 μs ± 497.728 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁▁  ▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁
  █████████████████████████████▇▇█▆▇▆▆▆▇▅▅▆▅▆▅▅▅▅▅▅▅▄▅▄▂▃▅▄▄▄▄▃ █
  639 μs        Histogram: log(frequency) by time       2.66 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark BatchExtractCells2!($Cells, $Points, $CutOff)
BenchmarkTools.Trial: 9784 samples with 1 evaluation.
 Range (min … max):  338.500 μs …   9.726 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     349.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   508.377 μs ± 409.123 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▁▁     ▁▁  ▁ ▁▁                                             ▁
  ██████████████████▇██████▇▇▇▇▇▇██▇▇▇▇▆▆▆▆▆▆▆▆▆▅▅▅▄▅▅▅▄▅▅▅▅▄▅▄ █
  338 μs        Histogram: log(frequency) by time       1.98 ms <

 Memory estimate: 80 bytes, allocs estimate: 3.
``````

where

``````function BatchExtractCells2!(Cells, Points, CutOff)
    twoci = CartesianIndex(ntuple(_ -> 2, Dims))
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + twoci
    end
end
``````

You can get rid of the allocation if you make `Dims` a constant:

``````const Dims = 2
``````
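Alternatively, passing the dimension via `Val` lets `ntuple` unroll at compile time without relying on a global; a small sketch (the helper name `make_offset` is mine):

```julia
# With Val(N), ntuple is resolved at compile time, so building the
# constant offset index is free and allocation-free:
make_offset(::Val{N}) where N = CartesianIndex(ntuple(_ -> 2, Val(N)))

twoci = make_offset(Val(2))  # CartesianIndex(2, 2)
```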

Thank you, that is a good insight! I was under the impression that the Julia compiler would optimize that away, but I guess it can’t, since it is unsure of the size? Instead of building the tuple at all, I realized I could just fold the `+ 2` into the expression…

``````function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end
``````

Which gives me:

``````BatchExtractCells
3.240 ms
``````

And that is a 2x speed up as you showed!

Thanks


Does the precision of `fld` matter? If not, then floating-point division is much faster:

``````julia> x = rand(); y = 0.5;

julia> @btime Int(fld($x, $y));  # sometimes only 9 ns
  19.523 ns (0 allocations: 0 bytes)

julia> @btime floor(Int, $x / $y);
  4.090 ns (0 allocations: 0 bytes)
``````
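The precision really can matter: `fld` floors the mathematically exact quotient of the two stored floats, while `floor(x/y)` floors the already-rounded quotient, so the two can disagree. A known corner case:

```julia
# 0.1 is not exactly representable in Float64; the true quotient of the
# stored values is slightly below 60, so the two roundings disagree:
a = fld(6.0, 0.1)          # floor of the exact quotient: 59.0
b = floor(Int, 6.0 / 0.1)  # floor of the rounded quotient: 60
```

For a cell-binning use case, being off by one cell at exact boundaries may or may not be acceptable.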

I tried to integrate it into the loop inside `BatchExtractCells!`, but that slowed things down for some reason.
With

``````function BatchExtractCells3!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        t = map(Tuple(Points[i])) do x
            floor(Int, x / CutOff) + 2
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````

this gives:

``````julia> @btime BatchExtractCells!($Cells, $Points, $CutOff)
  5.172 ms (0 allocations: 0 bytes)

julia> @btime BatchExtractCells3!($Cells, $Points, $CutOff)
  318.036 μs (0 allocations: 0 bytes)
``````

Wow!

I just gave it a go. Here is the full script:

Script
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCells!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. Int(fld(Points[i] + 2, CutOff))...)
    end
end

function BatchExtractCellsFloor!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

# Number of points/cells to test
N         = 100_000
Dims      = 2
FloatType = Float64
CutOff    = 0.5

# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

# Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

benchmark_result = @b BatchExtractCells!($Cells, $Points, $CutOff)
println("BatchExtractCells!"); display(benchmark_result)

benchmark_result = @b BatchExtractCellsFloor!($Cells, $Points, $CutOff)
println("BatchExtractCellsFloor!"); display(benchmark_result)

# Fresh cell arrays for the correctness check
CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]

BatchExtractCells!(CellsA, Points, CutOff)
BatchExtractCellsFloor!(CellsB, Points, CutOff)

@test CellsA == CellsB
``````
``````BatchExtractCells!
677.500 μs
BatchExtractCellsFloor!
79.600 μs
Test Passed
``````

A 10x speedup on top of a solution that was already twice as fast!

Could you kindly explain to me why I would have to do `+4` instead of `+2` when using floor?

Kind regards

@matthias314 thank you for the last code bit!

I implemented it and it is even faster, even when not using `@batch`?

``````BatchExtractCells!
3.308 ms
BatchExtractCellsFloor!
326.400 μs
BatchExtractCells3!
313.000 μs
Test Passed
``````

When I try to enable `@batch`, it does not find the `CutOff` parameter inside the `map` closure for some reason. Do you know why?

Kind regards

Adding `4` after division by `0.5` is the same as adding `2` before the division. (I think you want to add `2` after the division.)
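Spelled out with `CutOff = 0.5`, where dividing by 0.5 doubles any pre-division shift:

```julia
x = 0.7
# +2 before the division becomes +4 after it (dividing by 0.5 doubles it):
@assert fld(x + 2, 0.5) == fld(x, 0.5) + 4
# the shift the original code intends is +2 *after* the division:
@assert Int(fld(x, 0.5)) + 2 == 3
```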


Indeed, thank you, makes total sense. Yes, I want to add 2 after the division as you mention.

With floating division:

With initial implementation:

I thought it would be fun for you to see what you have helped speed up, @PeterSimon and @matthias314. This is for my package `SPHExample` on GitHub.

Which is a nice speed-up to get for free, with no downside!

In simulations with more particles, this function is under more stress, and the gain is even more evident.

Kind regards and thank you


If this form gives a valid result, it appears to be faster still.

``````function BatchExtractCells4!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        # parentheses needed: without them, 2 + x > CutOff parses as (2 + x) > CutOff
        Cells[i] = CartesianIndex(2 + (Points[i][1] > CutOff), 2 + (Points[i][2] > CutOff))
    end
end
``````

This is faster to write:

``````@. CartesianIndex(2 + (first(Points) > 0.5), 2 + (last(Points) > 0.5))
``````
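The idea only holds in this benchmark's special case, where coordinates lie in [0, 1) and `CutOff = 0.5`: the floored quotient is then just 0 or 1, which the `>` comparison yields directly as a `Bool` that promotes to `Int` under addition:

```julia
# For x in [0, 1) with cutoff 0.5, the comparison replaces the division:
for x in (0.3, 0.7)
    @assert 2 + (x > 0.5) == floor(Int, x / 0.5) + 2  # Bool promotes to 0 or 1
end
```

(At exactly `x == 0.5` the two forms differ, though `rand()` essentially never produces that value.)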

Hi!

Unfortunately it does not, since no division by `CutOff` occurs

Kind regards

But your point about using a fixed value of `0.5` seems to matter. I did a test-script here:

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / CutOff)...)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 4 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    benchmark_result = @b BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @b BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsC, Points, CutOff)

    @test CellsA == CellsB == CellsC
end
``````

With results:

``````BatchExtractCellsFloorOriginal!
79.700 μs
BatchExtractCellsFloorVal!
79.700 μs
BatchExtractCellsFloorFixed!
61.700 μs
Test Passed
``````

Quite interesting. I am not sure why a hard-coded value is better, when the compiler should know the same fixed value at compile time, especially when using `Val`.
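One verifiable piece of the puzzle (the rest is an assumption; I have not inspected the generated code): 0.5 is a power of two, so dividing by it is *exactly* a multiplication by 2.0, a strength reduction the compiler can safely apply to a hard-coded divisor. Whether it applies the same rewrite through the `Val` path is unclear:

```julia
# Division by a power of two is exact and bit-identical to multiplying
# by the reciprocal; for a general cutoff this rewrite would not be exact:
for x in (0.1, 0.7, 1.0e10, -3.25)
    @assert x / 0.5 === x * 2.0
end
```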

Kind regards

If even a small improvement is useful to you, test this way too.

``````function BatchExtractCells5!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        OffCut = 1 / CutOff
        t = map(Tuple(Points[i])) do x
            # body reconstructed from the discussion below: multiply by the
            # precomputed reciprocal and fold the +2 into a single muladd
            floor(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````
``````function BatchExtractCellsTrunc!(Cells, Points, CutOff)
    for i ∈ eachindex(Cells)
        OffCut = 1 / CutOff
        t = map(Tuple(Points[i])) do x
            # same as above, but with trunc instead of floor
            trunc(Int, muladd(x, OffCut, 2))
        end
        Cells[i] = CartesianIndex(t)
    end
end
``````

However, I am curious to understand the real context in more detail, with respect to the domain of the `Points` vector and the `CutOff` value.
I am inclined to say that if `CutOff = mean(first.(Points))` (or `last.(Points)`), you could use the `<` comparison instead of the `floor` function.
Another observation concerns the shift of 2. If this is known/fixed, couldn’t we predefine/preload `Cells` with this value and avoid having to add it?
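The preloading idea could look like this sketch (hypothetical; `N` and the commented loop body are illustrative, and the array would need re-filling with the offset, not zeroing, before each pass):

```julia
N = 4
# Preload every cell with the constant (2, 2) offset...
Cells = fill(CartesianIndex(2, 2), N)
# ...so a later loop would only add the un-shifted index, e.g.
# Cells[i] += CartesianIndex(Int.(fld.(Tuple(Points[i]), CutOff)))
```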


I crave performance. If you can make it even faster, please do so! I tested your solution with some small modifications to enable multi-threading.

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
    end
end

function BatchExtractCellsFloorMap!(Cells, Points, ::Val{CutOff}) where CutOff
    function map_floor(x)
        floor(Int, x / CutOff) + 2
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        # multiply by the precomputed reciprocal and fold the +2 into the muladd
        floor(Int, muladd(x, InverseCutOff, 2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
    @batch per=thread for i ∈ eachindex(Cells)
        Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / 0.5)...)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1 / CutOff

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorVal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorMap!($Cells, $Points, $(Val(CutOff)))
    println("BatchExtractCellsFloorMap!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $(Val(InverseCutOff)))
    println("BatchExtractCellsMuladd!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
    println("BatchExtractCellsFloorFixed!"); display(benchmark_result)

    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsD = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
    BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
    BatchExtractCellsFloorMap!(CellsC, Points, Val(CutOff))
    BatchExtractCellsFloorFixed!(CellsD, Points, CutOff)

    @test Cells0 == CellsA == CellsB == CellsC == CellsD
end
``````

The results as follows:

``````BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2750 samples with 1 evaluation.
Range (min … max):  1.121 ms … 28.540 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.394 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.813 ms ±  1.229 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▅▄▃▄▄▃▂▃▃▃▂▂▂▁▂▂▁▁                                        ▁
███████████████████████▇█▇▇▆▆▆▅▅▄▆▅▅▄▅▆▅▃▅▁▁▃▅▄▃▄▃▃▃▃▃▄▃▁▅ █
1.12 ms      Histogram: log(frequency) by time     6.78 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorOriginal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   79.600 μs …  27.497 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      81.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   157.044 μs ± 729.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▂▁▁                                                          ▁
█████▇▆▅▄▅▅▄▅▅▄▅▅▅▄▅▄▄▅▅▄▅▃▄▄▃▃▅▄▆▄▅▃▄▄▄▅▅▃▄▃▄▄▄▅▄▂▃▂▃▃▄▂▂▃▂▄ █
79.6 μs       Histogram: log(frequency) by time       1.55 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorVal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   79.700 μs …   9.897 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      81.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   127.487 μs ± 297.875 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁ ▁                                                          ▁
█████▇▇▆▆▆▄▄▅▄▅▅▄▅▅▅▄▄▄▄▄▄▄▃▅▄▅▄▂▄▄▄▄▄▄▄▄▄▅▄▄▄▄▅▄▄▄▄▄▃▃▃▄▃▂▃▄ █
79.7 μs       Histogram: log(frequency) by time       1.29 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorMap!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  55.400 μs …  13.771 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     57.200 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   91.524 μs ± 287.055 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁▁                                                          ▁
██████▇▇▆▇▆▆▅▅▄▄▄▅▄▄▅▄▅▅▆▅▅▆▄▄▄▄▄▄▅▁▄▅▅▄▁▅▄▄▄▄▄▅▄▅▄▄▄▄▄▄▄▄▅▄ █
55.4 μs       Histogram: log(frequency) by time       949 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  51.500 μs …  15.129 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     53.300 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   82.511 μs ± 277.106 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                                                            ▁
████▇█▇▇▆▅▅▅▅▅▅▄▅▃▄▄▄▅▃▄▄▄▄▄▄▄▃▃▂▄▄▄▃▄▃▃▃▃▄▄▃▄▄▄▄▄▃▃▂▅▃▃▃▃▄▄ █
51.5 μs       Histogram: log(frequency) by time       856 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorFixed!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):   61.600 μs …  13.533 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):      63.400 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   101.151 μs ± 348.971 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁▁▁                                                          ▁
███████▇▆▆▆▆▆▆▅▆▅▅▄▄▅▄▄▅▆▅▅▄▄▄▃▄▁▃▄▄▄▄▄▄▅▅▄▁▄▆▅▃▃▄▄▄▁▃▁▄▄▆▄▄▄ █
61.6 μs       Histogram: log(frequency) by time        913 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed
``````

Where, at `51.5 μs`, your solution is winning!

As I was writing this post, you added the `trunc` solution. Changing `floor` in `BatchExtractCellsMuladd!` to `trunc` now gives:

``````BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  39.400 μs …  15.730 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     40.800 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   67.832 μs ± 303.542 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▁                                                           ▁
███▇▇█▇▇▆▆▆▅▁▃▄▄▃▄▃▃▃▄▁▄▄▁▁▃▄▃▄▃▄▄▃▁▁▄▃▄▁▁▃▃▁▄▃▄▃▄▁▅▃▄▁▄▁▃▃▄ █
39.4 μs       Histogram: log(frequency) by time       791 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

Which means you have taken the lead again (from yourself) with `39.4 μs`!

Amazing stuff!

By the way, I went back to `+2` since then it matches my original implementation. My mind is telling me it should be `+4` though…

The real context of this is to grid particles into cell blocks of a fixed size (CutOff). This is to prepare them for being looped through via a stencil operation to calculate particle properties.

Regarding the shift of 2: would initializing the array with 2s not have the same performance as adding 2 afterwards?

EDIT: I advise using `@benchmark` for this; `@b` fluctuates too much.

Kind regards

I think the addition of 2 inside `muladd` costs very little.
As a conceptual matter, I was wondering whether, given that `Cells` must(?) be predefined with values of 0 and then have 2 added, it would not be appropriate to define it in advance with initial values of 2.
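For reference, `muladd(x, a, b)` computes `x*a + b` and may fuse it into a single FMA instruction, which is why carrying the `+2` inside it is nearly free. A quick check that it matches the floor-then-add form (assuming the reciprocal trick with `CutOff = 0.5`):

```julia
x, inv_cutoff = 0.7, 2.0  # inv_cutoff = 1 / CutOff
fused = floor(Int, muladd(x, inv_cutoff, 2))  # one fused multiply-add
plain = floor(Int, x / 0.5) + 2               # divide, floor, then shift
```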

Got it. Yes, conceptually, it would be fine to allocate `2` and then add the calculation on top. I prefer having it as `0` initially, because I need to reset the vector each time step for my actual use case, and then it is cleaner in a “mental model” to reset it to `0` in my opinion.

Building on your solution, @rocco_sprmnt21, I found that `unsafe_trunc` produces even faster code, which is very nice.

Summary
``````using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Test

function BatchExtractCellsOriginal!(Cells, Points, CutOff)
    @batch per=thread for i ∈ eachindex(Cells)
        ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
        Cells[i] = ci + 2 * one(ci)
    end
end

function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
    function map_floor(x)
        # unsafe_trunc skips the range/NaN checks that trunc performs
        unsafe_trunc(Int, muladd(x, InverseCutOff, 2))
    end

    @batch per=thread for i ∈ eachindex(Cells)
        t = map(map_floor, Tuple(Points[i]))
        Cells[i] = CartesianIndex(t)
    end
end

# Number of points/cells to test
let
    N         = 100_000
    Dims      = 2
    FloatType = Float64
    CutOff    = 0.5
    InverseCutOff = 1 / CutOff
    ValInverseCutOff = Val(InverseCutOff)

    # Initialize Points with random SVector values
    Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]

    # Initialize Cells as an array of CartesianIndex{Dims}, dummy initialization
    Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]

    # Benchmarks
    benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
    println("BatchExtractCellsOriginal!"); display(benchmark_result)

    benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $ValInverseCutOff)
    println("BatchExtractCellsMuladd!"); display(benchmark_result)

    Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
    CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]

    BatchExtractCellsOriginal!(Cells0, Points, CutOff)
    BatchExtractCellsMuladd!(CellsA, Points, ValInverseCutOff)

    @test Cells0 == CellsA
end
``````
``````BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2817 samples with 1 evaluation.
Range (min … max):  1.117 ms … 32.984 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.258 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.770 ms ±  1.531 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▅▃▂▃▂▂▂▃▃▂▂▂▂▁▁                                           ▁
███████████████████▇█▇▇▆▆▇▆▅▅▃▄▅▅▅▄▅▅▆▆▅▄▅▃▆▃▃▅▃▁▁▁▅▄▃▁▁▄▄ █
1.12 ms      Histogram: log(frequency) by time      7.3 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  28.800 μs …  34.056 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     29.000 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   85.681 μs ± 832.679 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█                                                            ▁
███▇▇██▇▇▆▇▅▅▅▄▄▄▅▅▃▄▅▄▁▃▄▄▃▃▄▁▄▄▄▅▃▁▄▄▃▄▃▃▄▁▄▃▁▄▃▁▃▄▃▄▄▄▃▁▃ █
28.8 μs       Histogram: log(frequency) by time       955 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed
``````

The script above has the original implementation and the current fastest one. `unsafe_trunc` is fine to use, as long as I can guarantee that my `Points` are never `NaN` - which I can for the time being.

Any more ideas, anyone?

Kind regards

Note that `unsafe_trunc` will also cause problems for values too big to fit in an `Int`
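To make the failure mode concrete: `trunc(Int, x)` raises an `InexactError` when `x` does not fit in an `Int`, while `unsafe_trunc` silently returns an unspecified value:

```julia
big = 1.0e20  # well above typemax(Int64) ≈ 9.22e18
threw = try
    trunc(Int, big)
    false
catch err
    err isa InexactError  # trunc detects the overflow
end
# unsafe_trunc does not error; its result for out-of-range input is
# unspecified (platform-dependent), so no particular value is asserted:
garbage = unsafe_trunc(Int, big)
```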


You could try doing all the value calculations as StaticArray elements and then interpreting the tuples as Cartesian indices at the end
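One way to read that suggestion (a sketch under assumptions, not benchmarked; plain tuples stand in for `SVector` elements here, since both share the `NTuple{N, Int}` memory layout of `CartesianIndex{N}`):

```julia
pts = [(0.3, 0.7), (0.9, 0.1)]
cutoff = 0.5
# compute the shifted integer indices as plain tuples...
idx = [map(x -> floor(Int, x / cutoff) + 2, p) for p in pts]
# ...then view the same memory as CartesianIndex{2} without copying:
cells = reinterpret(CartesianIndex{2}, idx)
```

`reinterpret` works here because both element types are isbits with identical size, so no conversion loop is needed at all.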

@Oscar_Smith, thanks for highlighting that; I just checked, and I should realistically never get there.

@rocco_sprmnt21 gave it a go, but was not able to speed the code up further, either by using StaticArrays or tuples, etc.

I think this is the fastest possible for now, then?