I crave performance. If you can make it even faster, please do so! I tested your solution with some small modifications to enable multi-threading.
Summary
using StaticArrays
using Chairmarks
using BenchmarkTools
using Polyester
using Base.Threads
using Test
function BatchExtractCellsOriginal!(Cells, Points, CutOff)
@batch per=thread for i ∈ eachindex(Cells)
ci = CartesianIndex(@. Int(fld(Points[i], CutOff))...)
Cells[i] = ci + 2 * one(ci)
end
end
function BatchExtractCellsFloorOriginal!(Cells, Points, CutOff)
@batch per=thread for i ∈ eachindex(Cells)
Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
end
end
function BatchExtractCellsFloorVal!(Cells, Points, ::Val{CutOff}) where CutOff
@batch per=thread for i ∈ eachindex(Cells)
Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / CutOff)...)
end
end
function BatchExtractCellsFloorMap!(Cells, Points, ::Val{CutOff}) where CutOff
function map_floor(x)
floor(Int, x / CutOff) + 2
end
@batch per=thread for i ∈ eachindex(Cells)
t = map(map_floor, Tuple(Points[i]))
Cells[i] = CartesianIndex(t)
end
end
function BatchExtractCellsMuladd!(Cells, Points, ::Val{InverseCutOff}) where InverseCutOff
function map_floor(x)
floor(Int, muladd(x,InverseCutOff,2))
end
@batch per=thread for i ∈ eachindex(Cells)
t = map(map_floor, Tuple(Points[i]))
Cells[i] = CartesianIndex(t)
end
end
# CutOff fixed to 0.5
function BatchExtractCellsFloorFixed!(Cells, Points, _)
@batch per=thread for i ∈ eachindex(Cells)
Cells[i] = CartesianIndex(@. floor(Int, 2 + Points[i] / 0.5)...)
end
end
# Number of points/cells to test
let
N = 100_000
Dims = 2
FloatType = Float64
CutOff = 0.5
InverseCutOff = 1/CutOff
# Initialize Points with random SVector values
Points = [SVector{Dims, FloatType}(rand(Dims)...) for _ = 1:N]
# Initialize Cells as an array of CartesianIndex{3}, dummy initialization
Cells = [zero(CartesianIndex{Dims}) for _ = 1:N]
# Benchmarks
benchmark_result = @benchmark BatchExtractCellsOriginal!($Cells, $Points, $CutOff)
println("BatchExtractCellsOriginal!"); display(benchmark_result)
benchmark_result = @benchmark BatchExtractCellsFloorOriginal!($Cells, $Points, $CutOff)
println("BatchExtractCellsFloorOriginal!"); display(benchmark_result)
benchmark_result = @benchmark BatchExtractCellsFloorVal!($Cells, $Points, $(Val(CutOff)))
println("BatchExtractCellsFloorVal!"); display(benchmark_result)
benchmark_result = @benchmark BatchExtractCellsFloorMap!($Cells, $Points, $(Val(CutOff)))
println("BatchExtractCellsFloorMap!"); display(benchmark_result)
benchmark_result = @benchmark BatchExtractCellsMuladd!($Cells, $Points, $(Val(InverseCutOff)))
println("BatchExtractCellsMuladd!"); display(benchmark_result)
benchmark_result = @benchmark BatchExtractCellsFloorFixed!($Cells, $Points, $CutOff)
println("BatchExtractCellsFloorFixed!"); display(benchmark_result)
Cells0 = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsA = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsB = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsC = [zero(CartesianIndex{Dims}) for _ = 1:N]
CellsD = [zero(CartesianIndex{Dims}) for _ = 1:N]
BatchExtractCellsOriginal!(Cells0, Points, CutOff)
BatchExtractCellsFloorOriginal!(CellsA, Points, CutOff)
BatchExtractCellsFloorVal!(CellsB, Points, Val(CutOff))
BatchExtractCellsFloorMap!(CellsC, Points, Val(CutOff))
BatchExtractCellsFloorFixed!(CellsD, Points, CutOff)
@test Cells0 == CellsA == CellsB == CellsC == CellsD
end
The results as follows:
BatchExtractCellsOriginal!
BenchmarkTools.Trial: 2750 samples with 1 evaluation.
Range (min … max): 1.121 ms … 28.540 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.394 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.813 ms ± 1.229 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▄▃▄▄▃▂▃▃▃▂▂▂▁▂▂▁▁ ▁
███████████████████████▇█▇▇▆▆▆▅▅▄▆▅▅▄▅▆▅▃▅▁▁▃▅▄▃▄▃▃▃▃▃▄▃▁▅ █
1.12 ms Histogram: log(frequency) by time 6.78 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorOriginal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 79.600 μs … 27.497 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 81.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 157.044 μs ± 729.045 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▁▁ ▁
█████▇▆▅▄▅▅▄▅▅▄▅▅▅▄▅▄▄▅▅▄▅▃▄▄▃▃▅▄▆▄▅▃▄▄▄▅▅▃▄▃▄▄▄▅▄▂▃▂▃▃▄▂▂▃▂▄ █
79.6 μs Histogram: log(frequency) by time 1.55 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorVal!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 79.700 μs … 9.897 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 81.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 127.487 μs ± 297.875 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁ ▁
█████▇▇▆▆▆▄▄▅▄▅▅▄▅▅▅▄▄▄▄▄▄▄▃▅▄▅▄▂▄▄▄▄▄▄▄▄▄▅▄▄▄▄▅▄▄▄▄▄▃▃▃▄▃▂▃▄ █
79.7 μs Histogram: log(frequency) by time 1.29 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorMap!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 55.400 μs … 13.771 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 57.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 91.524 μs ± 287.055 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁▁ ▁
██████▇▇▆▇▆▆▅▅▄▄▄▅▄▄▅▄▅▅▆▅▅▆▄▄▄▄▄▄▅▁▄▅▅▄▁▅▄▄▄▄▄▅▄▅▄▄▄▄▄▄▄▄▅▄ █
55.4 μs Histogram: log(frequency) by time 949 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 51.500 μs … 15.129 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 53.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 82.511 μs ± 277.106 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
████▇█▇▇▆▅▅▅▅▅▅▄▅▃▄▄▄▅▃▄▄▄▄▄▄▄▃▃▂▄▄▄▃▄▃▃▃▃▄▄▃▄▄▄▄▄▃▃▂▅▃▃▃▃▄▄ █
51.5 μs Histogram: log(frequency) by time 856 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
BatchExtractCellsFloorFixed!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 61.600 μs … 13.533 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 63.400 μs ┊ GC (median): 0.00%
Time (mean ± σ): 101.151 μs ± 348.971 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁▁▁ ▁
███████▇▆▆▆▆▆▆▅▆▅▅▄▄▅▄▄▅▆▅▅▄▄▄▃▄▁▃▄▄▄▄▄▄▅▅▄▁▄▆▅▃▃▄▄▄▁▃▁▄▄▆▄▄▄ █
61.6 μs Histogram: log(frequency) by time 913 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Test Passed
Where with 51.5 mus
your solution is winning!
As I was writing this post, you added the trunc
solution. Changing floor
in BatchExtractCellsMuladd!
to trunc
now gives:
BatchExtractCellsMuladd!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 39.400 μs … 15.730 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 40.800 μs ┊ GC (median): 0.00%
Time (mean ± σ): 67.832 μs ± 303.542 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁
███▇▇█▇▇▆▆▆▅▁▃▄▄▃▄▃▃▃▄▁▄▄▁▁▃▄▃▄▃▄▄▃▁▁▄▃▄▁▁▃▃▁▄▃▄▃▄▁▅▃▄▁▄▁▃▃▄ █
39.4 μs Histogram: log(frequency) by time 791 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Which means you have taken the lead again (from your self) with 39.4 mus
!
Amazing stuff!
By the way, I went back to +2
since then it matches my original implementation. My mind is telling me it should be +4
though…
The real context of this is to grid particles into cell blocks of a fixed size (CutOff). This is to prepare them for being looped through via a stencil operation to calculate particle properties.
In regards to the shift of 2, would allocating it as 2’s not have the same performance as adding 2 afterwards?
EDIT: Advise to use @benchmark
for this @b
fluctuates too much.
Kind regards