I have a function that takes in a line segment. I wanted to benchmark the performance and see if there is a difference whether the line segment is specified with endpoints given in integers or floats. The function has an argument `g`, which could be `identity`, `Float32`, etc. Compared to baking `g` into the function, benchmarking reveals the generic approach to be a lot slower. This is unexpected because:

- Each function is its own type, and the compiler should specialize the method for `identity` or `Float32` (see the sketch after this list).
- `@code_llvm` reveals the functions to be identical (so indeed the compiler does specialize).
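To spell out the first bullet, here is a minimal sketch (my own toy example, separate from the code below) of why I expect specialization; `apply_g` is a hypothetical helper for illustration only:

```julia
# each function has its own concrete (singleton) type:
@assert typeof(identity) !== typeof(abs)
@assert isconcretetype(typeof(identity))

# so a higher-order function can, in principle, be compiled once per `g`:
apply_g(g, x) = g(x)

apply_g(identity, 1)  # instance specialized on typeof(identity)
apply_g(Float32, 1)   # separate instance for Type{Float32}
```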
Here are the full details, including a simpler "MWE" that does not demonstrate this surprising behavior.
using BenchmarkTools
using UnPack: @unpack

struct LineSegment{T}
    point1::T
    point2::T
end

# a complicated iterator:
struct LineSegmentGridIterator2{TP, TVI, TVIF, TVF}
    v_1::TP
    v_N::TP
    d::TVI
    m::TVIF
    s_1::TVF
end
"""
This iterates the voxels that are intersected by `l`. A voxel is named by an integer tuple `v`, representing the coordinates of the center. The voxel's extents are `v .- 0.5` to ``v .+ 0.5`
This goes by names such as the grid supercover, the outer Jordan digitization, and the conservative rasterization of the line segment.
"""
function LineSegmentGridIterator2(l::LineSegment)
    p1 = l.point1
    p2 = l.point2
    v_1 = round.(Int, p1)
    v_N = round.(Int, p2)
    d = Int8.(sign.(p2 .- p1))
    m = abs.(p2 .- p1)
    # delta_t = 1 ./ m
    # use `0.5f0` so as not to promote `Float32` to `Float64`
    min_ = v_1 .- 0.5f0
    max_ = v_1 .+ 0.5f0
    s = ifelse.(p1 .> p2, p1 .- min_, max_ .- p1)
    LineSegmentGridIterator2(v_1, v_N, d, m, s)
end
Base.IteratorSize(::Type{<:LineSegmentGridIterator2}) = Base.SizeUnknown()
Base.eltype(::Type{LineSegmentGridIterator2{TP, TVI, TVIF, TVF}}) where {TP, TVI, TVIF, TVF} = TP
function Base.iterate(l::LineSegmentGridIterator2, state=nothing)
    @unpack v_1, v_N, d, m, s_1 = l
    if state === nothing
        state = (; s=s_1, v_i=v_1, break_=false)
    end
    @unpack s, v_i, break_ = state
    break_ && return nothing
    t = s ./ m
    incs = minimum(t) .== t
    any(incs .& (v_i .== v_N)) && (break_ = true)
    # if `break_`, we could avoid computing these, but do it anyway:
    v_i′ = v_i .+ incs .* d
    s = s .+ incs
    new_state = (; s, v_i=v_i′, break_)
    output = v_i
    return (output, new_state)
end
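A small usage sketch (mine). I traced the iteration by hand, so treat the expected output as my reading rather than a verified claim:

```julia
# collect the supercover voxels of a short segment;
# by my hand-trace this yields [(0, 0), (0, 1), (1, 2), (1, 3)]
l = LineSegment((0, 0), (1, 3))
collect(LineSegmentGridIterator2(l))

# the same segment with Float32 endpoints should yield the same voxels:
l32 = LineSegment(Float32.((0, 0)), Float32.((1, 3)))
collect(LineSegmentGridIterator2(l32))
```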
# a simple iterator: an attempted MWE, which does not exhibit the behavior
struct FooItr{T}
    x::T
end

Base.IteratorSize(::Type{<:FooItr}) = Base.SizeUnknown()
Base.eltype(::Type{FooItr{T}}) where {T} = T

function Base.iterate(itr::FooItr, state=nothing)
    if state === nothing
        state = zero.(itr.x)
    end
    all(state .> itr.x) && return nothing
    state′ = if state[1] <= itr.x[1]
        state .+ (1, 0)
    else
        state .+ (0, 1)
    end
    output = state
    return (output, state′)
end
FooItr(l::LineSegment) = FooItr(l.point2)
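And a matching sketch for the simple iterator (again mine; the expected output is a hand trace, not a verified claim):

```julia
# FooItr sweeps x up first, then y; by my hand-trace this yields
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
collect(FooItr((2, 2)))
```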
# e.g. benchmark1(LineSegmentGridIterator2, identity, 2000)
function benchmark1(Itr, g, N)
    s = g.((0, 0))
    for i = 1:N
        line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
# e.g. benchmark2(LineSegmentGridIterator2, 2000)
function benchmark2(Itr, N)
    s = (0, 0)
    for i = 1:N
        line = LineSegment((0, 0), (1, 1 + 2*i))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
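One more variant that might be worth benchmarking alongside these (my addition, not something I have confirmed as relevant): the performance-tips section "Be aware of when Julia avoids specializing" notes that specialization on a `Function` argument can be forced with an explicit type parameter, so a third benchmark could separate "the heuristics skipped runtime specialization" from other explanations.

```julia
# identical to benchmark1, except `g::G where {G}` forces specialization
# on the function argument (a hedged experiment, not a confirmed fix)
function benchmark1_forced(Itr, g::G, N) where {G}
    s = g.((0, 0))
    for i = 1:N
        line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
```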
(After compiling; `@benchmark` shows the same story, but I am using `@time` here for succinct output.)
julia> # these benchmark the same

julia> @time benchmark1(FooItr, identity, 200)
  0.000083 seconds
(81400, 5433900)

julia> @time benchmark2(FooItr, 200)
  0.000055 seconds
(81400, 5433900)

julia> # these are very different! even though @code_llvm is identical

julia> @time benchmark1(LineSegmentGridIterator2, identity, 200)
  0.030239 seconds (245.60 k allocations: 10.605 MiB, 28.06% gc time)
(20300, 5433900)

julia> @time benchmark2(LineSegmentGridIterator2, 200)
  0.000170 seconds
(20300, 5433900)
The bizarre thing is that `@code_llvm benchmark1(LineSegmentGridIterator2, identity, 200)` and `@code_llvm benchmark2(LineSegmentGridIterator2, 200)` produce identical output (up to line numbers and names). Note also that whatever is happening does not happen with `FooItr` to the same extent, though benchmarking does reveal a detectable difference:
julia> @benchmark benchmark1(FooItr, identity, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.988 μs … 239.415 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.715 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.313 μs ±  10.752 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted: 51 μs to 109 μs, log(frequency) by time]
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark benchmark2(FooItr, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.625 μs … 184.335 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.994 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.544 μs ±   8.510 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted: 37.6 μs to 85.3 μs, log(frequency) by time]
Memory estimate: 0 bytes, allocs estimate: 0.
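For anyone reproducing this, the same comparison can also be taken with `@btime` and interpolated arguments, to rule out non-constant-global overhead in the measurement (measurement hygiene only, not a fix; per the note above, `@benchmark` tells the same story):

```julia
using BenchmarkTools

# `$` interpolation makes BenchmarkTools treat the arguments as constants
@btime benchmark1($LineSegmentGridIterator2, $identity, 200)
@btime benchmark2($LineSegmentGridIterator2, 200)
```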