Different behavior for same `@code_llvm` for voxel traversal

I have a function that takes a line segment. I wanted to benchmark it and see whether performance differs when the segment's endpoints are given as integers or as floats. The function has an argument g, which could be identity, Float32, etc. Compared to baking g into the function, benchmarking reveals the generic approach to be a lot slower. This is unexpected because

  1. Each function is its own type, and the compiler should specialize the method for identity or Float32
  2. @code_llvm reveals the functions to be identical (so indeed the compiler does specialize)
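For example, a quick REPL check illustrating point 1 (a minimal sketch using only Base functions):

julia> typeof(identity)
typeof(identity)

julia> typeof(abs)
typeof(abs)

julia> typeof(identity) == typeof(abs)
false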

Here are the full details, including a simpler “MWE” that does not demonstrate this surprising behavior.

using BenchmarkTools
using UnPack: @unpack

struct LineSegment{T}
  point1::T
  point2::T
end

# a complicated iterator:
struct LineSegmentGridIterator2{TP, TVI, TVIF, TVF}
  v_1::TP
  v_N::TP
  d::TVI
  m::TVIF
  s_1::TVF
end

"""
This iterates the voxels that are intersected by `l`. A voxel is named by an integer tuple `v`, representing the coordinates of its center; the voxel's extents are `v .- 0.5` to `v .+ 0.5`.
This goes by names such as the grid supercover, the outer Jordan digitization, and the conservative rasterization of the line segment.
"""
function LineSegmentGridIterator2(l::LineSegment)
  p1 = l.point1
  p2 = l.point2

  v_1 = round.(Int, p1)
  v_N = round.(Int, p2)

  d = Int8.(sign.(p2 .- p1))  # step direction along each axis
  m = abs.(p2 .- p1)          # per-axis extent of the segment
  # delta_t = 1 ./ m

  # `Float32` literals, so as not to promote `Float32` to `Float64`
  min_ = v_1 .- 0.5f0
  max_ = v_1 .+ 0.5f0

  # distance from `p1` to the first voxel-boundary crossing along each axis
  s = ifelse.(p1 .> p2, p1 .- min_, max_ .- p1)
  LineSegmentGridIterator2(
    v_1,
    v_N,
    d,
    m,
    s
  )
end

Base.IteratorSize(::Type{<:LineSegmentGridIterator2}) = Base.SizeUnknown()
# the struct has four type parameters; match on the first (the point type)
Base.eltype(::Type{<:LineSegmentGridIterator2{TP}}) where {TP} = TP

function Base.iterate(l::LineSegmentGridIterator2, state=nothing)
  @unpack v_1, v_N, d, m, s_1 = l

  if state === nothing
    state = (;s=s_1, v_i=v_1, break_=false)
  end

  @unpack s, v_i, break_ = state

  break_ && return nothing
  t = s ./ m                 # normalized distance to the next boundary crossing per axis
  incs = minimum(t) .== t    # step along the axis (or axes, on a tie) that crosses first
  any(incs .& (v_i .== v_N)) && (break_ = true)  # stop after emitting the final voxel

  # if `break_`, then we could avoid computing these, but do it anyway.
  v_i′ = v_i .+ incs .* d
  s = s .+ incs

  new_state = (;s, v_i=v_i′, break_)
  output = v_i

  return (output, new_state)
end
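
For reference, a small usage sketch of the iterator (hand-traced; the horizontal segment from (0.0, 0.0) to (2.0, 0.0) should cover exactly three voxels):

julia> collect(LineSegmentGridIterator2(LineSegment((0.0, 0.0), (2.0, 0.0))))
3-element Vector{Tuple{Int64, Int64}}:
 (0, 0)
 (1, 0)
 (2, 0)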


# a simple iterator; an attempted MWE that does not exhibit the surprising behavior
struct FooItr{T}
    x::T
end

Base.IteratorSize(::Type{<:FooItr}) = Base.SizeUnknown()
Base.eltype(::Type{FooItr{T}}) where {T} = T
function Base.iterate(itr::FooItr, state=nothing)
    if state === nothing
        state = zero.(itr.x)
    end
    all(state .> itr.x) && return nothing
    state′ = if state[1] <= itr.x[1]
        state .+ (1, 0)
    else
        state .+ (0, 1)
    end
    output = state
    return (output, state′)
end

FooItr(l::LineSegment) = FooItr(l.point2)
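
And, for comparison, what the simple iterator generates (again hand-traced, just to show its output):

julia> collect(FooItr((1, 1)))
4-element Vector{Tuple{Int64, Int64}}:
 (0, 0)
 (1, 0)
 (2, 0)
 (2, 1)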

# e.g. benchmark1(LineSegmentGridIterator2, identity, 2000)
function benchmark1(Itr, g, N)
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

# e.g. benchmark2(LineSegmentGridIterator2, 2000)
function benchmark2(Itr, N)
  s = (0, 0)
  for i = 1:N
    line = LineSegment((0, 0), (1, 1 + 2*i))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

(After compilation; @benchmark tells the same story, but @time is used here for succinct output.)

julia> # these benchmark the same
       @time benchmark1(FooItr, identity, 200)
  0.000083 seconds
(81400, 5433900)

julia> @time benchmark2(FooItr, 200)
  0.000055 seconds
(81400, 5433900)

julia> # these are very different! even though @code_llvm is identical
       @time benchmark1(LineSegmentGridIterator2, identity, 200)

  0.030239 seconds (245.60 k allocations: 10.605 MiB, 28.06% gc time)
(20300, 5433900)

julia> @time benchmark2(LineSegmentGridIterator2, 200)
  0.000170 seconds
(20300, 5433900)

The bizarre thing is that @code_llvm benchmark1(LineSegmentGridIterator2, identity, 200) and @code_llvm benchmark2(LineSegmentGridIterator2, 200) produce identical output (up to line numbers and names). Note also that whatever is happening does not affect FooItr to the same extent, though benchmarking does reveal a detectable difference:

julia> @benchmark benchmark1(FooItr, identity, 200)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.988 μs … 239.415 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.715 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.313 μs ±  10.752 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅ █ ▇▃   ▅      ▁                                            ▁
  █▄█▆██▇▇▄██▄█▅█▅█▇▇█▄█▇▆▇▅▄█▄▄▆▆▆▅▅▇▆▅▆▆▅▆▅▅▅▅▅▃▅▆▅▆▃▄▅▆▅▄▄▄ █
  51 μs         Histogram: log(frequency) by time       109 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark benchmark2(FooItr, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.625 μs … 184.335 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.994 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.544 μs ±   8.510 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄ █▆▆▂  ▅          ▁  ▁                                      ▁
  █▆█████▅█▇▇▇▇█▃█▅█▆██▇█▅▇▆█▆▆▄▄▃▃▃▄▄▆▅▆▄▄▃▇▆▇▆▄▅▄▄▆▃▄▅▅▄▃▂▃▅ █
  37.6 μs       Histogram: log(frequency) by time      85.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
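
Aside: when comparing reflection output like this, the line-number comments can be stripped via the debuginfo option of @code_llvm (available on modern Julia), which makes an exact textual diff easier:

julia> @code_llvm debuginfo=:none benchmark1(LineSegmentGridIterator2, identity, 200)

julia> @code_llvm debuginfo=:none benchmark2(LineSegmentGridIterator2, 200)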

For identity, your benchmark should probably look like this:

function benchmark3(Itr, g, N)
  s = g((0, 0))
  for i = 1:N
    line = LineSegment(g((0, 0)), g((1, 1 + 2*i)))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

No, that’s not what I intended. g is just some way to convert data from Int to e.g. Float64.

Maybe it would be useful then to reflect your intention in the MWE?

Sure. I included more context. In any case, I expect it to be ultimately irrelevant whether you do identity(tuple) or identity.(tuple). The LLVM is different but the native code is the same for the following simple example:

julia> f(t) = identity(t)
f (generic function with 1 method)

julia> g(t) = identity.(t)
g (generic function with 1 method)

julia> @code_llvm f((1,1))
;  @ REPL[1]:1 within `f'
define void @julia_f_224([2 x i64]* noalias nocapture sret %0, [2 x i64]* nocapture nonnull readonly align 8 dereferenceable(16) %1) {
top:
  %2 = bitcast [2 x i64]* %0 to i8*
  %3 = bitcast [2 x i64]* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(16) %2, i8* nonnull align 8 dereferenceable(16) %3, i64 16, i1 false)
  ret void
}

julia> @code_llvm g((1,1))
;  @ REPL[2]:1 within `g'
define void @julia_g_252([2 x i64]* noalias nocapture sret %0, [2 x i64]* nocapture nonnull readonly align 8 dereferenceable(16) %1) {
top:
; ┌ @ broadcast.jl:883 within `materialize'
; │┌ @ broadcast.jl:1098 within `copy'
; ││┌ @ ntuple.jl:49 within `ntuple'
     %2 = bitcast [2 x i64]* %1 to <2 x i64>*
     %3 = load <2 x i64>, <2 x i64>* %2, align 8
; └└└
  %4 = bitcast [2 x i64]* %0 to <2 x i64>*
  store <2 x i64> %3, <2 x i64>* %4, align 8
  ret void
}

julia> @code_native f((1,1))
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[1]:1 within `f'
	movq	%rdi, %rax
	vmovups	(%rsi), %xmm0
	vmovups	%xmm0, (%rdi)
	retq
	nopl	(%rax)
; └

julia> @code_native g((1,1))
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ REPL[2]:1 within `g'
	movq	%rdi, %rax
; │┌ @ broadcast.jl:883 within `materialize'
; ││┌ @ broadcast.jl:1098 within `copy'
; │││┌ @ ntuple.jl:49 within `ntuple'
	vmovups	(%rsi), %xmm0
; │└└└
	vmovups	%xmm0, (%rdi)
	retq
	nopl	(%rax)
; └

Thanks.

The code I posted is allocation-free. I will check the updated MWE later.

It doesn’t seem so. I still don’t see an application of a function other than identity to the iterator. What am I missing?

I don’t understand. Did you notice the definitions for f and g in the most recent snippet?

The identity vs. identity. case I already handled, no? They don’t behave identically in Julia 1.6.3. Do you have another function to apply?

Let’s back up. I posted some code. You replied “your benchmark probably should look like […]”. I responded that that is not what I intended. I also explained a sense in which it’s irrelevant whether I meant my original code or what you thought it should be.

Let’s return to the original question: why is benchmark1 slower than benchmark2, and why is the gap so much larger for LineSegmentGridIterator2?

From my understanding, it’s because it broadcasts instead of applying. So it depends on the function given to the iterator? If you want it to work independently of the function given to the iterator, we’ll probably have to wait for others to step in.

To be clear, using

julia> f(t) = identity(t)
f (generic function with 1 method)

julia> g(t) = identity.(t)
g (generic function with 1 method)

You are calling f the function that does “applying” and g the function that does “broadcasting”. In that isolated example, I showed that there is no native-code difference between f and g when applied to Tuple{Int, Int} (and I conjecture there is none for any Tuple generally). So there is no “it does this instead of that”: they are doing the same thing in that isolated example.
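
(If one wanted to check that conjecture mechanically rather than by eyeballing, something like the following sketch works; the Tuple{Float64, Float64} signature is just an example of mine, and the two dumps still differ in their julia_f_… / julia_g_… symbol names, so this is illustrative rather than a rigorous proof:)

julia> using InteractiveUtils  # loaded by default in the REPL

julia> code_native(f, Tuple{Tuple{Float64, Float64}}; debuginfo=:none)

julia> code_native(g, Tuple{Tuple{Float64, Float64}}; debuginfo=:none)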

But, if I’m understanding you correctly, that’s not what I’m asking.

If instead of

function benchmark1(Itr, g, N)
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

benchmark1(LineSegmentGridIterator2, identity, 2000)

I use

function benchmark4(Itr, N)
  g = identity   # hardcode the `g` argument
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

benchmark4(LineSegmentGridIterator2, 2000)

I see the same performance discrepancy between benchmark1 and benchmark4, even though both are doing broadcasting.

I hope that finally clears things up…


This seems to work on 1.7.0.

What about

function benchmark1(Itr, g::G, N) where {G <: Function}
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in Itr(line)
      s = s .+ v_i
    end
  end
  return s
end

then?

The <: Function annotation is unnecessary (and, personally, I consider it an anti-pattern in Julia, since values of any type can be callable), but adding the type parameter G does force specialization for the function argument, which I had forgotten about. Thanks. I still don’t know why @code_llvm is misleading, but it’s not the first time I’ve run into that.

function benchmark1(f, g, N)
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in f(line)
      s = s .+ v_i
    end
  end
  return s
end

function benchmark5(f, g::G, N) where {G}
  s = g.((0, 0))
  for i = 1:N
    line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
    for v_i in f(line)
      s = s .+ v_i
    end
  end
  return s
end
julia> @benchmark benchmark1(LineSegmentGridIterator2, Float64, 200)
BenchmarkTools.Trial: 220 samples with 1 evaluation.
 Range (min … max):  19.685 ms … 29.983 ms  ┊ GC (min … max): 0.00% … 20.93%
 Time  (median):     21.542 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.747 ms ±  2.724 ms  ┊ GC (mean ± σ):  5.35% ±  9.10%

          ▇█▆▅▅
  ▃▄▃▅▄▄▅███████▃▅▃▃▄▃▄▁▃▁▃▁▁▁▁▃▁▃▁▁▃▁▁▃▁▁▁▃▃▁▃▃▃▃▄▃▅▅▄▄▃▄▄▁▃ ▃
  19.7 ms         Histogram: frequency by time        29.3 ms <

 Memory estimate: 10.62 MiB, allocs estimate: 246406.

julia> @benchmark benchmark5(LineSegmentGridIterator2, Float64, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  336.884 μs … 949.680 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     376.127 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   391.940 μs ±  44.942 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▆ █▅▂▆▂▂▂▁█▅▄▃▃▂▃▂▂▂▂▂▂▁▁▁▁  ▁                            ▂
  ▇▄▁▁██████████████████████████████████▇▇█▇███▇▆▇▇▆▇▇▆▇▅▅▅▆▅▅▅ █
  337 μs        Histogram: log(frequency) by time        582 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Ugly trap:

Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the method internals if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question.

from Performance Tips · The Julia Language
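
For instance, a sketch of that check for the case above (.specializations is an undocumented internal, and its exact representation varies across Julia versions):

julia> m = @which benchmark1(LineSegmentGridIterator2, identity, 200);

julia> m.specializations  # lists the MethodInstances actually compiled for this method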

Obvious question: could we change that?
