I have a function that takes in a line segment. I wanted to benchmark the performance and see if there is a difference whether the line segment is specified with endpoints given in integers or floats. The function has an argument `g`, which could be `identity`, `Float32`, etc. Compared to baking `g` into the function, benchmarking reveals the generic approach to be a lot slower. This is unexpected because:

- Each function is its own type, and the compiler should specialize the method for `identity` or `Float32` (see the sketch after this list).
- `@code_llvm` reveals the functions to be identical (so indeed the compiler does specialize).
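To spell out the first bullet, here is a minimal sketch (my own toy example, separate from the code below) of why I expect specialization; `apply_g` is a hypothetical helper for illustration only:

```julia
# each function has its own concrete (singleton) type:
@assert typeof(identity) !== typeof(abs)
@assert isconcretetype(typeof(identity))

# so a higher-order function can, in principle, be compiled once per `g`:
apply_g(g, x) = g(x)

apply_g(identity, 1)  # instance specialized on typeof(identity)
apply_g(Float32, 1)   # separate instance for Type{Float32}
```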
Here are the full details, including a simpler "MWE" that does not demonstrate this surprising behavior.
using BenchmarkTools
using UnPack: @unpack

struct LineSegment{T}
    point1::T
    point2::T
end

# a complicated iterator:
struct LineSegmentGridIterator2{TP, TVI, TVIF, TVF}
    v_1::TP
    v_N::TP
    d::TVI
    m::TVIF
    s_1::TVF
end
"""
This iterates the voxels that are intersected by `l`. A voxel is named by an integer tuple `v`, representing the coordinates of the center. The voxel's extents are `v .- 0.5` to ``v .+ 0.5`
This goes by names such as the grid supercover, the outer Jordan digitization, and the conservative rasterization of the line segment.
"""
function LineSegmentGridIterator2(l::LineSegment)
    p1 = l.point1
    p2 = l.point2
    v_1 = round.(Int, p1)
    v_N = round.(Int, p2)
    d = Int8.(sign.(p2 .- p1))
    m = abs.(p2 .- p1)
    # delta_t = 1 ./ m
    # use `0.5f0` so as not to promote `Float32` to `Float64`
    min_ = v_1 .- 0.5f0
    max_ = v_1 .+ 0.5f0
    s = ifelse.(p1 .> p2, p1 .- min_, max_ .- p1)
    LineSegmentGridIterator2(v_1, v_N, d, m, s)
end
Base.IteratorSize(::Type{<:LineSegmentGridIterator2}) = Base.SizeUnknown()
Base.eltype(::Type{LineSegmentGridIterator2{TP, TVI, TVIF, TVF}}) where {TP, TVI, TVIF, TVF} = TP
function Base.iterate(l::LineSegmentGridIterator2, state=nothing)
    @unpack v_1, v_N, d, m, s_1 = l
    if state === nothing
        state = (; s=s_1, v_i=v_1, break_=false)
    end
    @unpack s, v_i, break_ = state
    break_ && return nothing
    t = s ./ m
    incs = minimum(t) .== t
    any(incs .& (v_i .== v_N)) && (break_ = true)
    # if `break_`, we could avoid computing these, but do it anyway:
    v_i′ = v_i .+ incs .* d
    s = s .+ incs
    new_state = (; s, v_i=v_i′, break_)
    output = v_i
    return (output, new_state)
end
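A small usage sketch (mine). I traced the iteration by hand, so treat the expected output as my reading rather than a verified claim:

```julia
# collect the supercover voxels of a short segment;
# by my hand-trace this yields [(0, 0), (0, 1), (1, 2), (1, 3)]
l = LineSegment((0, 0), (1, 3))
collect(LineSegmentGridIterator2(l))

# the same segment with Float32 endpoints should yield the same voxels:
l32 = LineSegment(Float32.((0, 0)), Float32.((1, 3)))
collect(LineSegmentGridIterator2(l32))
```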
# a simple iterator: an attempted MWE, which does not exhibit the behavior
struct FooItr{T}
    x::T
end

Base.IteratorSize(::Type{<:FooItr}) = Base.SizeUnknown()
Base.eltype(::Type{FooItr{T}}) where {T} = T

function Base.iterate(itr::FooItr, state=nothing)
    if state === nothing
        state = zero.(itr.x)
    end
    all(state .> itr.x) && return nothing
    state′ = if state[1] <= itr.x[1]
        state .+ (1, 0)
    else
        state .+ (0, 1)
    end
    output = state
    return (output, state′)
end
FooItr(l::LineSegment) = FooItr(l.point2)
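And a matching sketch for the simple iterator (again mine; the expected output is a hand trace, not a verified claim):

```julia
# FooItr sweeps x up first, then y; by my hand-trace this yields
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
collect(FooItr((2, 2)))
```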
# e.g. benchmark1(LineSegmentGridIterator2, identity, 2000)
function benchmark1(Itr, g, N)
    s = g.((0, 0))
    for i = 1:N
        line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
# e.g. benchmark2(LineSegmentGridIterator2, 2000)
function benchmark2(Itr, N)
    s = (0, 0)
    for i = 1:N
        line = LineSegment((0, 0), (1, 1 + 2*i))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
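One more variant that might be worth benchmarking alongside these (my addition, not something I have confirmed as relevant): the performance-tips section "Be aware of when Julia avoids specializing" notes that specialization on a `Function` argument can be forced with an explicit type parameter, so a third benchmark could separate "the heuristics skipped runtime specialization" from other explanations.

```julia
# identical to benchmark1, except `g::G where {G}` forces specialization
# on the function argument (a hedged experiment, not a confirmed fix)
function benchmark1_forced(Itr, g::G, N) where {G}
    s = g.((0, 0))
    for i = 1:N
        line = LineSegment(g.((0, 0)), g.((1, 1 + 2*i)))
        for v_i in Itr(line)
            s = s .+ v_i
        end
    end
    return s
end
```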
(After compiling; `@benchmark` shows the same story, but I am using `@time` here for succinct output.)
julia> # these benchmark the same

julia> @time benchmark1(FooItr, identity, 200)
  0.000083 seconds
(81400, 5433900)

julia> @time benchmark2(FooItr, 200)
  0.000055 seconds
(81400, 5433900)

julia> # these are very different! even though @code_llvm is identical

julia> @time benchmark1(LineSegmentGridIterator2, identity, 200)
  0.030239 seconds (245.60 k allocations: 10.605 MiB, 28.06% gc time)
(20300, 5433900)

julia> @time benchmark2(LineSegmentGridIterator2, 200)
  0.000170 seconds
(20300, 5433900)
The bizarre thing is that `@code_llvm benchmark1(LineSegmentGridIterator2, identity, 200)` and `@code_llvm benchmark2(LineSegmentGridIterator2, 200)` produce identical output (up to line numbers and names). Note also that whatever is happening does not happen with `FooItr` to the same extent, though benchmarking does reveal a detectable difference:
julia> @benchmark benchmark1(FooItr, identity, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.988 μs … 239.415 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.715 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.313 μs ±  10.752 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted: 51 μs to 109 μs, log(frequency) by time]
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark benchmark2(FooItr, 200)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.625 μs … 184.335 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.994 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.544 μs ±   8.510 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted: 37.6 μs to 85.3 μs, log(frequency) by time]
Memory estimate: 0 bytes, allocs estimate: 0.
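For anyone reproducing this, the same comparison can also be taken with `@btime` and interpolated arguments, to rule out non-constant-global overhead in the measurement (measurement hygiene only, not a fix; per the note above, `@benchmark` tells the same story):

```julia
using BenchmarkTools

# `$` interpolation makes BenchmarkTools treat the arguments as constants
@btime benchmark1($LineSegmentGridIterator2, $identity, 200)
@btime benchmark2($LineSegmentGridIterator2, 200)
```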