Bad performance from parametric struct dispatch

While trying to optimize the hot loop of my code, I ran into a very strange performance footgun that is slowing down a simple function by at least ~100ns, which is a 10x performance hit for my use case.

Question: Why do the two almost-identical functions below have such different performance?

Background: I am working with binary configurations over a small grid (e.g. 3x7 or 6x8). Configurations are encoded in the low bits of a UInt64 or similar. Since the grid size is known in advance, I use a parametric struct, grid = Grid{N1,N2}, to represent it, and I pass the grid to functions so they can specialize on it (since this is in the hot loop). For example: do_something(data, grid :: Type{Grid{N1,N2}}) where {N1,N2}. However, this seems to take ~100ns longer than do_something(data, ::Val{N1}, ::Val{N2}) where {N1,N2}. I would have thought that, since N1 and N2 are known at compile time, these approaches would be completely equivalent, but the second is almost 10x faster. What’s going on here? Is there a way to make something in the style of the first function fast? (Also, are there any other ways to speed my function up?)

Minimal Working Example:

struct Grid{N1,N2}
    #other data here
end

struct GridInd{G <: Grid}
    n1 :: Int8
    n2 :: Int8
    function GridInd{Grid{N1,N2}}(n1, n2) where {N1,N2}
        return new{Grid{N1,N2}}(mod(n1,N1),mod(n2,N2))
    end
end
#just for display
showconfig(config, gd :: Type{Grid{N1,N2}}) where {N1,N2} = reverse(reshape(digits(config,base=2,pad=N1*N2),N1,N2),dims=2)'

function momentum_v1(config, ::Type{Grid{N1,N2}}) where {N1,N2}
    K1, K2 = 0, 0
    while config != 0
        place = trailing_zeros(config) #place of last set 1
        k2, k1 = fldmod(place,N1)
        K1 += k1
        K2 += k2
        config = config & (config-1) #remove last set 1
    end
    return GridInd{Grid{N1,N2}}(K1,K2)
end

#only the function signature is different than v1
function momentum_v2(config, ::Val{N1}, ::Val{N2}) where {N1,N2}
    K1, K2 = 0, 0
    while config != 0
        place = trailing_zeros(config) #place of last set 1
        k2, k1 = fldmod(place,N1)
        K1 += k1
        K2 += k2
        config = config & (config-1) #remove last set 1
    end
    return GridInd{Grid{N1,N2}}(K1,K2)
end

grid = Grid{3,7}
config = UInt64(4325) #typical example configuration
display(showconfig(config,grid)) #show configuration for clarity

using BenchmarkTools
@benchmark momentum_v1($config,$grid)
v1, v2 = Val(3), Val(7)
@benchmark momentum_v2($config,$v1,$v2) #fixed interpolations

This yields:

#example configuration
7×3 adjoint(::Matrix{Int64}) with eltype Int64:
 0  0  0
 0  0  0
 1  0  0
 0  0  0
 1  1  0
 0  0  1
 1  0  1

# momentum_v1
BenchmarkTools.Trial: 10000 samples with 925 evaluations.
 Range (min … max):  112.027 ns … 514.099 ns  ┊ GC (min … max): 0.00% … 72.97%
 Time  (median):     113.018 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.742 ns ±  11.667 ns  ┊ GC (mean ± σ):  0.18% ± 1.75%

  ▅█▅▂▃▃▄▃▁▂▁▁                                                  ▁
  █████████████▇██▇▆▆▆▇▆▇▆▇▇▆▆▇▇▇▇▆▇▇▇▇▆▇▆▆▆▆▆▆▅▅▅▆▅▅▅▆▅▅▄▄▄▅▅▄ █
  112 ns        Histogram: log(frequency) by time        147 ns <

 Memory estimate: 32 bytes, allocs estimate: 2.

# momentum_v2
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  10.333 ns … 54.792 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.417 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.588 ns ±  1.594 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █
  ▃█▄▂▂▂▂▂▁▁▂▂▂▂▁▁▁▂▂▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  10.3 ns         Histogram: frequency by time        14.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I don’t know why this is happening, but I did notice that the second argument v2 is not prefixed by $ in one of the benchmarks, which I assume is not what you intended. I made this change and the benchmark ran a little faster.

I looked at the code using @code_warntype for both functions. They generate nearly identical code, and nothing stands out as being obviously inefficient. Maybe a compiler guru can explain what’s going on.

I think you should avoid the argument ::Type{Grid{N1,N2}}; it requires some allocations for dispatch. If you instead use momentum_v1(config, ::Grid{N1,N2}), there are no allocations and it’s as fast as the Val version, though you must pass it a Grid object rather than the type (a sketch is below the link).

https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing
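
For concreteness, a minimal sketch of that instance-based variant (the name momentum_v1_inst is mine, and it assumes Grid can be constructed with no arguments, as in the stripped-down MWE above):

#dispatch on a Grid instance rather than on Type{Grid{N1,N2}}
function momentum_v1_inst(config, ::Grid{N1,N2}) where {N1,N2}
    K1, K2 = 0, 0
    while config != 0
        place = trailing_zeros(config) #place of last set 1
        k2, k1 = fldmod(place,N1)
        K1 += k1
        K2 += k2
        config = config & (config-1) #remove last set 1
    end
    return GridInd{Grid{N1,N2}}(K1,K2)
end

grid_instance = Grid{3,7}() #note: an instance, not the type
@benchmark momentum_v1_inst($config, $grid_instance)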

brianguenter Thank you for catching that typo. The benchmarks above now use the proper interpolation for v2.

lmiq The page you linked seems very helpful. It claims

#but this will [specialize]

function g_type(t::Type{T}) where T
    x = ones(T, 10)
    return sum(map(sin, x))
end

which I thought covered the case I am in here.


After looking into this, it turns out to be an artifact of the benchmarking, not an actual slowdown in a real program.

@benchmark momentum_v1($config,$(Ref{Type{Grid{3,7}}}(grid))[])

is fine.

To give another way of avoiding the benchmarking artifacts:

julia> foo1(config) = momentum_v1(config, Grid{3,7});

julia> foo2(config) = momentum_v2(config, Val(3), Val(7));

julia> @btime foo1($config)
  7.450 ns (0 allocations: 0 bytes)
GridInd{Grid{3, 7}}(2, 2)

julia> @btime foo2($config)
  7.438 ns (0 allocations: 0 bytes)
GridInd{Grid{3, 7}}(2, 2)

The exact details of the benchmarking loop body are somewhat of a black art. Generally, if results turn out surprising, I would advise writing your own benchmarking loop. That way, you control inlining and what is known to the compiler, and you know whether you’re benchmarking throughput or also latency. It makes a big difference whether you need to apply a function to lots of elements independently (throughput!) or whether the next function application needs the result of the last as input (latency!). You can also guess at the state of the caches, branch predictor, etc., and adjust to more closely match your expected real-world state. A sketch of such a hand-written loop is below.
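
To illustrate (the function names and the hard-coded Grid{3,7} literal below are mine, not from the thread), a hand-written loop could look something like this, with one variant measuring throughput and one measuring latency:

#rough sketch; reported times include the rand and loop overhead
function my_bench_throughput(N)
    acc = 0
    t0 = time_ns()
    for _ in 1:N
        config = rand(UInt64) & (1<<22 - 1)
        acc += momentum_v1(config, Grid{3,7}).n1 #independent iterations: throughput
    end
    return (time_ns() - t0)/N, acc #approximate ns per call
end

function my_bench_latency(N)
    config = UInt64(4325)
    t0 = time_ns()
    for _ in 1:N
        K = momentum_v1(config, Grid{3,7})
        config = UInt64(K.n1 + K.n2 + 1) #next input depends on the previous result: latency
    end
    return (time_ns() - t0)/N, config #approximate ns per call
end

Writing Grid{3,7} as a literal keeps it a compile-time constant inside the loop, just like foo1 above.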


While that speeds up the microbenchmark, the performance I get from putting the momentum functions inside another function is quite different between the two cases.

using Random
function do_it_many_times(N,grid)
    a = 0
    Random.seed!(4432543)
    for n in 1:N
        config = rand(UInt64) & (1<<22 -1)
        K = momentum_v1(config,grid)
        a += K.n1 #prevent compiler from trivializing loop
    end
    return a
end

function do_it_many_times2(N,v1,v2)
    a = 0
    Random.seed!(4432543)
    for n in 1:N
        config = rand(UInt64) & (1<<22 -1)
        K = momentum_v2(config,v1,v2)
        a += K.n1 #prevent compiler from trivializing loop
    end
    return a
end


N = 1000
gd = Grid{3,7}
@btime do_it_many_times($N,$gd)
@btime do_it_many_times($N,$(Ref{Type{Grid{3,7}}}(gd))[])
v1, v2 = Val(3), Val(7)
@btime do_it_many_times2($N,$v1,$v2)

@assert do_it_many_times(N,gd) == do_it_many_times2(N,v1,v2)

yields

124.791 μs (2008 allocations: 31.75 KiB)
124.708 μs (2006 allocations: 31.72 KiB)
18.583 μs (6 allocations: 480 bytes)

Unless there’s something wrong with these additional benchmarks, there really does seem to be a performance difference here.

But here you might run into the no-specialization case, because the grid argument is a DataType, so Julia does not specialize the outer function on it. I.e. do_it_many_times must do a dynamic dispatch on momentum_v1, I think, resulting in the allocations.
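
If that is what’s happening, one way to force specialization (following the g_type pattern quoted above from the performance tips) would be to give the outer function a type parameter; a sketch, with a hypothetical name:

function do_it_many_times_spec(N, ::Type{G}) where {G}
    a = 0
    Random.seed!(4432543)
    for n in 1:N
        config = rand(UInt64) & (1<<22 -1)
        K = momentum_v1(config, G) #G is now a compile-time constant, so this dispatch is static
        a += K.n1 #prevent compiler from trivializing loop
    end
    return a
end

@btime do_it_many_times_spec($N, $gd)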


Ahh, I see!

function do_it_many_times3(N,grid :: T) where {T}
    a = 0
    Random.seed!(4432543)
    for n in 1:N
        config = rand(UInt64) & (1<<22 -1)
        K = momentum_v1(config,grid)
        a += K.n1 #prevent compiler from trivializing loop
    end
    return a
end
@btime do_it_many_times($N,$gd)
@btime do_it_many_times2($N,$v1,$v2)
@btime do_it_many_times3($N,$gd)

yields

94.584 μs (2006 allocations: 31.72 KiB)
14.333 μs (6 allocations: 480 bytes)
14.667 μs (8 allocations: 512 bytes)

Thank you all for helping to resolve this!
