While trying to optimize the hot loop of my code, I ran into a very strange performance footgun that is reducing the speed of a simple function by at least ~100ns, which is a 10x performance hit for my use case.
Question: Why do the two almost-identical functions below have such different performance?
Background: I am working with binary configurations over a small grid (e.g. 3x7 or 6x8). Configurations are encoded in the low bits of a UInt64
or similar. Since the grid size is known in advance, I use a parametric struct grid = Grid{N1,N2}
to represent it, and pass the grid to functions so they can specialize on it (since this is in the hot loop). For example: do_something(data, grid :: Type{Grid{N1,N2}) where {N1,N2}
. However, this seems to take ~100ns longer than do_something(data, ::Val{N1}, ::Val{N2}) where {N1,N2}
. I would have thought that, since N1
and N2
are known at compile time, these approaches would be completely equivalent, but the second is almost 10x faster. Whatβs going on here? Is there a way to make something in the style of the first function fast? (Also, are there any other ways to speed my function up?)
Minimal Working Example:
struct Grid{N1,N2}
#other data here
end
struct GridInd{G <: Grid}
n1 :: Int8
n2 :: Int8
function GridInd{Grid{N1,N2}}(n1, n2) where {N1,N2}
return new{Grid{N1,N2}}(mod(n1,N1),mod(n2,N2))
end
end
#just for display
showconfig(config, gd :: Type{Grid{N1,N2}}) where {N1,N2} = reverse(reshape(digits(config,base=2,pad=N1*N2),N1,N2),dims=2)'
function momentum_v1(config, ::Type{Grid{N1,N2}}) where {N1,N2}
K1, K2 = 0, 0
while config != 0
place = trailing_zeros(config) #place of last set 1
k2, k1 = fldmod(place,N1)
K1 += k1
K2 += k2
config = config & (config-1) #remove last set 1
end
return GridInd{Grid{N1,N2}}(K1,K2)
end
#only the function signature is different than v1
function momentum_v2(config, ::Val{N1}, ::Val{N2}) where {N1,N2}
K1, K2 = 0, 0
while config != 0
place = trailing_zeros(config) #place of last set 1
k2, k1 = fldmod(place,N1)
K1 += k1
K2 += k2
config = config & (config-1) #remove last set 1
end
return GridInd{Grid{N1,N2}}(K1,K2)
end
grid = Grid{3,7}
config = UInt64(4325) #typical example configuration
display(showconfig(config,grid)) #show configuration for clarity
using BenchmarkTools
@benchmark momentum_v1($config,$grid)
v1,v2= Val(3),Val(7)
@benchmark momentum2($config,$v1,$v2) #fixed interpolations
This yields:
#example configuration
7Γ3 adjoint(::Matrix{Int64}) with eltype Int64:
0 0 0
0 0 0
1 0 0
0 0 0
1 1 0
0 0 1
1 0 1
# momentum_v1
BenchmarkTools.Trial: 10000 samples with 925 evaluations.
Range (min β¦ max): 112.027 ns β¦ 514.099 ns β GC (min β¦ max): 0.00% β¦ 72.97%
Time (median): 113.018 ns β GC (median): 0.00%
Time (mean Β± Ο): 115.742 ns Β± 11.667 ns β GC (mean Β± Ο): 0.18% Β± 1.75%
β
ββ
βββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββ
β
β
ββ
β
β
ββ
β
ββββ
β
β β
112 ns Histogram: log(frequency) by time 147 ns <
Memory estimate: 32 bytes, allocs estimate: 2.
# momentum_v2
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min β¦ max): 10.333 ns β¦ 54.792 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 10.417 ns β GC (median): 0.00%
Time (mean Β± Ο): 10.588 ns Β± 1.594 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
10.3 ns Histogram: frequency by time 14.7 ns <
Memory estimate: 0 bytes, allocs estimate: 0.