In what sense, when? Could you give an example?
Is this something hard to see in microbenchmarks of just the generated function, and only apparent when it gets called from elsewhere?
Because generated functions can be inlined, I would have thought LLVM has all the information?
Or does this have something to do with Julia's front end, e.g. constant propagation?
Does this apply at all to macros?
One of the most common reasons for me to write generated functions is to get increased performance. A simple example:
using VectorizationBase, SIMDPirates
function regularized_cov_block_quote(W::Int, T, reps_per_block::Int, stride, mask_last::Bool = false, mask = :r)
    # loads from ptr_smpl
    # stores through ptr_mean and ptr_vars
    # assumes N, vNinv, vNm1inv, ptr_smpl, ptr_mean, and ptr_vars are defined in the surrounding scope
    reps_per_block -= 1
    size_T = sizeof(T)
    WT = size_T * W
    V = Vec{W,T}
    quote
        $([Expr(:(=), Symbol(:μ_,i), :(vload($V, ptr_smpl + $(WT*i), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:Σδ_,i), :(vbroadcast($V, zero($T)))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:Σδ²_,i), :(vbroadcast($V, zero($T)))) for i ∈ 0:reps_per_block]...)
        for n ∈ 1:N-1
            $([Expr(:(=), Symbol(:δ_,i), :(vsub(vload($V, ptr_smpl + $(WT*i) + n*$stride*$size_T), $(Symbol(:μ_,i))))) for i ∈ 0:reps_per_block]...)
            $([Expr(:(=), Symbol(:Σδ_,i), :(vadd($(Symbol(:δ_,i)), $(Symbol(:Σδ_,i))))) for i ∈ 0:reps_per_block]...)
            $([Expr(:(=), Symbol(:Σδ²_,i), :(vmuladd($(Symbol(:δ_,i)), $(Symbol(:δ_,i)), $(Symbol(:Σδ²_,i))))) for i ∈ 0:reps_per_block]...)
        end
        $([Expr(:(=), Symbol(:xbar_,i), :(vmuladd(vNinv, $(Symbol(:Σδ_,i)), $(Symbol(:μ_,i))))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:ΣδΣδ_,i), :(vmul($(Symbol(:Σδ_,i)), $(Symbol(:Σδ_,i))))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:s²_,i), :(vmul(vNm1inv, vfnmadd($(Symbol(:ΣδΣδ_,i)), vNinv, $(Symbol(:Σδ²_,i)))))) for i ∈ 0:reps_per_block]...)
        $([:(vstore!(ptr_mean, $(Symbol(:xbar_,i)), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)); ptr_mean += $WT) for i ∈ 0:reps_per_block]...)
        $([:(vstore!(ptr_vars, $(Symbol(:s²_,i)), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)); ptr_vars += $WT) for i ∈ 0:reps_per_block]...)
        ptr_smpl += $(WT*(reps_per_block+1))
    end
end
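For reference, the quote this builds implements the standard one-pass shifted-data algorithm: each μ_i is the first observation of its row, the loop accumulates Σδ = Σₙ(xₙ − μ) and Σδ² = Σₙ(xₙ − μ)², and then x̄ = μ + Σδ/N and s² = (Σδ² − (Σδ)²/N)/(N − 1). Shifting by the first observation keeps the accumulators small, so this stays numerically well behaved in a single pass.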
@generated function mean_and_var!(
    means::AbstractVector{T}, vars::AbstractVector{T}, sample::AbstractArray{T}
) where {T}
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    V = Vec{W,T}
    quote
        D, N = size(sample); sample_stride = stride(sample, 2)
        @boundscheck if length(means) < D || length(vars) < D
            throw(DimensionMismatch("Size of sample: ($D,$N); length of preallocated mean vector: $(length(means)); length of preallocated var vector: $(length(vars))"))
        end
        ptr_mean = pointer(means); ptr_vars = pointer(vars); ptr_smpl = pointer(sample)
        vNinv = vbroadcast($V, 1/N); vNm1inv = vbroadcast($V, 1/(N-1))
        for _ in 1:(D >>> $(Wshift + 2)) # blocks of 4 vectors
            $(regularized_cov_block_quote(W, T, 4, :sample_stride))
        end
        for _ in 1:((D & $((W << 2) - 1)) >>> $Wshift) # single vectors
            $(regularized_cov_block_quote(W, T, 1, :sample_stride))
        end
        r = D & $(W - 1)
        if r > 0 # masked remainder
            mask = VectorizationBase.mask(T, r)
            $(regularized_cov_block_quote(W, T, 1, :sample_stride, true, :mask))
        end
        nothing
    end
end
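To make the loop structure concrete, assume W = 8 (Float64 with AVX-512; W depends on the host). For D = 200 rows: D >>> (Wshift + 2) = 200 >>> 5 = 6 blocks of four vectors handle the first 192 rows, (D & 31) >>> Wshift = 8 >>> 3 = 1 single-vector iteration handles the next 8, and r = 200 & 7 = 0, so the masked remainder branch does nothing.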
Benchmarking:
julia> using BenchmarkTools, Statistics
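For the benchmarks, assume test data along these lines (the 2_000 column count is my guess here; any reasonably large N shows the same pattern):

julia> A = randn(200, 2_000); # 200 rows of standard-normal draws

julia> x = Vector{Float64}(undef, 200); y = similar(x); # preallocated outputs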
julia> @benchmark mean_and_var!($x,$y,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 39.568 μs (0.00% GC)
median time: 40.256 μs (0.00% GC)
mean time: 43.080 μs (0.00% GC)
maximum time: 227.730 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> x'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
-0.0466792 -0.0455147 0.0170653 0.0011996 0.0307585 0.0308389 0.0303355 -0.0144757 -0.0509977 -0.0120208 -0.0161698 0.0498736 0.0142003 0.0513357 0.00356376 0.0202032 -0.0300317 -0.0591260 … 0.0351895 -0.014007 0.0231309 -0.00640476 0.0121385 0.00250655 0.00367508 -0.0373912 -0.00839410 -0.00719569 -0.0306729 0.0163719 -0.038363 0.0357159 0.0111598 0.00553716 -0.018665 0.0148885
julia> mean(A, dims = 2)'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
-0.0466792 -0.0455147 0.0170653 0.0011996 0.0307585 0.0308389 0.0303355 -0.0144757 -0.0509977 -0.0120208 -0.0161698 0.0498736 0.0142003 0.0513357 0.00356376 0.0202032 -0.0300317 -0.0591260 … 0.0351895 -0.014007 0.0231309 -0.00640476 0.0121385 0.00250655 0.00367508 -0.0373912 -0.00839410 -0.00719569 -0.0306729 0.0163719 -0.038363 0.0357159 0.0111598 0.00553716 -0.018665 0.0148885
julia> y'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
1.02852 1.02274 1.00838 1.06236 1.04392 0.951408 1.02583 0.995716 1.03187 1.046 1.02397 1.02082 0.991599 0.937852 0.985895 1.03206 0.979809 1.00042 1.0083 1.00608 1.02262 1.00769 0.951676 1.01429 … 0.981213 0.993444 1.08527 0.976448 1.01732 0.942424 1.05196 1.0542 0.972378 0.991214 0.965925 0.981092 0.938367 0.996919 1.07532 0.939985 1.00628 0.994173 0.976612 0.970468 1.02659
julia> var(A, dims = 2)'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
1.02852 1.02274 1.00838 1.06236 1.04392 0.951408 1.02583 0.995716 1.03187 1.046 1.02397 1.02082 0.991599 0.937852 0.985895 1.03206 0.979809 1.00042 1.0083 1.00608 1.02262 1.00769 0.951676 1.01429 … 0.981213 0.993444 1.08527 0.976448 1.01732 0.942424 1.05196 1.0542 0.972378 0.991214 0.965925 0.981092 0.938367 0.996919 1.07532 0.939985 1.00628 0.994173 0.976612 0.970468 1.02659
julia> @benchmark mean!($y,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 50.627 μs (0.00% GC)
median time: 51.388 μs (0.00% GC)
mean time: 51.841 μs (0.00% GC)
maximum time: 136.623 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark var($A, dims = 2, mean = $y)
BenchmarkTools.Trial:
memory estimate: 3.91 KiB
allocs estimate: 14
--------------
minimum time: 107.394 μs (0.00% GC)
median time: 110.219 μs (0.00% GC)
mean time: 111.107 μs (0.00% GC)
maximum time: 242.747 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
It is about 25% faster at computing both the mean and the variance (≈40 μs vs. ≈51 μs minimum) than Statistics.mean! is at computing just the mean. [EDIT for good measure:
julia> @benchmark sum!($x,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.718 μs (0.00% GC)
median time: 54.253 μs (0.00% GC)
mean time: 54.750 μs (0.00% GC)
maximum time: 139.711 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
]
Looking at this particular example, I think it'd actually be fairly easy to make it not @generated, so I really didn't try hard enough.
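For example, here is a minimal sketch of the idea, assuming Float64 and the same SIMDPirates functions used above (block_moments is a hypothetical name, not part of either package): the per-block unrolling can be written with ntuple and Val instead of spliced expressions, since the compiler unrolls tuple operations whose length is a compile-time constant.

using VectorizationBase, SIMDPirates

# Hypothetical sketch: one unrolled block of the shifted-data accumulation,
# without @generated. W is the vector width, R the number of vectors per block.
@inline function block_moments(ptr_smpl::Ptr{Float64}, N, stride_bytes, ::Val{W}, ::Val{R}) where {W,R}
    T = Float64
    V = Vec{W,T}
    WT = W * sizeof(T)
    # first observation of each row serves as the shift μ
    μ   = ntuple(i -> vload(V, ptr_smpl + (i - 1) * WT), Val(R))
    Σδ  = ntuple(_ -> vbroadcast(V, zero(T)), Val(R))
    Σδ² = ntuple(_ -> vbroadcast(V, zero(T)), Val(R))
    for n ∈ 1:N-1
        δ   = ntuple(i -> vsub(vload(V, ptr_smpl + (i - 1) * WT + n * stride_bytes), μ[i]), Val(R))
        Σδ  = map(vadd, δ, Σδ)          # tuple map unrolls for known R
        Σδ² = map(vmuladd, δ, δ, Σδ²)
    end
    μ, Σδ, Σδ² # caller combines these into x̄ and s² as in the quote above
end

Whether this actually matches the generated version's performance would need checking; ntuple-heavy code sometimes fails to unroll or stay type stable, which is exactly why I reached for @generated in the first place.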
But on the other hand, is it worth spending my time de-@generating functions like these?
Perhaps I can expect better compile times (and I have been suffering from bad compile times, so that is a definite win), but in which situations can I expect better run time performance, too?