In what sense, when? Could you give an example?
Is this something hard to see in microbenchmarks of just the generated function, and only apparent when it gets called from elsewhere?
Because generated functions can be inlined, I would have thought LLVM has all the information?
Or does this have something to do with Julia's front end, e.g. constant propagation?
Does this apply at all to macros?
One of the most common reasons for me to write generated functions is to get increased performance. A simple example:
using VectorizationBase, SIMDPirates
function regularized_cov_block_quote(W::Int, T, reps_per_block::Int, stride, mask_last::Bool = false, mask = :r)
    # loads from ptr_smpl
    # stores through ptr_mean and ptr_vars
    # assumes N, vNinv, vNm1inv, ptr_smpl, ptr_mean, and ptr_vars are defined in the surrounding scope
    reps_per_block -= 1
    size_T = sizeof(T)
    WT = size_T * W
    V = Vec{W,T}
    quote
        $([Expr(:(=), Symbol(:μ_,i), :(vload($V, ptr_smpl + $(WT*i), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:Σδ_,i), :(vbroadcast($V, zero($T)))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:Σδ²_,i), :(vbroadcast($V, zero($T)))) for i ∈ 0:reps_per_block]...)
        for n ∈ 1:N-1
            $([Expr(:(=), Symbol(:δ_,i), :(vsub(vload($V, ptr_smpl + $(WT*i) + n*$stride*$size_T), $(Symbol(:μ_,i))))) for i ∈ 0:reps_per_block]...)
            $([Expr(:(=), Symbol(:Σδ_,i), :(vadd($(Symbol(:δ_,i)), $(Symbol(:Σδ_,i))))) for i ∈ 0:reps_per_block]...)
            $([Expr(:(=), Symbol(:Σδ²_,i), :(vmuladd($(Symbol(:δ_,i)), $(Symbol(:δ_,i)), $(Symbol(:Σδ²_,i))))) for i ∈ 0:reps_per_block]...)
        end
        $([Expr(:(=), Symbol(:xbar_,i), :(vmuladd(vNinv, $(Symbol(:Σδ_,i)), $(Symbol(:μ_,i))))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:ΣδΣδ_,i), :(vmul($(Symbol(:Σδ_,i)), $(Symbol(:Σδ_,i))))) for i ∈ 0:reps_per_block]...)
        $([Expr(:(=), Symbol(:s²_,i), :(vmul(vNm1inv, vfnmadd($(Symbol(:ΣδΣδ_,i)), vNinv, $(Symbol(:Σδ²_,i)))))) for i ∈ 0:reps_per_block]...)
        $([:(vstore!(ptr_mean, $(Symbol(:xbar_,i)), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)); ptr_mean += $WT) for i ∈ 0:reps_per_block]...)
        $([:(vstore!(ptr_vars, $(Symbol(:s²_,i)), $([mask for _ ∈ 1:((i==reps_per_block) & mask_last)]...)); ptr_vars += $WT) for i ∈ 0:reps_per_block]...)
        ptr_smpl += $(WT*(reps_per_block+1))
    end
end
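For reference, the quote this builds implements the standard one-pass shifted-data algorithm: each μ_i is the first observation of its row, the loop accumulates Σδ = Σₙ(xₙ − μ) and Σδ² = Σₙ(xₙ − μ)², and then x̄ = μ + Σδ/N and s² = (Σδ² − (Σδ)²/N)/(N − 1). Shifting by the first observation keeps the accumulators small, so this stays numerically well behaved in a single pass.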
@generated function mean_and_var!(
    means::AbstractVector{T}, vars::AbstractVector{T}, sample::AbstractArray{T}
) where {T}
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    V = Vec{W,T}
    quote
        D, N = size(sample); sample_stride = stride(sample, 2)
        @boundscheck if length(means) < D || length(vars) < D
            throw(DimensionMismatch("Size of sample: ($D,$N); length of preallocated mean vector: $(length(means)); length of preallocated var vector: $(length(vars))"))
        end
        ptr_mean = pointer(means); ptr_vars = pointer(vars); ptr_smpl = pointer(sample)
        vNinv = vbroadcast($V, 1/N); vNm1inv = vbroadcast($V, 1/(N-1))
        for _ in 1:(D >>> $(Wshift + 2)) # blocks of 4 vectors
            $(regularized_cov_block_quote(W, T, 4, :sample_stride))
        end
        for _ in 1:((D & $((W << 2) - 1)) >>> $Wshift) # single vectors
            $(regularized_cov_block_quote(W, T, 1, :sample_stride))
        end
        r = D & $(W - 1)
        if r > 0 # masked remainder
            mask = VectorizationBase.mask(T, r)
            $(regularized_cov_block_quote(W, T, 1, :sample_stride, true, :mask))
        end
        nothing
    end
end
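To make the loop structure concrete, assume W = 8 (Float64 with AVX-512; W depends on the host). For D = 200 rows: D >>> (Wshift + 2) = 200 >>> 5 = 6 blocks of four vectors handle the first 192 rows, (D & 31) >>> Wshift = 8 >>> 3 = 1 single-vector iteration handles the next 8, and r = 200 & 7 = 0, so the masked remainder branch does nothing.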
Benchmarking:
julia> using BenchmarkTools, Statistics
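For the benchmarks, assume test data along these lines (the 2_000 column count is my guess here; any reasonably large N shows the same pattern):

julia> A = randn(200, 2_000); # 200 rows of standard-normal draws

julia> x = Vector{Float64}(undef, 200); y = similar(x); # preallocated outputs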
julia> @benchmark mean_and_var!($x,$y,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 39.568 μs (0.00% GC)
median time: 40.256 μs (0.00% GC)
mean time: 43.080 μs (0.00% GC)
maximum time: 227.730 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> x'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
-0.0466792 -0.0455147 0.0170653 0.0011996 0.0307585 0.0308389 0.0303355 -0.0144757 -0.0509977 -0.0120208 -0.0161698 0.0498736 0.0142003 0.0513357 0.00356376 0.0202032 -0.0300317 -0.0591260 … 0.0351895 -0.014007 0.0231309 -0.00640476 0.0121385 0.00250655 0.00367508 -0.0373912 -0.00839410 -0.00719569 -0.0306729 0.0163719 -0.038363 0.0357159 0.0111598 0.00553716 -0.018665 0.0148885
julia> mean(A, dims = 2)'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
-0.0466792 -0.0455147 0.0170653 0.0011996 0.0307585 0.0308389 0.0303355 -0.0144757 -0.0509977 -0.0120208 -0.0161698 0.0498736 0.0142003 0.0513357 0.00356376 0.0202032 -0.0300317 -0.0591260 … 0.0351895 -0.014007 0.0231309 -0.00640476 0.0121385 0.00250655 0.00367508 -0.0373912 -0.00839410 -0.00719569 -0.0306729 0.0163719 -0.038363 0.0357159 0.0111598 0.00553716 -0.018665 0.0148885
julia> y'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
1.02852 1.02274 1.00838 1.06236 1.04392 0.951408 1.02583 0.995716 1.03187 1.046 1.02397 1.02082 0.991599 0.937852 0.985895 1.03206 0.979809 1.00042 1.0083 1.00608 1.02262 1.00769 0.951676 1.01429 … 0.981213 0.993444 1.08527 0.976448 1.01732 0.942424 1.05196 1.0542 0.972378 0.991214 0.965925 0.981092 0.938367 0.996919 1.07532 0.939985 1.00628 0.994173 0.976612 0.970468 1.02659
julia> var(A, dims = 2)'
1×200 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
1.02852 1.02274 1.00838 1.06236 1.04392 0.951408 1.02583 0.995716 1.03187 1.046 1.02397 1.02082 0.991599 0.937852 0.985895 1.03206 0.979809 1.00042 1.0083 1.00608 1.02262 1.00769 0.951676 1.01429 … 0.981213 0.993444 1.08527 0.976448 1.01732 0.942424 1.05196 1.0542 0.972378 0.991214 0.965925 0.981092 0.938367 0.996919 1.07532 0.939985 1.00628 0.994173 0.976612 0.970468 1.02659
julia> @benchmark mean!($y,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 50.627 μs (0.00% GC)
median time: 51.388 μs (0.00% GC)
mean time: 51.841 μs (0.00% GC)
maximum time: 136.623 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark var($A, dims = 2, mean = $y)
BenchmarkTools.Trial:
memory estimate: 3.91 KiB
allocs estimate: 14
--------------
minimum time: 107.394 μs (0.00% GC)
median time: 110.219 μs (0.00% GC)
mean time: 111.107 μs (0.00% GC)
maximum time: 242.747 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
It is about 25% faster at computing both the mean and the variance (≈40 μs vs. ≈51 μs minimum) than Statistics.mean! is at computing just the mean. [EDIT for good measure:
julia> @benchmark sum!($x,$A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.718 μs (0.00% GC)
median time: 54.253 μs (0.00% GC)
mean time: 54.750 μs (0.00% GC)
maximum time: 139.711 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
]
Looking at this particular example, I think it'd actually be fairly easy to make it not @generated, so I really didn't try hard enough.
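For example, here is a minimal sketch of the idea, assuming Float64 and the same SIMDPirates functions used above (block_moments is a hypothetical name, not part of either package): the per-block unrolling can be written with ntuple and Val instead of spliced expressions, since the compiler unrolls tuple operations whose length is a compile-time constant.

using VectorizationBase, SIMDPirates

# Hypothetical sketch: one unrolled block of the shifted-data accumulation,
# without @generated. W is the vector width, R the number of vectors per block.
@inline function block_moments(ptr_smpl::Ptr{Float64}, N, stride_bytes, ::Val{W}, ::Val{R}) where {W,R}
    T = Float64
    V = Vec{W,T}
    WT = W * sizeof(T)
    # first observation of each row serves as the shift μ
    μ   = ntuple(i -> vload(V, ptr_smpl + (i - 1) * WT), Val(R))
    Σδ  = ntuple(_ -> vbroadcast(V, zero(T)), Val(R))
    Σδ² = ntuple(_ -> vbroadcast(V, zero(T)), Val(R))
    for n ∈ 1:N-1
        δ   = ntuple(i -> vsub(vload(V, ptr_smpl + (i - 1) * WT + n * stride_bytes), μ[i]), Val(R))
        Σδ  = map(vadd, δ, Σδ)          # tuple map unrolls for known R
        Σδ² = map(vmuladd, δ, δ, Σδ²)
    end
    μ, Σδ, Σδ² # caller combines these into x̄ and s² as in the quote above
end

Whether this actually matches the generated version's performance would need checking; ntuple-heavy code sometimes fails to unroll or stay type stable, which is exactly why I reached for @generated in the first place.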
But on the other hand, is it worth spending my time de-@generating functions like these?
Perhaps I can expect better compile times (and I have been suffering from bad compile times, so that is a definite win), but in which situations can I expect better run time performance, too?