Tricks to compile away step in calculation?

Is there a way to compile away a step in a calculation internal to some function called in a closure, without manually decomposing the function into its constituent calculations within the closure?

Below I provide a MWE using OLS, because it is the simplest example I could think of: suppose we are going to run a loop of regressions using the same covariates but different outcome variables. We use a closure to create a new function that treats the covariates as constant.

my_pinv(X) = (X'X)^-1 * X'  # yes I know pinv is faster, this is just to drive a larger computational time discrepancy
ols(X, Y) = my_pinv(X) * Y 

N = 1_000
K = 100 
Y_vars = 100
X = hcat([rand(N) for k in 1:K]...)
Ys = hcat.([rand(N) for k in 1:Y_vars])

pre_computable_time = @timed pinv1 = my_pinv(X) 
time_faster_step = @timed pinv1 * Ys[1] 




# want to avoid: 
slow_time =  Y_vars * (pre_computable_time.time + time_faster_step.time) 
Ys .|> y -> ols(X, y) 

# want to achieve: 
pre_computable_time.time + (Y_vars * time_faster_step.time)

closed_ols = let x = X
    y -> ols(x, y) 
end 

@time Ys .|> closed_ols
# yields no improvement over the slow time.

# Although it works, I do not want to resort to rewriting the ols function in my let block 
closed_ols2 = let x = X 
    x_pinv = my_pinv(x)
    y ->  x_pinv * y 
end 
@time Ys .|> closed_ols2 

Naively, I would expect that on the first call of closed_ols, the compiler would calculate my_pinv(X) and use this result in the method it generates for all subsequent calls of closed_ols, given that the let block imposes that x is constant, and therefore the my_pinv calculation within ols will be identical for every call.

Is there any way to force the closure to precompute my_pinv(x) given x = X, so as to avoid recomputing this within the loop, but without explicitly rewriting the function ols within the closure? Is this even feasible? Obviously, I could manually rewrite the ols function within the closure (final example above), or I could create datatypes that store intermediate calculations, but I’d like a convenient way to abstract away from those things.

This question was motivated by previous discussion on anonymous closures. If I can force the compiler to precompute pinv for closed_ols, then I could achieve performance goals with:

Ys .|> (let x=X; y -> ols(x, y) end)

while also maximizing reuse of package code and abstraction layers provided by package code. Creating the closure within the piped function would also allow the closure to be garbage collected sometime after ols is calculated.

This would be even better with syntactic sugar, which other people have proposed in other threads:

Ys .|>  ols(X, _) 

Which would allow high performance, high readability, and high code reuse.
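(As an aside, I know the argument-fixing part can already be spelled with Base.Fix1 today; but, like the plain closure, it only fixes the first argument and does not precompute my_pinv(X):)

Ys .|> Base.Fix1(ols, X)   # behaves like y -> ols(X, y); no precomputation happens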

This category of problem shows up all the time in my workflows. The MWE used ols, but I am interested in a more general answer.

You need to do this manually:

julia> ols_piv(X_piv, Y) = X_piv * Y
ols_piv (generic function with 1 method)

julia> closed_ols = let x = my_pinv(X)
           y -> ols_piv(x, y)
       end
#23 (generic function with 1 method)

julia> @time Ys .|> closed_ols
  0.027778 seconds (76.09 k allocations: 5.229 MiB, 98.64% compilation time)

btw,

maybe next time post smaller numbers for a MWE

Thanks…

But what I would like to do is avoid manually decomposing and rewriting parts of the functions. What you are suggesting works for the simplified problem above. It doesn’t work, practically, if I am using functions written in another package (e.g. lm from GLM.jl) – I’d have to figure out how the internals of GLM.jl work, and I’d have to reimplement parts of the functions in GLM. My question isn’t really about looping through OLS. It’s about understanding whether there are generic ways I can use the Julia language to precompute computationally intensive steps without sacrificing the advantages of functions that abstract away the various intermediate calculations.

And sorry about large numbers in MWE… corrected… didn’t mean to tie up your repl… :frowning:

If there’s a very obvious re-use interface that GLM.jl does not expose, then you should open a feature request. For example, this is why we have rand() and rand!() – you shouldn’t need to copy-paste the source code of rand() in order to reuse a buffer array.

Depending on your application, maybe memoization can help:

Thanks for recommending the Memoize.jl package; this looks helpful… but probably not for my specific question – caching solutions for ols(x, y) won’t improve performance when x is constant but y varies.
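A minimal sketch of what memoizing the MWE’s ols would look like (assuming Memoize.jl; memo_ols is just an illustrative name), to spell out why it doesn’t apply here:

using Memoize

@memoize function memo_ols(X, Y)
    my_pinv(X) * Y
end
# The cache is keyed on the full argument tuple (X, Y). With X fixed but a new Y
# on every call, every call is a cache miss, so my_pinv(X) is recomputed anyway.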

I’m guessing the answer to my (edited) question:

Is there any way to force the closure to precompute my_pinv(x) given x=X so as to avoid recomputing this within the loop, but doing so without explicitly rewriting the function ols within the closure?

is “no”. Is this category of thing infeasible for a programming language like Julia? Or is it a feature one could hope for in a later version of Julia?

In general

If you want the compiler to do something about this, you basically have to put the known argument (X) into the type domain. Then the compiler would be able to do constant propagation on X. However, this comes with significant caveats:

  1. Putting X into the type domain means putting it into a type parameter of a type. Only types, symbols, “plain data” (isbits) values and tuples thereof may be type parameters. In your case, this means that you can’t use Array, but you could use a static array, from the StaticArrays package. Use the isbits and isbitstype functions while developing to find out what is and what isn’t “plain data”.

  2. Getting the compiler involved means that you’ll have to pay the price of compilation and run time dispatch when constructing the closure. This may or may not be worth it, depending on how often you’ll run the closure after it’s constructed at great cost.

  3. For the Julia compiler to do constant propagation on a function (my_pinv in this case), it needs to know certain things about the function. If the compiler is not able to infer the necessary facts on its own, you may need to help it: Base.@assume_effects. While Base.@assume_effects is available in released Julia, IMO it’s significantly improved in the beta prerelease (v1.10), especially because there it’s possible to use Base.@assume_effects within the body of a function, which is safer. The function Base.infer_effects, which is not part of Julia’s public interface, can be used to find out what facts (“effects”) are inferred for a function. (A short sketch follows after this list.)

  4. The Julia compiler doesn’t constant fold functions with @inbounds annotations. I hope a way around this will be made available.

These preconditions are tough to satisfy, but then, you’re definitely asking for a lot.
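Regarding point 3, here is a minimal sketch of what inspecting and asserting effects could look like for the my_pinv from the MWE. Base.infer_effects is internal, so its output format varies across versions; the method-level form of Base.@assume_effects shown here is the pre-1.10 style, and asserting effects the compiler can’t verify is a promise on your part.

# Inspect which effects the compiler currently infers for my_pinv (internal API,
# output format differs between Julia versions):
Base.infer_effects(my_pinv, (Matrix{Float64},))

# Pre-1.10 style: assert effects for a whole method definition.
# (my_pinv_asserted is an illustrative name; :foldable is the setting relevant
# to constant folding, and a wrong assertion is undefined behavior.)
Base.@assume_effects :foldable my_pinv_asserted(X) = (X'X)^-1 * X'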

A simple example

This is an example that’s simpler than yours, but contrived.

# if necessary you can use a custom type instead of `Val`
to_value_domain(::Val{v}) where {v} = v

# this is instead of `ols` in your example
my_function(x, y) = (sin ∘ cos ∘ sin ∘ sin ∘ cos ∘ sin ∘ sin ∘ cos)(x) * y

# the usual way
function closed_function_without_run_time_dispatch(x)
  let x = x
    y -> my_function(x, y)
  end
end

# the "getting the compiler involved" way
function closed_function_with_run_time_dispatch(x)
  # warning: run time dispatch
  let x = Val(x)
    y -> my_function(to_value_domain(x), y)
  end
end
julia> f = closed_function_without_run_time_dispatch(0.3)
#1 (generic function with 1 method)

julia> g = closed_function_with_run_time_dispatch(0.3)
#3 (generic function with 1 method)

julia> f(0.1)
0.07238243561570594

julia> g(0.1)
0.07238243561570594

julia> @code_typed g(0.1)
CodeInfo(
1 ─ %1 = Base.mul_float(0.7238243561570593, y)::Float64
└──      return %1
) => Float64

julia> @code_typed f(0.1)
CodeInfo(
1 ─ %1 = Core.getfield(#self#, Symbol("#1#x"))::Float64
│   %2 = invoke Base.:(var"#_#103")($(QuoteNode(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()))::Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, sin ∘ cos ∘ sin ∘ sin ∘ cos ∘ sin ∘ sin ∘ cos::ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{ComposedFunction{typeof(sin), typeof(cos)}, typeof(sin)}, typeof(sin)}, typeof(cos)}, typeof(sin)}, typeof(sin)}, typeof(cos)}, %1::Float64)::Float64
│   %3 = Base.mul_float(%2, y)::Float64
└──      return %3
) => Float64

Moving arrays into the type domain

As discussed above, Array is mutable, so not “plain data”. You could use something like this for moving arrays first into “plain data”, and then into the type domain:

using StaticArrays

function to_isbits(m::AbstractMatrix)
  s = size(m)

  # warning: run time dispatch
  SMatrix{s...}(m)
end

function to_type_domain(m::Matrix)
  # warning: run time dispatch
  Val{to_isbits(m)}()
end
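To tie this back to the OLS MWE, a hedged usage sketch (X_small and closed_ols_val are illustrative names; the matrix is deliberately tiny because compile time grows quickly with size, and to_value_domain is the helper from the simple example above). Whether my_pinv actually gets folded away still depends on the effects preconditions listed earlier:

X_small = rand(4, 2)   # tiny on purpose: compile time grows rapidly with size

closed_ols_val = let x = to_type_domain(X_small)   # warning: run time dispatch
    y -> ols(to_value_domain(x), y)                # ols / my_pinv from the MWE
end

closed_ols_val(rand(4))   # first call pays compilation; later calls reuse the specialized method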

Comparison with other languages

AFAIK Julia is currently the only programming language which allows the programmer to do something like what I just described. In other languages you’d have to use metaprogramming, or templates or constexpr in C++, for example, while in Julia you can move values into the type domain. This is more powerful, because in other languages the precomputation would have to happen before or during compilation, while in Julia run time and compilation time are arbitrarily interleaved.

7 Likes

This is excellent on many levels. Thank you so much for your detailed response!

I’d ask you to elaborate on how to use Base.@assume_effects for a closure, but you’ve already provided so much, and I think I might be able to figure things out from here.

1 Like

May I ask a follow-up question?

In the example below, I define a function without any arguments, where the returned value is computed from a declared constant. Naively, I expected that the compiled method const_prod_inv() would store the inverse product as a 20 × 20 matrix, merely returning this matrix on each call. This does not happen – the function computes the product every time it is called, despite the fact that, I would think, the compiler should know that the calculations in the function body depend only upon constants and thus can never be different from what they are the first time they are computed.

Can a macro such as…

While Base.@assume_effects is available in released Julia, IMO it’s significantly improved in the beta prerelease (v1.10), especially because there it’s possible to use Base.@assume_effects within the body of a function, which is safer.

… force the compiler to compute the result for the method body? If not, is there a fundamental limitation of the language or is this the sort of thing we might expect future versions of Julia to support?

Is my question about “constant propagation”? I’m fuzzy about what, exactly, that means…

I am asking because if it is likely that functions will be able to store computations of constants within method definitions in the future, then that will allow me to develop much simpler code which would be performant in future releases with minimal refactoring.

N = 1_000_000
K = 20 
X = hcat([rand(N) for k in 1:K]...)

const X_const2 = X 
function const_prod_inv() 
    result = (X_const2' * X_const2)^-1 
    return result 
end 

@time const_prod_inv() # 1) time to evaluate function that computes product and inverse of only constants
@time (X'*X)^-1 # 2) time to compute product and inverse
res = const_prod_inv() 
@time res # 3) time to return value that is product and inverse of the constants
# Naively, I expected 1) to take the same time as 3), but it instead takes the same time as 2).

Not really. The return value of the function depends on the value of a global constant, yes, but that value is mutable (all Array values are mutable):

julia> N = 1_000_000
1000000

julia> K = 20
20

julia> const X = hcat([rand(N) for k in 1:K]...);

julia> typeof(X)
Matrix{Float64} (alias for Array{Float64, 2})

julia> ismutable(X)
true

julia> function const_prod_inv() 
         (X' * X)^-1
       end
const_prod_inv (generic function with 1 method)

julia> first(const_prod_inv())
1.1414670499650865e-5

julia> first(const_prod_inv())
1.1414670499650865e-5

julia> using Random

julia> rand!(X);

julia> first(const_prod_inv())
1.1411429377887483e-5

julia> rand!(X);

julia> first(const_prod_inv())
1.1403942788096892e-5

A few hours ago, in another thread on this forum, I, too, realized that I’m not clear on some distinctions regarding constant propagation vs constant folding.

Regarding your wider question, yes, Julia should, in principle, be able to replace the function body with just returning a constant, assuming the conditions that I stated above are satisfied. In particular note that SArrays, mentioned above, are not mutable, unlike Arrays.
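For instance, a quick check (assuming StaticArrays is loaded):

using StaticArrays

s = @SMatrix rand(2, 2)
ismutable(s)          # false: an SMatrix cannot be mutated in place
isbits(s)             # true: "plain data", so it may appear as a type parameter
ismutable(rand(2, 2)) # true: an ordinary Matrix is mutable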

I see. Because vectors are mutable (both their dimensions and their values), declaring them ‘constant’ does not make their internals immutable. Therefore, it is very hard to provide the compiler with the information it needs to infer that constant propagation/folding is feasible.

Below, I test examples using a copied array, a StaticArrays, and a Tuple. None of these appear to induce the compiler to constant fold.

using StaticArrays  

N = 3_000 
X = rand(N)
X_tup = X |> Tuple;  
X_svec = SVector{length(X)}(X)

dot_prod(x) = sum(x .* x)

val = dot_prod(X) 

dot_prod_vec_closure = let x = copy(X); y -> dot_prod(x) end 
dot_prod_tup_closure = let x = X_tup; y -> dot_prod(x) end 
dot_prod_svec_closure = let x = X_svec; y -> dot_prod(x) end 
return_a_constant() = 1

@time dot_prod(X) 
@time dot_prod(X_tup) # really long compilation step (first call) b/c really long type signature??? 
@time dot_prod(X_svec) 
@time return_a_constant() 

@time dot_prod_vec_closure(nothing) # closure provides no improvement b/c, despite the copy step in closure, compiler doesn't know that nothing will change the copied values of the vector
@time dot_prod_tup_closure(nothing) # You would expect constant propagation/folding to work here b/c the tuple is immutable. 
@time dot_prod_svec_closure(nothing) # 
@time return_a_constant() 

Copying the vector does not help the compiler determine that values of an argument can’t change after the closure is constructed.

Static vectors are a bad idea if the vectors you are working with are large – construction takes a long time. They also don’t seem to help with constant folding.

Tuples are also a bad idea if the vectors are large – construction takes a long time. They also don’t seem to help with constant folding. They also seem to incur a lot of overhead during dispatch due to overly long type signatures (?).

It would be nice if there were a way to constrain or assert that something in a let block should be treated as immutable when the let block is evaluated (including the things that it points to), thus giving the compiler sufficient information to constant fold/propagate.

What would be most helpful is an Array type identical to standard Arrays except that it prevents mutation of the array’s size and of the values within it, including the things pointed to by elements of the array (StaticArrays seems to be very different from this). The only intended purpose of this type (call it “DeepImmutableArray”) would be to let users write code (such as let blocks) for which the compiler has adequate information to perform constant folding:

let x = DeepImmutableArray(X)
y -> function_defined_on_X_and_y(x,y)
end 

Thanks again for your help!

1 Like

That’s to be expected given that you’re not applying the suggestions from my initial message above. This example is actually even simpler than my “simple” example, as the function, dot_prod, has just one argument instead of two.

Here you go:

julia> to_type_domain(v) = Val(v)
to_type_domain (generic function with 1 method)

julia> to_value_domain(::Val{v}) where {v} = v
to_value_domain (generic function with 1 method)

julia> function dot_prod(x)
         Base.@assume_effects :terminates_globally
         sum(x .* x)
       end
dot_prod (generic function with 1 method)

julia> dot_prod_closed(x) = let x = to_type_domain(x)
         () -> (dot_prod ∘ to_value_domain)(x)
       end
dot_prod_closed (generic function with 1 method)

julia> new_tuple(::V) where {V<:Val} = ntuple((_ -> rand()), V())
new_tuple (generic function with 1 method)

julia> dot_prod_closed_tuple(n) =
         (dot_prod_closed ∘ new_tuple ∘ to_type_domain)(n)
dot_prod_closed_tuple (generic function with 1 method)

julia> const f1 = @time dot_prod_closed_tuple(100)
  0.367480 seconds (534.99 k allocations: 22.948 MiB, 97.00% compilation time)
#1 (generic function with 1 method)

julia> @time f1()
  0.000001 seconds
34.52794800959923

julia> @code_typed f1()
CodeInfo(
1 ─     return 34.52794800959923
) => Float64

julia> const f2 = @time dot_prod_closed_tuple(1000)
  6.352952 seconds (5.12 M allocations: 213.805 MiB, 3.02% gc time, 100.00% compilation time)
#1 (generic function with 1 method)

julia> @time f2()
  0.000000 seconds
339.05977432638355

julia> @code_typed f2()
CodeInfo(
1 ─     return 339.05977432638355
) => Float64

I stopped at length 1000 because 3000 was taking too long to compile.

You need to make these globals const at least.

const dot_prod_vec_closure = let x = copy(X); y -> dot_prod(x) end 
const dot_prod_tup_closure = let x = X_tup; y -> dot_prod(x) end 
const dot_prod_svec_closure = let x = X_svec; y -> dot_prod(x) end 

They’re just calling the closures, so the fact that the closures are untyped globals doesn’t matter much.

Thanks again! Yes, mapping variables onto the type domain allows the let block to compile out certain computations, as you pointed out, but that approach doesn’t work well when the arrays are large, because the compilation step takes a very long time. Hence, I’ve been experimenting to see if there is another way.

Some interesting results:

dot_prod_mult(x, y) = sum(x .* x) * y  # assumed definition, implied by the discussion below

const dot_prod_mult_closure_regular_vec = let x = X
    y -> dot_prod_mult(x, y) 
end 

@time dot_prod_mult(X, X) # 0.1) base case, ought to be slowest. has 4 allocations
@time dot_prod_mult(X, 1.0) # 0.2) base case, ought to also be slow. has 3 allocations
@time dot_prod_mult_closure_regular_vec(0.1) # 1) This function call is treated as scalar times scalar, so the let block actually did compile down the initial sum of dot product!!!!!  No allocations.  
@time dot_prod_mult_closure_regular_vec(X); # 2) this has 4 allocations... 
@time dot_prod_mult_closure_regular_vec(1.0) * X;  # 3) this is slightly faster than 2) and has only 2 allocations, indicating that Julia was able to compile a method of dot_prod_mult_closure_regular_vec that precomputed sum(x .* x) when multiplying by a Float64, but not when multiplying by a Vector{Float64}. This is puzzling to me.

So, the const let block compiled down a version of dot_prod_mult with sum(x .* x) precomputed – but only for the method in which y is a scalar. It does not compile down a version of dot_prod_mult with sum(x .* x) precomputed when y is a Vector{T}.
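One way to check this directly, rather than inferring it from timings, is to inspect the typed code, as in the earlier examples:

# If sum(x .* x) really was folded for the scalar method, it should show up as a
# literal constant in the typed code; for the Vector method it apparently does not.
@code_typed dot_prod_mult_closure_regular_vec(0.1)
@code_typed dot_prod_mult_closure_regular_vec(X)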

I experimented with

  • the annotation Base.@assume_effects :foldable
  • adding and removing const from the closure definition
  • using PersistentVectors from the FunctionalCollections.jl package (to see if they provide enough information for the compiler to constant fold).

None of the above seemed to help. Strangely, Base.@assume_effects did not seem to do anything (possibly because I am not on Julia v1.10-rc1, which I believe you are). The const declaration is necessary to compile down the faster version, which is unfortunate because I cannot use a const declaration within a function body.

Broader context for why I am so interested in this:

  • My work often involves optimizing objective functions that are products of many matrices. Most of these matrices are large and constant (with respect to the parameters I am optimizing over). If I could use a closure to precompute the products of constant matrices, this would allow significant performance improvements with minimal code overhead. Of course, I could decompose the objective function into parts that are constant with respect to parameters and parts that are not, such as
let constant_matrix = compute_constant_part(X)
   constant_matrix' * g(x, b) * constant_matrix 
end 

but doing so hurts code organization. In particular, I might be taking derivatives of the objective function with respect to different arguments in different parts of the analysis. Should I reimplement the objective function in multiple different ways? What if I define it inconsistently? That increases the risk of hard-to-detect semantic errors. Alternatively, I could create a new data type called “ObjectiveFunction” and create functors that allow parts to be pre-computed. This is a lot of work and sounds like it would restrict flexibility and reduce the transparency of the code… Wouldn’t it be nice if I could write:

closed_obj = let x = X, y = Y, z = Z 
     params -> nasty_objective_function(x, y, z, params) 
end 
# minimize closed_obj w.r.t. params.  

precomputing terms within nasty_objective_function so that the optimization loop does not need to recompute these on every iteration?

  • I also simulate data at multiple different levels of observation. Some random variables are functions of other random variables at different levels of observation. I map data from one unit of observation to another by chaining integer arrays that reference the rows on which the data live. This is pretty fast, but some relationships involve nested getindex calls several layers deep (a sketch follows below). Closures could help to further boost performance. Again, I could get around this by precomputing certain steps and storing them to be called later. But doing so increases code complexity and the risk of semantic errors, particularly when I then need to track state and might need to flag precomputed values for recomputation.
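For the index-chaining case, the same manual trick applies in principle: the composition of integer index vectors can itself be precomputed once. A sketch with hypothetical names and sizes:

# Hypothetical index vectors mapping observation levels A -> B and B -> C:
iA_of_B = rand(1:1_000, 5_000)    # for each B-row, the A-row it comes from
iB_of_C = rand(1:5_000, 20_000)   # for each C-row, the B-row it comes from

iA_of_C = iA_of_B[iB_of_C]        # composed map, computed once

xA = rand(1_000)
xC = xA[iA_of_C]                  # one getindex instead of a nested chain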

It seems to me that, in principle, it should be possible for some future version of Julia to compile specialized methods for closures, which precompute certain values. The example above,

dot_prod_mult_closure_regular_vec(0.1)

actually did so!

It would be great if there were more options to force the compiler to treat arrays as constants within a let block for certain functions. That’s risky, because, as you point out, arrays are not constant – although in particular applications it may be possible, it increases the risk that sloppy coding results in correctness issues. Alternatively, if the language provided base support for something like PersistentArrays (note: FunctionalCollections.jl has PersistentVectors but no general PersistentArrays), then we would actually have something that, within the let block, is a constant and that the compiler can therefore reason about. E.g.:

closed_obj = let x = PersistentArray(X), y=PersistentArray(Y), z=PersistentArray(Z)
    params -> nasty_objective_function(x, y, z, params) 
end 

It’s possible, but wouldn’t be desirable.

It’s possible, but wouldn’t be desirable.

I’ve exhausted you with too many questions over too many days.

Thanks again for the significant commentary you’ve provided!

1 Like

I think your objective is to get this done automatically and implicitly, but you could just create a functor explicitly as follows.

julia> struct DotProdFunctor
           dot_prod::Float64
           function DotProdFunctor(x)
               new(sum(x .* x))
           end
       end

julia> (dpf::DotProdFunctor)(y) = dpf.dot_prod

julia> N = 3000
3000

julia> dpf1 = DotProdFunctor(rand(N))
DotProdFunctor(991.0494328007666)

julia> dpf1(500)
991.0494328007666

julia> @time dpf2 = DotProdFunctor(rand(N))
  0.000027 seconds (6 allocations: 47.047 KiB)
DotProdFunctor(1029.2723123825308)

julia> @time dpf2(600)
  0.000008 seconds (1 allocation: 16 bytes)
1029.2723123825308

Yes, on both counts. Maybe it would be interesting to use metaprogramming like:

abstract type PartialFunctionEvaluationFunctor end 

@partial_functor function dot_prod_sum(x, y)
    sum(x .* x) * y
end

where @partial_functor writes a struct for each possible combination of closed-over arguments of the function, and each struct has a functor for evaluating the remaining arguments.

That would be pretty cool, but it is not really what I want. I want to be able to dynamically create these closures/‘functors’ inside other functions (the metaprogramming approach wouldn’t work here because structs can’t be defined inside functions), and I want to be able to apply these closures/‘functors’ to functions written in other packages.

What would be great is if it were possible to write (in pseudocode):

x_datas = [x1, x2, ...]
y_data = ...
z_data = ...
x_datas .|> let y = y_data, z = z_data; x -> somebodys_function_where_y_and_z_computation_is_long(x, y, z); end

where the closure constant folds into the appropriate method definitions (maybe compiled on the first call of the anonymous function, for the given type of the x argument it is called with).

You know what, maybe a general solution could be produced using metaprogramming. Sketch:

# metaprogramming let block
f = @let x=X
   y -> my_func(x, y) 
end


# generated function
function my_func_closure(y)
    gen_method!(my_func_closure, typeof(y))
    return my_func_closure(y) # does not cause an infinite loop because this method definition is untyped, while gen_method! creates a method for the specific type of y. Maybe this would cause world-age problems, but those could probably be worked around.
end

In turn, gen_method! parses the method definition for my_func(::typeof(X), ::typeof(y)), replacing the calls that depend only on x with what they evaluate to when called on X. Maybe it would be necessary to recursively apply @let to the functions in the method of my_func that operate on y and the other arguments… not certain.

Aside: No, pinv is considerably slower than this, but it is more accurate. See Efficient way of doing linear regression - #33 by stevengj

I think you’re making life complicated for yourself. Just compute the pseudo-inverse (or whatever) once and store it in a variable to re-use it. The code will be a lot clearer and simpler than trying for some magic syntax.

Expecting the compiler to figure out that a complicated calculation like this can be re-used is unrealistic and likely to be fragile even if it worked in some cases — to do an optimization like this, the compiler needs to be able to prove that it can never change any results, which is more delicate than you seem to realize. This is true in any language, not just Julia.
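Concretely, that is just the manual pattern already shown earlier in the thread – a sketch reusing the MWE’s my_pinv, X, and Ys (Xpinv and results are illustrative names):

Xpinv = my_pinv(X)                  # the expensive, X-only step, computed once
results = [Xpinv * y for y in Ys]   # the cheap per-outcome step, inside the loop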

3 Likes