Compute functions on a grid of parameters

Hi! The problem sounds very simple on the surface: compute something like f1(f2(f3(a3), a2), a1) on a grid of values for a1, a2, a2, and save results into array of corresponding shape. Of course, it can be done with ((a1, a2, a3) -> f1(f2(f3(a3), a2), a1)).(product(1:n1, 1:n2, 1:n3), but this is really inefficient because it recomputes f3 and f2 for the same parameter values over and over. A for-loop is efficient is this sense:

for a3 in 1:n3
  f3_val = f3(a3)
  for a2 in 1:n2
    f2_val = f2(f3_val, a2)
    for a1 in 1:n1
      val = f1(f2_val, a1)
    end 
  end 
end

however the array creation and keeping track of its indices needs to be done manually. Also this is definitely not scalable with respect to adding new parameters or changing them, and parallelization using pmap does not fit well into the for-loop code (but it is easily aplicable to product example above). Loop comprehension has the same issue with parallelization.

Another drawback in this solution is that when f3 is type-unstable, types in inner loops are not inferred as well, and performance degrades a lot. Function barrier would solve this, but it increases and complicates the code even more.

Any suggestions on how to implement this in an easy to use and efficient way? The ideal way I can imagine this would be something along these lines:

@pipeline begin
  @for a3 in 1:n3, parallel=true
  f3_val = f3(a3)
  @for a2 in 1:n2
  f2_val = f2(f3_val, a2)
  @for a1 in 1:n1
  @collect f1(f2_val, a1)
end

which returns the final array of results (or even better, AxisArray with correctly named and valued axes). But not sure if this is possible or feasible.

have you considered https://github.com/JuliaCollections/Memoize.jl ?
I have not used it personally, but I think it might be helpful in such a case.

I considered memoization as well, but a huge drawback is that it implies keeping a lot of function results in memory, even when they are no longer needed. E.g. when f3 in my example generates a large array or other structure.

At some point I tried to tackle separable functions in a generic way, e.g. f(x,y,z) ~ g(x,y)h(z). In this case summing over a grid can be simplified.

One solution that worked well (i.e. zero overhead) is to use generated functions alongside a parametric type that describe the structure of the separable function into its parameters, such at it can be known at compile time. Then I was building up the correct number of loops and loop nesting in the generated function.

Your case seems similar, but I don’t know if it’s easy to encode the nested structure of your functions into parameters (can you enumerate all the possible combinations ?).

Yes, I certainly can enumerate all the combinations: each parameter is independent, so it’s just the cartesian product. But I don’t understand what you mean by “encode the nested structure of your functions into parameters”, could you elaborate a bit?

By combinations I mean combinations of nested functions. For example for separable functions I was enumerating them like this:

N=1:  f(x)
N=2:  f(x,y), f(x)g(y)
N=3:  f(x,y,z), f(x,y)g(z), f(x)g(y,z), f(x,z)g(y), f(x)g(y)h(z)
...

So if you ask me what’s the 3rd function with N=3, I can tell you it’s f(x)g(y,z), and that would be captured in a parametric type SeparableFunction{N,K} where N and K would be equal to 3 in this example. This means you can know at compile time the structure of the function and generate the right loops for it.

That said this approach is probably a bit overcomplicated, I did it more because I liked the idea and wanted to do some metaprogramming.

You could probably use Base.Cartesian, but it takes some getting used to

r = 1:10
val3 = f3.(r)
val2 = [f2(x, y) for x in val3, y in r]
val3 = [f1(x, y) for x in val2, y in r]

Note that after the first, each is exactly the same code with new constants. Obviously, if you passed a tuple of functions and ranges (I used the same r in each case, but it generalizes straightforwardly) you could automate this.

Lots of nice tricks like this possible in Julia. See Multidimensional algorithms and iteration

3 Likes

But it doesn’t help with the issue I mentioned above, right?

I considered memoization as well, but a huge drawback is that it implies keeping a lot of function results in memory, even when they are no longer needed. E.g. when f3 in my example generates a large array or other structure.

Oh, this indeed sounds complicated, I still don’t understand it completely :slight_smile: So your type doesn’t capture the specific functions, only the structure? How is it possible to generate efficient type-stable code then?

The type also holds the function themselves yes (f,g,h), because you need to call them at some point. It’s just that if you want to change to order/number of loops at compile time, you need to know how many function you have and what order they come in, so you have to encode that in the type somehow.

Array{T,N} work like this, you encode the dimension in the type. So for example when you have a vector (N=1) you can generate only one loop, when you have a matrix two, etc. Here’s a silly example that does this (add @generated in front to turn it into a generated function).

function mysum(x::Array{T,N}) where {T,N}
    if N == 1
        ex = quote
            for i=1:length(x)
                s += x[i]
            end
        end
    end
    if N == 2
        ex = quote
            for i=1:size(x,1), j=size(x,2)
                s += x[i,j]
            end
        end
    end
    quote
        s = zero($T)
        $ex
        return s
    end
end

The only difference in my separable function example is that you have two numbers (N and K) that defines the structure of the function instead of only one for Arrays, and the loop logic gets more complicated (sometimes loops are nested, sometimes not).

And of course you would write a generic function that works for arbitrary N instead of hardcoding two cases like I did here (that where stuff like Base.Cartesian comes in).

The following does not work yet, but shows you Roughly what Base.Cartesian will get you

using MacroTools
using Base.Cartesian: @nexprs, @nloops
# @generated function many(n::NTuple{N,T}) where {N,T} # Put in a generated function like this to have it compile automatically
ex1 = @macroexpand @nextract 5 f input
ex2 = @macroexpand @nloops 5 i n->1:r_n d -> fval_d = f_d(a_{d+1}, a_d) begin
    println(i_1,i_2,i_3)
end

prettify(Expr(:block, ex1,ex2)) |> display

quote
    f_1 = input[1]
    f_2 = input[2]
    f_3 = input[3]
    f_4 = input[4]
    f_5 = input[5]
    for i_5 = 1:r_5
        fval_5 = f_5(a_6, a_5)
        for i_4 = 1:r_4
            fval_4 = f_4(a_5, a_4)
            for i_3 = 1:r_3
                fval_3 = f_3(a_4, a_3)
                for i_2 = 1:r_2
                    fval_2 = f_2(a_3, a_2)
                    for i_1 = 1:r_1
                        fval_1 = f_1(a_2, a_1)
                        println(i_1, i_2, i_3)
                        nothing
                    end
                    nothing
                end
                nothing
            end
            nothing
        end
        nothing
    end
end
1 Like

This is how you could put that into a generated function and use it

@generated function many(input, ranges::NTuple{N,T} ) where {N,T}
    quote
        @nextract $N r ranges
        @nextract $N f input
        $(Expr(:(=), Symbol(:fval_,N+1) , 3))
        @nloops $N i n->1:r_n d -> fval_d = f_d(fval_{d+1}, i_d) begin
            @nexprs $N k->(println(fval_k))
        end
    end
end
in = (+,-,atan,/) # Some two-argument functions
many(in, (3,3,3,3))

Note that this does not actually calculate anything interesting at the moment, just a demo.

Thanks, this is quite promising!
Do you think it is possible to include pmap-like parallelization at the level (loop) specified by the user into the Cartesian-based code?

I’m not sure, I can’t see a straight forward way of doing it without manually building the expressions for the loop. Possibly with two separate calls to @nloops with a pmap in between.

You can set val3 = 0 after you use it, and it will be garbage-collected. But at the top you stated that you’re going to be storing the final results in an array. If so, then from a memory perspective the intermediate results are probably irrelevant anyway: if r has length 10 then each new “level” will be 10x larger than the previous. Holding all intermediate results represents a 10/9-fold (approx 11%) increase in resource utilization, which given that RAM is cheap is not worth worrying about. If r is longer, it’s even less worth worrying about (e.g., r of length 100 is basically a 1% effect); if r is short it still seems like not a big deal (e.g., 3 corresponds to a 50% increase). Thus in no circumstance is this a “huge drawback.”

Whether this trivial approach is preferred over a more laborious one depends on how performance-sensitive you are. But make sure you’re thinking carefully about what actually matters here.

1 Like

I cannot agree with this, as the intermediate results can be(and in my case are) much larger by themselves. I.e. the second-level function generates an array, and the bottom-level one outputs a single number, which is the final result put into an array. All intermediate results won’t even fit into memory in my case.