Speeding up combination of Funs

I’m building up an ApproxFun.Fun by successively multiplying many individual factors along a graph; the individual factors are themselves Funs. The situation is similar to this pseudocode:

factors = Fun[]
for node in nodes
  factor::Fun = make_factor(parameters(node))
  push!(factors, factor)
end
result::Fun = prod(factors)

I would like to speed this up as much as possible, possibly by reducing allocation.

I found out that make_factor allocates in the MB range and takes tens of ms per call. The returned factor has a modest number of coefficients (below 100) and is just a few KB. make_factor involves arithmetic with Funs, including:

  • exp(make_fun(t2) - make_fun(t1)), where make_fun constructs a Fun from the float parameters t1, t2, and
  • Fun(space) / (Fun(space) + make_fun(t3))
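
To make that concrete, the shape of it is roughly this (a simplified stand-in, not my real code; make_fun and the parameters t1, t2, t3 are placeholders):

using ApproxFun

const S = Chebyshev(0..1)
# placeholder: build a Fun from a single float parameter
make_fun(t) = Fun(q -> exp(t * q), S)

function make_factor(t1, t2, t3)
  x = Fun(S)                           # the identity Fun on the space
  a = exp(make_fun(t2) - make_fun(t1))
  b = x / (x + make_fun(t3))
  return a * b                         # some combination of such terms
end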

Basically, I’m wondering if I can fuse arithmetic operations on Fun to avoid constructing intermediate expansions, or use inplace transformation of expansions to make this faster. Also, is there a clever way to construct the product in the end?

The space in question is Chebyshev(0..1).

Your “example code” is not very clear, but I will say that ApproxFun.jl is not ideal for high-performance manipulations.

There was some intention of making the newer package ClassicalOrthogonalPolynomials.jl better equipped for high-performance, allocation-free computing, but I don’t think it’s there yet.

I suspect you are best off working directly with FastTransforms.jl to do the transforms between Chebyshev points and coefficients manually. (These are exposed in ApproxFun by plan_transform! so that might also help.)
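
Something along these lines (a rough sketch; by default these routines assume values sampled at Chebyshev points of the first kind):

using FastTransforms

n = 64
vals = randn(n)                       # stand-in for values at n Chebyshev points
coeffs = chebyshevtransform(vals)     # values -> Chebyshev coefficients
vals2  = ichebyshevtransform(coeffs)  # coefficients -> values
# for repeated transforms of the same length, build a plan once and reuse it
P = plan_chebyshevtransform(vals)
coeffs = P * vals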

Sorry for not being more clear; I just wanted to convey that there are algebraic operations mapped over many Fun objects and then reduced to yield one final Fun.

I will look into FastTransforms.jl; I did notice that most of the time is spent in transform, so that sounds promising.

I have looked at plan_transform and friends in ApproxFunBase/hBlrP/src/Space.jl. There is a trait supportsinplacetransform, but I could not wrap my head around the code there. Is there a good place to get started? From the FAQ of ApproxFun.jl I know how to call transform(space, vals) to compute coefficients, but not how to make that allocate less.

I had a bit of success using transform! in ApproxFun.jl to do in-place updates of function coefficients.

However, I haven’t figured out how exactly I would use FastTransforms.jl to multiply two expansions represented as vectors of coefficients. The API appears to be about transforming coefficients between different polynomial bases, not about going from coefficients to values on a grid of points. The latter is what I would need, though, correct? Something like to_values(coeffs_1, space) .* to_values(coeffs_2, space) |> to_coefficients.
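
Is it essentially this? (Untested sketch in ApproxFun notation; padto is just a zero-padding helper I made up.)

using ApproxFun

S = Chebyshev(0..1)
f = Fun(q -> exp(q), S)
g = Fun(q -> 1 / (1 + q), S)                # stand-ins for two factors

padto(c, n) = [c; zeros(n - length(c))]     # made-up helper: zero-pad coefficients to length n
n = ncoefficients(f) + ncoefficients(g) - 1 # enough points to resolve the product
vf = itransform(S, padto(coefficients(f), n))  # coefficients -> values on an n-point grid
vg = itransform(S, padto(coefficients(g), n))
h = Fun(S, transform(S, vf .* vg))          # pointwise product, back to coefficients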

It may be faster to combine them lazily, i.e. rather than constructing a new Fun for each intermediate operation, construct only the final Fun, all at once, by evaluating the graph pointwise at each Chebyshev point that it requests.
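
For example, something like this (a toy sketch, where node_funcs stands in for your per-node factors):

using ApproxFun

S = Chebyshev(0..1)
# stand-ins for the per-node factors
node_funcs = [q -> exp(0.3k * q) / (1 + k * q) for k in 1:10]

# one adaptive Fun, built from the lazy pointwise product;
# no intermediate Fun is ever constructed
result = Fun(q -> prod(f(q) for f in node_funcs), S)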

Yes, that’s what I’ve been moving towards: keeping vectors of function values at grid points, mapping and reducing those, as in-place as feasible, and only at the end constructing a vector of coefficients, from which it’s cheap to get a Fun. It’s not always easy for me to locate the functionality needed for that, though, either in ApproxFun.jl or in FastTransforms.jl.

If you fix the grid points, you might as well use something like FastChebInterp.jl … the advantage of ApproxFun is that it adaptively figures out the number of grid points (the number of Chebyshev polynomials) needed to reach the desired accuracy.
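
For a fixed degree, that looks something like (sketch):

using FastChebInterp

N = 50                               # fixed order; pick it for your accuracy needs
x = chebpoints(N, 0, 1)              # Chebyshev points on [0, 1]
vals = map(q -> sin(3q) + q^2, x)    # stand-in for your pointwise-combined factors
p = chebinterp(vals, 0, 1)           # Chebyshev interpolant of those values
p(0.3)                               # evaluate it like a function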

That’s why I was recommending doing things lazily — treat the graph as a lazy representation of a function that you can evaluate at will. Then finally when you create your Fun object, evaluate your graph at the points that it requests.

(It also depends on what you want to do with the resulting Fun object.)

I’m beginning to get it now. Indeed, I’m now fixing the degree beforehand, forgoing the adaptive approximation, which does worry me a little. I didn’t know about FastChebInterp.jl, which might be better suited for that.

The lazy pulling in of sub-calculations when constructing the final Fun sounds like an elegant way to keep the adaptivity. In my actual problem, the terms associated with the nodes of the graph are solutions to an ODE. The way I’ve set it up currently, the ODEs are already defined as ODE systems for sets of time-dependent Chebyshev coefficients, with predetermined degree. This works nicely and is quite fast. Redoing this lazily would require me to rethink that whole setup backwards. I think I would end up with scalar ODEs depending parametrically on the coordinate q of the Chebyshev domain, which would need to be solved on demand. I’m not saying it can’t be done, but it does sound tricky and/or hard to get performant.

Ultimately, I actually need to integrate the resulting Fun object over q, decorated with some q-dependent weight. My original implementation using Integrals.jl for direct numerical integration of integrands that were themselves ODE solutions was slow, which got me started on ApproxFun.jl in the first place.

Integrating a Fun object should be equivalent to Clenshaw–Curtis quadrature (and very similar to Gaussian quadrature). I don’t see why it would be more efficient to construct a Fun than to directly integrate with an appropriate algorithm. (Or do you mean you want the integral as a function of the endpoint, via cumsum in ApproxFun, rather than a single number?)
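
For a single number, the two amount to essentially the same thing, e.g. this sketch:

using ApproxFun, QuadGK

f(q) = exp(-q) * sin(5q)

sum(Fun(f, Chebyshev(0..1)))   # integral over [0, 1] via the adaptive Chebyshev expansion
quadgk(f, 0, 1)[1]             # the same number via adaptive Gauss–Kronrod quadrature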

Fair point. I don’t need cumsum.

I don’t quite know why my current implementation with fixed expansion order is at least 100x faster than the naive quadrature one I had before.

I do remember issues with a lot of allocations due to instantiating/remaking ODEProblems for different parameters q.

Perhaps it has to do with the fact that I was not parallelizing the ODE solutions over the quadrature grid points along q? I actually wouldn’t know how to achieve that. It would require the integration routine to output a set of grid points for parallel evaluation (via ODE), rather than just calling the integrand at individual q values. Can that be done?
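
If I understand the Integrals.jl docs correctly, a BatchIntegralFunction might be the mechanism for that; something like this completely untested sketch, where my_integrand stands for my weighted, ODE-based integrand:

using Integrals

# the batched form receives a whole vector of q values at once, so the ODE solves
# for those q could be done together (threaded, or as an EnsembleProblem)
batched(qs, p) = map(q -> my_integrand(q), qs)   # my_integrand is a placeholder
iprob = IntegralProblem(BatchIntegralFunction(batched), (0.0, 1.0))
solve(iprob, QuadGKJL(); reltol = 1e-6)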

For an ODE solution, you can take the analytical integral of the polynomial and just sum it. What are you trying to do? Maybe an MWE helps.

Yes.

As Chris says, something sounds strange with your problem, and an MWE would help.

Sorry again, there is some degree of misunderstanding. Integrating the polynomial expansion is no problem; this is what sum from ApproxFun does, and that works for me. Is this what @ChrisRackauckas meant?

What didn’t work well initially was directly numerically integrating the ODE solution over its parameter (not independent variable) q without polynomial expansion. That’s what I was referring to in the last paragraph above, in response to your point that an expansion should not be expected to help with efficiency.

I’ll report back when I can come up with an MWE.

No. Did you make a new dispatch in the interpolation and sum the k’s?

No, I did not.
AFAIU, the Hermite interpolation in SciML here interpolates along the independent variable of the ODE (in my case, time t).
I was talking about a Chebyshev expansion in terms of a parameter of the problem (called q).

Anyway, I’ve now finally come up with an MWE. The direct method proposed by @stevengj, ODE integration driven by numerical quadrature, would look like this:

using OrdinaryDiffEq
using Integrals
using BenchmarkTools

# Goal: integrate a function against a weight function with a peak.
wt(q) = sin(π*q)^151 + 0.1
# The function f to be integrated is an ODE solution, evaluated at a set of time
# points collected from some graph upstream.
const tpoints = rand(Float64, 24) * 5.
# (In real life, the peak occurs in f, not in wt.)
 
# right-hand side of the ODE
rhs(u) = - u^2 + 0.8 * u 
rhs_alternate(u) = exp(-0.1 - cos(u)^2)

# Approach using direct ODE solution driven by integration
function direct(rhs)
  # ODE initial condition depends on q
  prob1 = ODEProblem((u, _, _) -> rhs(u), 1., (0., 5.))
  prob(q) = remake(prob1, u0=q) # trying to save on problem construction
  sol(q) = solve(prob(q), Tsit5())
  f(q) = sum(sol(q).(tpoints))
  # now evaluate the integral
  iprob = IntegralProblem((q, _) -> wt(q)*f(q), (0.,1.))
  function ival(;reltol=nothing) 
    solve(iprob, QuadGKJL(); reltol)
  end
  return ival
end
# and benchmark
a = direct(rhs);
a(;reltol=1e-9); a(;reltol=1e-3);
@bprofile a(;reltol=1e-9) # this gives around 7ms and 8.5MB allocated on my computer
@bprofile a(;reltol=1e-3) # this gives around 300μs and 500KB 
aa = direct(rhs_alternate);
@bprofile aa(;reltol=1e-3) # this gives around 600μs and 500KB 

From this I’ve learned that the time depends strongly on the desired precision and less strongly on the rhs of the ODE. So it can indeed be quite fast for modest precision.

My current implementation is less nice and compact:

(code continued from above):

using ApproxFun
using ApproxFunBase: ClenshawPlan, chebyshev_clenshaw
# Alternative approach based on expansion
function expanded(rhs)
  qspace = Chebyshev(0.0..1.0)
  qfun = Fun(qspace)
  # k will be the fixed interpolation order
  qpts(k) = collect(points(qspace, k))
  # machinery for expansion and evaluation
  can_qpts(k) = tocanonical.(qspace, qpts(k))
  plan(k) = ClenshawPlan(Float64, qspace, k, k)
  # ODE in coefficient space
  Dop(g::Fun) = rhs(g)
  function rhs!(dcoeff, coeff::AbstractVector, _, _)
    ufun = Fun(qspace, coeff)
    dcoeff .= @view coefficients(Dop(ufun))[1:length(coeff)]
  end
  # initial condition
  function ginit(k::Integer) 
    cid = coefficients(qfun)
    g0 = zeros(k)
    g0[1:length(cid)] .= cid
    g0
  end
  # a coupled ODE system of size k
  aprob(k) = ODEProblem(rhs!, ginit(k), (0., 5.))
  asol(k) = solve(aprob(k), Tsit5())
  # collect terms over the time point list. This is where combination of
  # expansions comes into play. We now do it in value space
  function afpts(k)
    s = asol(k)  # one-time ODE solution for the coefficients
    cp = can_qpts(k)
    pl = plan(k)
    workspace = zeros(k)
    for t in tpoints
      workspace .+= chebyshev_clenshaw(s(t), cp, pl)
    end
    return workspace
    # sum(evaluate.((s(t),), qspace, p) for t in tpoints)
  end
  # expand also the weight, truncated to k, evaluated on the points
  wtpts(k) = let
    w = Fun(wt, qspace)
    c = coefficients(w)[1:k]
    return evaluate.((c,), qspace, qpts(k))
  end
  # and solve the integral
  function aival(k)
    vals = afpts(k) .* wtpts(k)
    return sum(Fun(qspace, transform(qspace, vals)))
  end
  return aival
end
b = expanded(rhs); b(20); b(50);
@bprofile b(20) # around 1ms and 770KB allocated 
@bprofile b(50) # around 5ms and 3.2MB
ba = expanded(rhs_alternate); ba(20);
@bprofile ba(20) # around 3s and 266MB allocated (!)

I found it interesting that it can be faster than the direct version for a simple polynomial form of the ODE, but it really crawls for an exponential form. My guess is that the polynomial rhs is a narrowly banded operator in coefficient space? In my real problem, I in fact have such a simple rhs.
Also, I don’t quite know how to make a fair comparison in terms of precision achieved, as I don’t know the relation between expansion order k and reltol.

In my previous attempts, the direct approach was a lot slower than the expanded one; this may have been due to just requesting the default high precision, and having a simple ODE that favors expansion.

Any comments are very welcome! The first version is certainly much nicer to read. Is there something more I can do to speed it up?

Remember that the accuracy with which you solve the ODE and the accuracy with which you do the integration must be connected (the latter should generally be lower, i.e. a larger tolerance). For example, if you solve the ODE to 7 digits but try to perform the quadrature to 8 digits, then quadgk is going to take forever trying to resolve the “noise” that comes from the errors in the ODE solver.

When you are nesting adaptive solvers, you need to be especially careful about the tolerances, and some empirical tweaking is usually required. Using the default tolerances for both ODEProblem and IntegralProblem here is especially risky.
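
Concretely, with the names from your MWE, I mean something like this (just a sketch; the particular numbers are only illustrative):

ode_reltol = 1e-8
sol(q) = solve(prob(q), Tsit5(); reltol = ode_reltol, abstol = 1e-10)
# keep the quadrature tolerance comfortably above the error of the inner ODE solves
ival = solve(iprob, QuadGKJL(); reltol = 1000 * ode_reltol)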

(Indeed, I would say that you should never blindly use the default tolerances in serious work for any kind of adaptive solvers, whether it be integration, ODEs, optimization, root-finding, etc. You really should think about what kind of accuracy you need and what is practically attainable.)