Using Interpolations

Here is code for Julia and Python that uses the basic approach I’m trying. I’m attempting to use equivalent algorithms which is Levenberg-Marquardt for the curve fitting and linear interpolation for the input fits. Here is the code and results. As you can see, Python executes the curve fit in about 1.4 ms and Julia in about 30.8 ms. I’m quite new at Julia so please point out bad practices even if it doesn’t impact performance. Everything was run in JupyterLab with Python 3.7 and Julia 1.1

Julia

using LsqFit, Interpolations, BenchmarkTools
# Create sample data
x1 = range(0.0, step=0.0099, stop=6.);
y1 = 0.7*sin.(x1) .+ 0.1*rand(length(x1));
x2 = range(0.0, step=0.001, stop=6.);
y2 = 0.7*cos.(x2) .+ 0.07*rand(length(x2));
x3 = range(0.1, step=0.0015, stop=5.9);
y3 = 0.5*sin.(x3) .+ 0.3*cos.(x3) + 0.1*rand(length(x3));

# Create fits
fit1 = interpolate((x1,), y1, Gridded(Linear()));
fit2 = interpolate((x2,), y2, Gridded(Linear()));

# Create Model
model(x,p) = p[1]*fit1(x) .+ p[2]*fit2(x)

# Do curve fit
@btime rst = curve_fit(model, x3, y3, [1.0, 1.0]);
rst.param

30.766 ms (2128 allocations: 18.46 MiB)
2-element Array{Float64,1}:
 0.726988264770365  
 0.43123727421300484

Python

import numpy as np
from scipy.optimize import leastsq # Uses Levenberg Marquardt

# Create sample data
x1 = np.arange(0.0, 6.0, 0.0099)
y1 = 0.7*np.sin(x1) + 0.1*np.random.randn(len(x1))
x2 = np.arange(0.0, 6.0, 0.001)
y2 = 0.7*np.cos(x2) + 0.07*np.random.randn(len(x2))
x3 = np.arange(0.1, 5.9, 0.0015)
y3 = 0.5*np.sin(x3) + 0.3*np.cos(x3) + 0.1*np.random.randn(len(x3))

# Create fits
def fit1(x):
    return np.interp(x,x1,y1)
def fit2(x):
    return np.interp(x,x2,y2)

# Create Model
def model(x, p):
        return p[0]*fit1(x) + p[1]*fit2(x)
def residuals(p, y, x):
        err = y-model(x,p)
        return err
    
# Do curve fit
rst = leastsq(residuals, [1.0, 1.0], args=(y3, x3), maxfev=2000)
print(rst)

%timeit leastsq(residuals, [1.0, 1.0], args=(y3, x3), maxfev=2000)

(array([0.70528424, 0.41814   ]), 3)
1.38 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

On this computer, running your Julia code…

julia> rst = @btime curve_fit($model, $x3, $y3, [1.0, 1.0]);
  22.094 ms (1837 allocations: 18.45 MiB)

julia> rst.param
2-element Array{Float64,1}:
 0.7124976352396952
 0.4280267262041546

Hiding first example to emphasize second:

Summary

Making a few changes:

julia> using StaticArrays, StaticOptim

julia> # ] add https://github.com/aaowens/StaticOptim.jl#master
       
       function scurvefit(fit1, fit2, x, y, params)
           fit1s = fit1.(x)
           fit2s = fit2.(x)
           soptimize(params, bto = StaticOptim.Order3()) do p
               obj = 0.0
               @inbounds @simd for i ∈ eachindex(x, y)
                   obj += (p[1]*fit1s[i] .+ p[2]*fit2s[i] - y[i])^2
               end
               obj
           end
       end
scurvefit (generic function with 2 methods)

julia> sres = @btime scurvefit($fit1, $fit2, $x3, $y3, @SVector [1.0, 1.0]);
  430.781 μs (7 allocations: 60.89 KiB)

julia> sres.minimizer

2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.7124976352397004 
 0.42802672620415283

That is about 50x faster than the original.

Least squares via Cholesky decomposition:

"""
Given 2 fits, it calculates a design matrix "X", and then solves the normal equations
X*beta = y
X'X*beta = X'y
beta = (X'X)^{-1}X'y
returning beta.
"""
julia> function curvefit2(fit1, fit2, x, y)
           fit1s = fit1.(x)
           fit2s = fit2.(x)
           # Let X'X = S
           # S is 2x2 and symmetric
           # Therefore, we only need to calculate 3 numbers.
           # BLAS is slow for such skinny matrices, so I use a for loop.
           S11, S12, S22 = 0.0, 0.0, 0.0
           # S11 means S[1,1], S12 = S[1,2], etc.
           # additionally, we need X'y
           # this is two numbers. Ditto on BLAS being slow, so we calculate it in the same loop.
           xty1 = 0.0
           xty2 = 0.0
           # @fastmath lets the compiler use fused multiply add instructions
           # @inbounds eliminates bounds checks. Removing these branches lets the compiler vectorize
           # the loop. That is, it can use simd instructions.
           # @simd also hints at the compiler to use these instructions.
           # `SIMD` stands for "Single Instruction Multiple Data".
           # Per CPU instruction, a modern computer can do 4-8 multiplications and additions
           # ie, up to 16 floating point operations with a single instruction (8 multiplications + 8 additions).
           # These macros let the compiler do that.
           @fastmath @inbounds @simd for i in eachindex(x)
               S11 += fit1s[i]*fit1s[i]
               S12 += fit1s[i]*fit2s[i]
               S22 += fit2s[i]*fit2s[i]
               xty1 += fit1s[i]*y[i]
               xty2 += fit2s[i]*y[i]
           end
           # Now I take the Cholesky decomposition of S, calculating the upper triangle.
           U11 = sqrt(S11)
           U12 = S12 / U11
           U22 = sqrt(S22 - U12^2)
           # U' * U = S
           # This gives us
           # U' * U * beta = X'y
           # triangular systems of equations are easy to solve, so below we solve them twice
           # we can do that in three steps.
           v = xty1 / U11
           # Now to get w for the first solve, we need
           # w = (xty2 - U12 * v) / U22
           # but then to get w for the next solve, we need w / U22
           # so we combine these into one step by squaring the denominator.
           w = (xty2 - U12 * v) / U22^2
           v = ( v -  U12 * w) / U11
           v, w
       end
curvefit2 (generic function with 2 methods)

julia> @btime curvefit2($fit1, $fit2, $x3, $y3)
  349.281 μs (4 allocations: 60.66 KiB)
(0.7129037092076219, 0.4294629282267536)

350 microseconds.

We probably cannot get much better than that; the fits alone take over 310 microseconds:

julia> @btime (fit1($x3),fit2($x3))
  313.138 μs (31 allocations: 261.63 KiB)
3 Likes

That is nice, but StaticOptim isn’t getting you much here. Regular Optim is almost as fast and is a more standard dependency.

using Optim
function ocurvefit(fit1, fit2, x, y, params)
           fit1s = fit1.(x)
           fit2s = fit2.(x)
          optimize(params, BFGS()) do p
               obj = 0.0
               @inbounds @simd for i ∈ eachindex(x, y)
                   obj += (p[1]*fit1s[i] .+ p[2]*fit2s[i] - y[i])^2
               end
               obj
           end
       end
julia> sres = @btime ocurvefit($fit1, $fit2, $x3, $y3,  [1.0, 1.0]);
  688.529 μs (126 allocations: 67.63 KiB)

Fair. I think it’d be much better to follow my second approach using StaticArrays.jl. StaticArrays is a fairly standard dependency, and if all you’re doing is least squares, you should probably just use the closed form solution.

Linear least squares problems can be solved very efficiently with just \ (via QR least squares); it’s pretty suboptimal to use a generic nonlinear optimization method. No need for StaticArrays either.

1 Like

My service example, curvefit2, solved the least squares problem (normal equations) using a Cholesky decomposition.
It was simple to work the math out by hand, and an approach I’m more familiar with.

I suggested StaticArrays to allow for variable numbers of parameters while maintaining more or less the performance of my second example.
With only two parameters, BLAS was slow.

I did not try the QR decomposition.

Thanks for the effort and responsiveness. There is a fair bit to comprehend here. Definitely some Julia syntax I’m unfamiliar with. I suppose it is good that fast performance is possible with Julia, but it seems a lot less straightforward than the Python code. Is this just a corner case where the Python function is particularly well optimized? I’m still confused by the poor performance of the LsqFilt implementation.

I believe curve_fit uses an iterative algorithm.
By far the slowest part of the code is interpolating.
My code (and @aaowens’) gained mostly by interpolating only once per fit, and not once per iteration of the iterative algorithm.

I (and others) can answer any specific questions you may have after looking through the documentation.

The performance deficit between the straightforward Julia and Python implementations comes largely from the fact that Numpy offers a compiled, carefully-tuned interpolation routine with implicit multithreading. Julia’s native threading infrastructure is still young, and consequently threading isn’t yet baked into the package ecosystem.

1 Like

A number of performance problems have now been fixed and are available in Interpolations 0.12.

Just as Interpolations was “cheating” on your first example, Dierckx is “cheating” on your second. You can see this if you try

p = randperm(length(xs))
xsr = xs[p]
@btime $fit1($xs)
@btime $fit1($xsr)

This explains most of the remaining gap; Interpolations relies on searchsortedfirst to find the bracketing knots for the interpolation point, but if you traverse them in order you can find the next knot more efficiently. It would be good if someone implements this form of “cheating” for Interpolations, too.

For 1d gridded interpolation, almost all of the time is spent in searchsortedfirst. Someone who wants to make it this package faster in such cases might well take a careful look at this function and see if there are untapped performance opportunities. (As well as adding the “cheating” for sorted vector inputs.)

4 Likes

This is my experience for higher dimensions, too, when solving dynamic programming problems in economics using Interpolations.jl (interpolating a value function).

FWIW, in case the user has control over the grid, then using a uniform grid, represented by an AbstractRange, is almost always worth it (if the problem is heavily nonlinear, this should be combined with a variable transformation). searchsortedfirst is O(1) for this case.

2 Likes

Yes indeed. However, in higher dimensions you can amortize some of this cost if your evaluation points are on a cartesian grid:

julia> knots = ([0;sort(rand(8));1], [0;sort(rand(8));1])
([0.0, 0.00193145, 0.240481, 0.38677, 0.396281, 0.462792, 0.875859, 0.943712, 0.960067, 1.0], [0.0, 0.092186, 0.200238, 0.528178, 0.554724, 0.617535, 0.775003, 0.946701, 0.991103, 1.0])

julia> A = rand(10, 10);

julia> itp = interpolate(knots, A, Gridded(Linear()));

julia> xs, ys = rand(100), rand(100);

julia> @btime $itp($xs, $ys);
  30.332 μs (12 allocations: 83.39 KiB)

julia> interp_each(itp, xs, ys) = [itp(x, y) for x in xs, y in ys]
interp_each (generic function with 1 method)

julia> @btime interp_each($itp, $xs, $ys);
  301.992 μs (6 allocations: 78.30 KiB)

In the first case, searchsortedfirst is a nearly-negligible fraction of the cost.

You are of course right; but in the application domain I mentioned above (value/policy iteration, discrete time) unfortunately you are effectively just evaluating single points.

I wonder if Interpolations.jl could have (already has?) a mechanism for caching the grid-bin lookup. Eg if I am interpolating V(x, y), and want to calculate

\arg\max_{y \in [a, b]} V(x, y) \quad \text{given} \quad x

then I could somehow do (mockup code)

xg = lookup(x_grid, x)
optimize(y -> V(xg, y), a, b)

where V is the interpolated object.

I wonder if Interpolations.jl could have (already has?) a mechanism for caching the grid-bin lookup.

It does for scaled but not gridded: https://github.com/JuliaMath/Interpolations.jl/blob/d1ad2a1409ce6cea96cd304a540c485d2b393f5d/src/scaling/scaling.jl#L157-L226

Worth noting that in 2D, Interpolations is ~4x faster (on my machine) than Dierckx for both point-by-point and grid evaluation. And that for both there’s a 10x gap between point-by-point and grid. It’s quite remarkable what a difference the presence or absence of a couple of optimizations can have in this space.

2 Likes

To aid in my comprehension, a few comments in the code would be helpful.

fit1 = interpolate((x1,), y1, Gridded(Linear()));
fit2 = interpolate((x2,), y2, Gridded(Linear()));

function myfit(fit1, fit2, x, y)
    fity1 = fit1(x)
    fity2 = fit2(x)
    
    A = [fity1 fity2]
    A\y
end

ps = @btime myfit(fit1, fit2, x3, y3); # ~440µs

fity1 = fit1(x3);
fity2 = fit2(x3);
model(x,p) = p[1]*fity1 .+ p[2]*fity2

rst = @btime curve_fit(model, x3, y3, [1.0, 1.0]); # ~800µs
rst.param ≈ ps # true

model2(x,p) = p[1]*fit1(x) .+ p[2]*fit2(x)
rst = @btime curve_fit(model2, x3, y3, [1.0, 1.0]); # ~20ms
rst.param ≈ ps # true

myfit in the code above is maybe a bit easier to understand and almost as fast as @Elrod’s code (which takes ~350µs on my machine).

Now this is really great. I actually understand what’s going on too.

I added comments.
My code is actually doing the same thing, but I turned the matrix operations into for loops / wrote them out element by element.

I fixed a bug, so it should also get the same answer.

I do have one question. What if you wanted to have a parameter adjust not only magnitude of the two fits but their position in time. Like this:

model(x,p) = p[1]*fit1(x.+p[3]).+p[2]*fit2(x.+p[4]);

This is my more complete problem. It is also why I originally wanted extrapolation. How would you handle this in myfit?

Currently the code is

LinearInterpolation(range::AbstractRange, vs::AbstractVector; extrapolation_bc = Throw()) =
    extrapolate(scale(interpolate(vs, BSpline(Linear())), range), extrapolation_bc)
LinearInterpolation(range::AbstractVector, vs::AbstractVector; extrapolation_bc = Throw()) =
    extrapolate(interpolate((range, ), vs, Gridded(Linear())), extrapolation_bc)

i.e. pass in a vector if irregular grid or a range otherwise.

To be honest, I think that the Throw() is probably a bad default if it has these performance characteristics. What if it was changed to something like

LinearInterpolation2(range::AbstractVector, vs::AbstractVector; extrapolation_bc = nothing) =
    extrapolation_bc === nothing ? interpolate((range, ), vs, Gridded(Linear())) : extrapolate(interpolate((range, ), vs, Gridded(Linear())), extrapolation_bc)

itp3 = LinearInterpolation2(x, y)
@btime $itp3($x2);

itp4 = LinearInterpolation2(x, y, extrapolation_bc = Throw())
@btime $itp4($x2);

If you agree to this in principle, we could prepare a PR that covers the cases.