Why are my linear interpolations 10x faster in MATLAB?

The consensus on this forum seems to be that Interpolations.jl is extremely fast, so I must not be using it in a smart way.

Here’s an MWE of the MATLAB code I’m trying to adapt to Julia. If some of it looks weird (e.g. generating random numbers inside a loop), that’s because those parts are runnable placeholders for more complex operations in my application that would just be a distraction here.

f=@(a,b) a.^2 + b.^2;

xs = linspace(0,5,100);
ys = linspace(-3,3,200);

[XX,YY] = meshgrid(xs,ys);
vals = f(XX,YY);

fct = griddedInterpolant({xs, ys}, vals', 'linear', 'none');

Npt = 1e4;
pts = [ rand(Npt,1)*5, zeros(Npt,1) ];

Ntrials = 10;
tic;
for i=1:Ntrials
	N = 1e3;
	results = zeros(Npt,N);
	for j=1:N
		pts(:,2) = rand(Npt,1) * 6 - 3;
		results(:,j) = fct(pts);
	end
end
elapsedtime=toc/Ntrials;
fprintf('mean time: %0.2f ms\n', elapsedtime*1000 );

When I run this, I get:

mean time: 284.85 ms

Now, here’s how I implemented this in Julia:

using Interpolations, BenchmarkTools
f(a,b) = a^2 + b^2
const xs = range(0,5,length=100)
const ys = range(-3,3,length=200)
const vals = [f(x,y) for x in xs, y in ys]
const fct = LinearInterpolation( (xs,ys), vals)
const Npt = 10_000;
pts = [rand(1,Npt); zeros(1,Npt)]

function evaluate!(pts,fct)
    N = 1_000
    result = Matrix{Float64}(undef,size(pts,2),N)
    for j in 1:N
        pts[2,:] = rand(1,Npt)*6 .- 3
        # Need this loop b/c unlike Matlab fct here requires scalar args
        for p in axes(pts,2)
            result[p,j] = fct(pts[:,p]...)
        end
    end
    return result
end

benchres = @benchmark evaluate!(pts,fct)

The output of @show benchres is

benchres = Trial(2.393 s)
BenchmarkTools.Trial:
  memory estimate:  1.64 GiB
  allocs estimate:  40006002
  --------------
  minimum time:     2.393 s (27.67% GC)
  median time:      2.415 s (27.72% GC)
  mean time:        2.408 s (27.91% GC)
  maximum time:     2.416 s (28.33% GC)
  --------------
  samples:          3
  evals/sample:     1

I can reduce this ~3x, down to 840 ms and only 6000 allocations, by replacing pts[:,p]... with pts[1,p],pts[2,p], but (1) that’s still >3x slower than MATLAB, and (2) the evaluate! function now only works for 2-D functions. (Yes, I know I can replace the splat by dispatching on the type parameter of the interpolant object, but that’s pretty ugly, and the resulting time is still bounded below by 840 ms.)

Where did I mess up?

You’re splatting an array: the call fct(pts[:,p]...) allocates a copy, and array splatting is generally slow. Do fct(pts[1,p], pts[2,p]) instead, and throw in an @inbounds to avoid array bounds checking. This speeds it up by a factor of 10 on my machine.


function evaluate!(pts,fct)
    N = 1_000
    result = Matrix{Float64}(undef,size(pts,2),N)
    @inbounds for j in 1:N
        pts[2,:] = rand(1,Npt)*6 .- 3
        # Need this loop b/c unlike Matlab fct here requires scalar args
        for p in axes(pts,2)
            result[p,j] = fct(pts[1,p],pts[2,p])
        end
    end
    return result
end

Thanks. As I mentioned in my post, I tried that:

I can reduce this ~3x, down to 840 ms and only 6000 allocations, by replacing pts[:,p]... with pts[1,p],pts[2,p], but (1) that’s still >3x slower than MATLAB, and (2) the evaluate! function now only works for 2-D functions. (Yes, I know I can replace the splat by dispatching on the type parameter of the interpolant object, but that’s pretty ugly, and the resulting time is still bounded below by 840 ms.)

In other words, it does help considerably but it still leaves me with 3x the time of Matlab, and comes at the expense of generic applicability (i.e. I won’t always have 2 dimensions).

Adding @inbounds on top of that lowers it to 694 ms, still about 2.5x slower.

MATLAB:

pts(:,2) = rand(Npt,1) * 6 - 3;

Julia:

pts[2,:] = rand(1,Npt)*6 .- 3

Given that pts is 2 x 10_000, assigning to 10_000 elements on each iteration is going to hurt performance (and give you the wrong answer).

2 Likes

Ugh duh, sorry. The vector vs. one-row matrix always confuses me when I’m going back and forth between languages. Fixed:

pts[2,:] = rand(Npt)*6 .- 3

With no splatting and with @inbounds, it’s now 650 ms. If I further replace = with .=, I can shave off another 2000 allocations and 50 ms. But we’re still 2x MATLAB time, and the code is still less flexible b/c it’s hard-coding two dimensions.

Ok, try replacing the inner loop with

result[:,j] .= fct.(eachrow(pts)...)
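
For the 2 x Npt layout used here, eachrow(pts) yields the two coordinate rows, so the broadcast evaluates fct at every column in one shot. Inside evaluate!, the inner loop would become something like the sketch below (note that splatting eachrow has a runtime-determined length, so it may not infer as well as a fixed-arity call):

for j in 1:N
    pts[2, :] .= rand(Npt) .* 6 .- 3
    result[:, j] .= fct.(eachrow(pts)...)  # broadcast over the coordinate rows
end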

What are the dimensions of pts in Julia and MATLAB?
If in MATLAB it is 10_000 x 2, and in Julia it is 2 x 10_000, I’d like to point out that:

julia> x1 = rand(2,10_000);

julia> x2 = rand(10_000,2);

julia> @benchmark $x1[1,:] .= 0.5
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.142 μs (0.00% GC)
  median time:      3.431 μs (0.00% GC)
  mean time:        3.454 μs (0.00% GC)
  maximum time:     6.034 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark $x2[:,1] .= 0.5
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.132 μs (0.00% GC)
  median time:      1.175 μs (0.00% GC)
  mean time:        1.179 μs (0.00% GC)
  maximum time:     2.933 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Of course, doing it a thousand times will only add up to another couple of milliseconds (roughly 2 μs saved per assignment times 1000 iterations).

2 Likes

Why would x1 and x2 have different performance? It seems like that shouldn’t be a thing, and if it is a thing, then it should be documented somewhere.

1 Like

Row-major vs. column-major storage. Julia arrays are column-major, and it is faster to access items in the order in which they are stored.
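
For example (a toy sketch): column-major means the first index varies fastest in memory, so the inner loop should run over it:

function fill_colmajor!(A)
    for j in axes(A, 2)
        for i in axes(A, 1)   # inner loop walks down a column: contiguous access
            A[i, j] = 0.5
        end
    end
    return A
end

Swapping the two loops computes the same thing with strided access, which is what the x1 benchmark above is paying for.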

5 Likes

This is general (other than requiring at least 2 dimensions, which the MATLAB code also does), and fairly fast:

using Interpolations, Random, BenchmarkTools
f(a,b) = a^2 + b^2
const xs = range(0,5,length=100);
const ys = range(-3,3,length=200);
const vals = [f(x,y) for x in xs, y in ys];
const fct = LinearInterpolation( (xs,ys), vals);
const Npt = 10_000;
pts1 = rand(Npt);
pts2 = zeros(Npt);
function evaluate!(fct, pts::Vararg{<:AbstractVector,N}) where {N}
    Niter = 1_000
    pts2 = pts[2]
    result = Matrix{Float64}(undef,length(pts2),Niter)
    @inbounds for j in 1:Niter
        rand!(pts2)
        # Need this loop b/c unlike Matlab fct here requires scalar args
        for p in axes(pts2,1)
            pts2[p] = 6pts2[p] - 3
            result[p,j] = fct(ntuple(n -> pts[n][p], Val(N))...)
        end
    end
    return result
end

This yields:

julia> @benchmark evaluate!(fct, $pts1, $pts2)
BenchmarkTools.Trial:
  memory estimate:  76.29 MiB
  allocs estimate:  2
  --------------
  minimum time:     138.638 ms (0.00% GC)
  median time:      140.861 ms (0.66% GC)
  mean time:        144.204 ms (3.04% GC)
  maximum time:     208.384 ms (32.85% GC)
  --------------
  samples:          35
  evals/sample:     1
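
The ingredient that keeps this both generic and fast is ntuple(n -> pts[n][p], Val(N)): because N is a type parameter, the tuple construction unrolls at compile time and the splat has a fixed arity the compiler can see. A stripped-down illustration (the helper name is mine, not from the code above):

# Builds the p-th point as a fully inferred NTuple{N,...} of coordinates
coords(pts, p, ::Val{N}) where {N} = ntuple(n -> pts[n][p], Val(N))

# coords((pts1, pts2), 7, Val(2)) returns (pts1[7], pts2[7]) as a
# Tuple{Float64,Float64}, so fct(coords(...)...) compiles down to a
# plain two-argument call with no runtime-length splat.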

@chakravala
Here is the documentation.
Also worth pointing out that how matrices/arrays are stored varies by language. NumPy is row-major by default, for example, and C++ libraries vary or let you choose.

3 Likes

It’s a CPU-cache issue. Memory is transferred between the CPU and main memory in chunks of 64 consecutive bytes or so, called a cache line; each such transfer is a “cache fill”. If possible, you should use all the bytes in a cache line before you advance to the next. That is what happens with x2, but with x1 only half of them are used, so you need twice as many cache fills, and each fill can cost a significant number of clock cycles (50, 100, or 300, depending on your CPU architecture; prefetching helps, but may not alleviate the problem entirely).

5 Likes

It’s also a vectorization issue.

If you have AVX2, you can do:

@fastmath begin
x[1,1] = 6x[1,1] - 3
x[2,1] = 6x[2,1] - 3
x[3,1] = 6x[3,1] - 3
x[4,1] = 6x[4,1] - 3
end

with 1 load instruction, 1 fused multiply add, and 1 store (you’ll also have to load the constants, but that’s only necessary once if this happens in a loop).
Meanwhile, this:

@fastmath begin
x[1,1] = 6x[1,1] - 3
x[1,2] = 6x[1,2] - 3
x[1,3] = 6x[1,3] - 3
x[1,4] = 6x[1,4] - 3
end

will probably use 4 load instructions, 4 fused multiply add instructions, and 4 stores.

With big matrices, the first one is going to be memory bound rather than compute bound, so the cache issue you describe will be more important.
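
If you want to check what the compiler actually emits for such a loop, something along these lines works (a sketch; the exact IR depends on your CPU and Julia version):

function scale!(x::Vector{Float64})
    @inbounds @fastmath for i in eachindex(x)
        x[i] = 6x[i] - 3   # fma-able update over contiguous memory
    end
    return x
end

# @code_llvm debuginfo=:none scale!(rand(1000))
# Look for vector types like <4 x double> (AVX2) or <8 x double> (AVX-512)
# in the output; if they are absent, the loop stayed scalar.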

6 Likes

Man, this is a powerful demonstration of how important it is to store the number of dimensions in a type rather than have it determined at runtime. Since your code has a couple of different optimizations (including the mutating rand!, which I hadn’t thought of), I decided to go from my code to yours step by step. I also added a line inside each function to fix the seed so that I could make sure all of them produce the same results (they do).

using Interpolations, BenchmarkTools, Random
f(a,b) = a^2 + b^2
const xs = range(0,5,length=100)
const ys = range(-3,3,length=200)
const vals = [f(x,y) for x in xs, y in ys]
const fct = LinearInterpolation( (xs,ys), vals)
const Npt = 10_000;
pts = [rand(1,Npt); zeros(1,Npt)]

function evaluate!(pts,fct)
    Random.seed!(100)
    Niter = 1_000
    nd = size(pts,1)
    result = Matrix{Float64}(undef,size(pts,2),Niter)
    @inbounds for j in 1:Niter
        pts[2,:] .= rand(Npt)*6 .- 3
        for p in axes(pts,2)
            result[p,j] = fct(pts[:,p]...)
            #result[p,j] = fct(pts[1,p],pts[2,p])
        end
    end
    return result
end

function evaluate2!(pts::NTuple{N,<:AbstractVector},fct) where {N}
    Random.seed!(100)
    Niter = 1_000
    Npt=length(pts[2])
    result = Matrix{Float64}(undef,Npt,Niter)
    @inbounds for j in 1:Niter
        pts[2] .= 6rand(Npt) .- 3
        for p in 1:Npt
            result[p,j] = fct(ntuple(n -> pts[n][p], Val(N))...)
        end
    end
    return result
end

function evaluate3!(pts::NTuple{N,<:AbstractVector},fct) where {N}
    Random.seed!(100)
    Niter = 1_000
    pts2=pts[2]
    result = Matrix{Float64}(undef,length(pts2),Niter)
    @inbounds for j in 1:Niter
        rand!(pts2)
        for p in axes(pts2,1)
            pts2[p] = 6pts2[p] - 3
            result[p,j] = fct(ntuple(n -> pts[n][p], Val(N))...)
        end
    end
    return result
end

function evaluateElrod!(fct, pts::Vararg{<:AbstractVector,N}) where {N}
    Random.seed!(100)
    Niter = 1_000
    pts2 = pts[2]
    result = Matrix{Float64}(undef,length(pts2),Niter)
    @inbounds for j in 1:Niter
        rand!(pts2)
        # Need this loop b/c unlike Matlab fct here requires scalar args
        for p in axes(pts2,1)
            pts2[p] = 6pts2[p] - 3
            result[p,j] = fct(ntuple(n -> pts[n][p], Val(N))...)
        end
    end
    return result
end
  1. evaluate! is my original function.
  2. evaluate2! replaces the matrix with a tuple of vectors and splats a tuple instead of a SubArray (btw, I also tried a vector of vectors instead of a tuple of vectors, and that was 10x slower than my original code, so I’m not even showing that result)
  3. evaluate3! is evaluate2! but with the mutating rand!
  4. evaluateElrod! is @Elrod ’s function, which is like evaluate3! except it collects the coordinate vectors using Vararg instead of taking a tuple as an argument.

Here are the results:

julia> @benchmark evaluate!($pts,fct)
BenchmarkTools.Trial:
  memory estimate:  1.56 GiB
  allocs estimate:  40004004
  --------------
  minimum time:     1.961 s (6.56% GC)
  median time:      2.023 s (15.61% GC)
  mean time:        2.095 s (13.33% GC)
  maximum time:     2.303 s (17.07% GC)
  --------------
  samples:          3
  evals/sample:     1
julia> @benchmark evaluate2!(ntuple(n->$pts[n,:],2),fct)
BenchmarkTools.Trial:
  memory estimate:  229.19 MiB
  allocs estimate:  4010
  --------------
  minimum time:     391.896 ms (3.32% GC)
  median time:      396.804 ms (3.35% GC)
  mean time:        414.187 ms (7.33% GC)
  maximum time:     586.854 ms (36.01% GC)
  --------------
  samples:          13
  evals/sample:     1
julia> @benchmark evaluate3!(ntuple(n->$pts[n,:],2),fct)
BenchmarkTools.Trial:
  memory estimate:  76.45 MiB
  allocs estimate:  10
  --------------
  minimum time:     337.228 ms (0.00% GC)
  median time:      350.747 ms (1.46% GC)
  mean time:        361.871 ms (5.70% GC)
  maximum time:     460.437 ms (26.81% GC)
  --------------
  samples:          14
  evals/sample:     1
julia> @benchmark evaluateElrod!(fct,$pts[1,:],$pts[2,:])
BenchmarkTools.Trial:
  memory estimate:  76.45 MiB
  allocs estimate:  8
  --------------
  minimum time:     342.385 ms (0.18% GC)
  median time:      352.621 ms (1.47% GC)
  mean time:        366.353 ms (5.62% GC)
  maximum time:     468.002 ms (26.52% GC)
  --------------
  samples:          14
  evals/sample:     1

The extra allocations in my benchmark of @Elrod ’s code relative to his are due to (1) slicing pts inside the benchmark as opposed to outside (4 allocs), and (2) fixing the seed (2 allocs). These extra allocations do not affect the speed by more than a few milliseconds, i.e. within a 95% CI of the mean.

Upshot: by far the most important improvement is from evaluate! to evaluate2!, where the matrix is replaced with the tuple of vectors. This is all about N being in the type. As I mentioned, if I replace the tuple with a vector of vectors and Val(N) with Val(ND), where ND = length(pts) is computed somewhere outside the loops, it takes 20 seconds! But given N in the type, it doesn’t matter much whether a variable number of arguments or a tuple is passed in. I prefer the tuple because it allows the code calling the function to also be agnostic about the number of dimensions, without splatting.

What would be really nice is if there were some cousin of StaticArrays that would let you “type” all but one dimension. E.g. for an MMatrix, you could say that it will have 5 columns and an unspecified number of rows. Otherwise, I have to keep doing stuff like ntuple(n->$pts[n,:],2), which is kinda boilerplate.
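
The closest workaround I’ve found is to reinterpret the matrix as a vector of SVectors, which moves the dimension into the element type (a sketch, assuming StaticArrays is loaded; the reinterpreted vector shares memory with pts, so mutations stay visible):

using StaticArrays
# 2 x Npt Matrix{Float64} -> length-Npt vector of SVector{2,Float64};
# each element is one column (one point), with the dimension in its type
ptsv = vec(reinterpret(SVector{2,Float64}, pts))
# Tuple(ptsv[p]) is a compile-time-sized tuple, so this splat is cheap:
# result[p,j] = fct(Tuple(ptsv[p])...)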

Btw, all of these are still marginally slower than MATLAB. I guess the griddedInterpolant object is just crazy fast. It’s definitely not written in MATLAB itself.

2 Likes

I suspect someone can make this faster, but interestingly, nobody seems to know what it’s written in or what algorithm it uses.

4 Likes

A small nitpick: this line in evaluate! can be replaced by

pts[2,:] .= rand.() .* 6 .- 3

which is equivalent (there is a new call to rand() in every iteration) and doesn’t allocate an Npt-element array on each iteration. This doesn’t speed up your code much, it just saves a few allocations, but it’s a good trick to keep in mind in general.
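
A quick way to see the difference (code only; exact numbers vary by machine):

using BenchmarkTools
pts = rand(2, 10_000)
@btime $pts[2,:] .= rand(10_000) .* 6 .- 3  # allocates a temporary vector every call
@btime $pts[2,:] .= rand.() .* 6 .- 3       # fully fused: no temporary at all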

10 Likes

I’ve heard that broadcasting is multithreaded in MATLAB by default. Any chance that’s what is happening here?
Either way, we could multithread it in Julia:

julia> using Interpolations, Random, BenchmarkTools

julia> f(a,b) = a^2 + b^2
f (generic function with 1 method)

julia> const xs = range(0,5,length=100);

julia> const ys = range(-3,3,length=200);

julia> const vals = [f(x,y) for x in xs, y in ys];

julia> const fct = LinearInterpolation( (xs,ys), vals);

julia> const Npt = 10_000;

julia> pts1 = rand(Npt);

julia> pts2 = zeros(Npt);

julia> function evaluateElrod!(fct, pts::Vararg{<:AbstractVector,N}) where {N}
           Random.seed!(100)
           Niter = 1_000
           pts2 = pts[2]
           result = Matrix{Float64}(undef,length(pts2),Niter)
           @inbounds for j in 1:Niter
               rand!(pts2)
               # Need this loop b/c unlike Matlab fct here requires scalar args
               Threads.@threads for p in axes(pts2,1)
                   pts2[p] = 6pts2[p] - 3
                   result[p,j] = fct(ntuple(n -> pts[n][p], Val(N))...)
               end
           end
           return result
       end
evaluateElrod! (generic function with 1 method)

julia> @benchmark evaluateElrod!(fct,$pts1,$pts2)
BenchmarkTools.Trial:
  memory estimate:  83.62 MiB
  allocs estimate:  51517
  --------------
  minimum time:     41.556 ms (0.00% GC)
  median time:      43.659 ms (1.21% GC)
  mean time:        44.964 ms (4.54% GC)
  maximum time:     116.069 ms (60.29% GC)
  --------------
  samples:          112
  evals/sample:     1

julia> versioninfo()
Julia Version 1.6.0-DEV.205
Commit a018dc0ad5* (2020-06-10 04:25 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.0 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 10

I saw much better performance by setting the number of threads equal to the number of physical cores rather than the number of logical cores, but you could benchmark both, as results may vary.
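
For reference, you can check what the process actually got (JULIA_NUM_THREADS must be set before startup; on Julia 1.5+ there is also a -t/--threads startup flag):

julia> Threads.nthreads()
10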

Also, the original evaluate! was actually about as fast on my computer as on yours, or maybe a little slower:

julia> @benchmark evaluate!($pts,fct)
BenchmarkTools.Trial:
  memory estimate:  2.76 GiB
  allocs estimate:  50004004
  --------------
  minimum time:     1.875 s (9.27% GC)
  median time:      2.223 s (13.94% GC)
  mean time:        2.149 s (12.63% GC)
  maximum time:     2.349 s (14.07% GC)
  --------------
  samples:          3
  evals/sample:     1

So I’d hoped you’d see a similar speedup:

julia> @benchmark evaluate3!(ntuple(n->$pts[n,:],2),fct)
BenchmarkTools.Trial:
  memory estimate:  76.45 MiB
  allocs estimate:  9
  --------------
  minimum time:     136.669 ms (0.00% GC)
  median time:      139.478 ms (0.00% GC)
  mean time:        143.595 ms (3.01% GC)
  maximum time:     208.128 ms (33.02% GC)
  --------------
  samples:          35
  evals/sample:     1

If you had, then that’d already have been faster than MATLAB’s time, without the need for threading.
I do not have a MATLAB license, so I have no idea how fast MATLAB’s griddedInterpolant would be on this computer.

4 Likes

This has been a minor but still weird “pain point” of mine while moving from MATLAB to Julia. I kind of liked griddedInterpolant, but there isn’t any direct equivalent in Julia, and there isn’t enough information about its internals to remake it! There is a longstanding open issue in Interpolations.jl regarding cubic splines on gridded data, but nothing concrete has emerged from it.

2 Likes

Thanks for your response.

Here’s my benchmark for the multi-threaded version of your function. Is it correct to compare Julia’s Threads.@threads for to MATLAB’s parfor?

BenchmarkTools.Trial: 
  memory estimate:  79.75 MiB
  allocs estimate:  23664
  --------------
  minimum time:     137.605 ms (0.00% GC)
  median time:      166.774 ms (0.35% GC)
  mean time:        175.249 ms (9.40% GC)
  maximum time:     301.986 ms (50.10% GC)
  --------------
  samples:          29
  evals/sample:     1

But our setups are quite different, so these numbers should be compared with my earlier ones, not with yours.

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4 #physical cores

If MATLAB’s default is indeed to multi-thread vectorized query (“evaluate”) calls, how would that work if the queries themselves happen in a multi-threaded loop, e.g. in parfor? I guess I don’t actually care that much about MATLAB’s internals; I care about what would happen if I did this in Julia, i.e. if I nested multithreaded loops (across a function barrier, presumably). Concretely, I mean something like the sketch below.
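
(Just an illustrative sketch of the shape, not code from this thread; my understanding is that Threads.@spawn creates tasks, so nested parallelism composes instead of oversubscribing threads.)

function outer!(chunks)
    @sync for chunk in chunks
        Threads.@spawn inner!(chunk)   # one task per chunk
    end
end

function inner!(chunk)
    # placeholder work; this could itself spawn more tasks, and the
    # scheduler balances all of them globally rather than nesting pools
    map!(x -> 6x - 3, chunk, chunk)
end

# outer!([rand(1_000) for _ in 1:16])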

Re irregular grid splines: We came across this a few years ago to speed up our gridded spline interpolations in MATLAB.

I have basically no knowledge of C or C++, so I don’t think I’d be able to adapt this to Julia myself, but if the lack of a good algorithm is what’s holding folks back, maybe this helps. It’s an order of magnitude faster than the built-in MATLAB functions it replaces.