Why are my linear interpolations 10x faster in MATLAB?

That is exactly what I had in mind but, at least in the current implementation, it doesn’t look like it speeds up splatting at all, unlike how tuples do:

julia> hybridpts = HybridMatrix{2,StaticArrays.Dynamic()}(pts);

julia> ftest(x,y) = x+y
ftest (generic function with 1 method)

julia> @benchmark [ i -> ftest($pts[:,i]...) for i in 1:size($pts,2) ]
BenchmarkTools.Trial:
  memory estimate:  234.50 KiB
  allocs estimate:  10004
  --------------
  minimum time:     56.199 μs (0.00% GC)
  median time:      62.299 μs (0.00% GC)
  mean time:        77.428 μs (10.67% GC)
  maximum time:     4.965 ms (98.02% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark [ i -> ftest($hybridpts[:,i]...) for i in 1:size($hybridpts,2) ]
BenchmarkTools.Trial:
  memory estimate:  234.50 KiB
  allocs estimate:  10004
  --------------
  minimum time:     58.600 μs (0.00% GC)
  median time:      62.200 μs (0.00% GC)
  mean time:        76.910 μs (10.66% GC)
  maximum time:     3.062 ms (96.60% GC)
  --------------
  samples:          10000
  evals/sample:     1

There are some rough edges at the moment. I managed to speed this up:

julia> f(u) = map(i -> ftest(u[:, i].data...), 1:size(u, 2))
f (generic function with 2 methods)

julia> @benchmark f($hybridpts)
BenchmarkTools.Trial: 
  memory estimate:  78.25 KiB
  allocs estimate:  4
  --------------
  minimum time:     11.845 μs (0.00% GC)
  median time:      12.793 μs (0.00% GC)
  mean time:        14.871 μs (7.99% GC)
  maximum time:     759.355 μs (93.22% GC)
  --------------
  samples:          10000
  evals/sample:     1

map is many cases easier to optimize for the compiler. You can also get fast tuple splatting by accessing the data field of a StaticArray (which is a tuple).

1 Like

Very cool! See comparison below to disentangle the effect of map from the effect of .data.

julia> fbench(u::HybridMatrix) = map(i -> ftest(u[:, i].data...), 1:size(u, 2))
fbench (generic function with 2 methods)

julia> fbench(u::Matrix) = map(i -> ftest(u[:, i]...), 1:size(u, 2))
fbench (generic function with 3 methods)

julia> @benchmark fbench($hybridpts)
BenchmarkTools.Trial: 
  memory estimate:  78.25 KiB
  allocs estimate:  4
  --------------
  minimum time:     10.101 μs (0.00% GC)
  median time:      13.001 μs (0.00% GC)
  mean time:        25.946 μs (18.66% GC)
  maximum time:     9.484 ms (99.45% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark fbench($pts)
BenchmarkTools.Trial:
  memory estimate:  1.45 MiB
  allocs estimate:  40004
  --------------
  minimum time:     1.122 ms (0.00% GC)
  median time:      1.173 ms (0.00% GC)
  mean time:        1.353 ms (7.93% GC)
  maximum time:     11.083 ms (75.14% GC)
  --------------
  samples:          3688
  evals/sample:     1
2 Likes

So if I take my original evaluate! function, require pts to be a HybridMatrix and add the .data to the splat, I beat ALL of the single-threaded results above!

function evaluate5!(pts::HybridMatrix,fct)
    Random.seed!(100)
    Niter = 1_000
    result = Matrix{Float64}(undef,size(pts,2),Niter)
    @inbounds for j in 1:Niter
        pts[2,:] .= rand(Npt)*6 .- 3
        for p in axes(pts,2)
            result[p,j] = fct(pts[:,p].data...)
        end
    end
    return result
end

Benchmark gives:

julia> @benchmark evaluate5!(hybridpts,fct)
BenchmarkTools.Trial:
  memory estimate:  229.03 MiB
  allocs estimate:  4004
  --------------
  minimum time:     236.584 ms (2.57% GC)
  median time:      262.411 ms (6.30% GC)
  mean time:        272.907 ms (12.09% GC)
  maximum time:     536.012 ms (53.52% GC)
  --------------
  samples:          19
  evals/sample:     1

Multi-threading can shave off an additional 40 ms or so – a much smaller impact than for the tuple implementations above. So those are still a bit faster, but the benefit of evaluate5! is that (1) it allows for multi-threading upstream, where it would likely be more effective, (2) its constructor is 2,000 times faster than ntuple so “preparing” the original pts matrix for the call takes way less time:

julia> @benchmark ntuple(n->$pts[n,:],2)
BenchmarkTools.Trial: 
  memory estimate:  156.45 KiB
  allocs estimate:  6
  --------------
  minimum time:     23.399 μs (0.00% GC)
  median time:      30.000 μs (0.00% GC)
  mean time:        48.854 μs (20.82% GC)
  maximum time:     6.363 ms (98.31% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark HybridMatrix{2,StaticArrays.Dynamic()}($pts)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     9.499 ns (0.00% GC)
  median time:      15.199 ns (0.00% GC)
  mean time:        15.518 ns (8.67% GC)
  maximum time:     6.867 μs (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Thank you, @mateuszbaran. This is awesome.

4 Likes

No, as far as I’m aware, parfor uses multi-processing, while @threads uses shared-memory threads. They are different things. In Matlab you have no way of explicitly using threads, you can only reap the benefit of functions that have been implemented with multithreading in e.g. C++, or the like.

@elenev You should test to see whether turning off Matlab multi-threading has any effect on it’s timings:

maxNumCompThreads(1) % Use single thread

Good point. Didn’t know I could control this.

With multi-threading on (4 threads = 4 cores): 260 ms
Without: 456 ms

Since my MATLAB code had some other operations (sampling a vector of rands) that we know are multi-threaded by default, I also did this with a loop just around querying the interpolant and got the same 1.5x-2.0x speed up. So yes, griddedInterpolant does seem to be multi-threaded like @Elrod suspected.

2 Likes