That is exactly what I had in mind, but, at least in the current implementation, it doesn’t look like it speeds up splatting at all, the way tuples do:
julia> hybridpts = HybridMatrix{2,StaticArrays.Dynamic()}(pts);
julia> ftest(x,y) = x+y
ftest (generic function with 1 method)
julia> @benchmark [ ftest($pts[:,i]...) for i in 1:size($pts,2) ]
BenchmarkTools.Trial:
memory estimate: 234.50 KiB
allocs estimate: 10004
--------------
minimum time: 56.199 μs (0.00% GC)
median time: 62.299 μs (0.00% GC)
mean time: 77.428 μs (10.67% GC)
maximum time: 4.965 ms (98.02% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark [ ftest($hybridpts[:,i]...) for i in 1:size($hybridpts,2) ]
BenchmarkTools.Trial:
memory estimate: 234.50 KiB
allocs estimate: 10004
--------------
minimum time: 58.600 μs (0.00% GC)
median time: 62.200 μs (0.00% GC)
mean time: 76.910 μs (10.66% GC)
maximum time: 3.062 ms (96.60% GC)
--------------
samples: 10000
evals/sample: 1
There are some rough edges at the moment. I managed to speed this up:
julia> f(u) = map(i -> ftest(u[:, i].data...), 1:size(u, 2))
f (generic function with 2 methods)
julia> @benchmark f($hybridpts)
BenchmarkTools.Trial:
memory estimate: 78.25 KiB
allocs estimate: 4
--------------
minimum time: 11.845 μs (0.00% GC)
median time: 12.793 μs (0.00% GC)
mean time: 14.871 μs (7.99% GC)
maximum time: 759.355 μs (93.22% GC)
--------------
samples: 10000
evals/sample: 1
map is in many cases easier for the compiler to optimize. You can also get fast tuple splatting by accessing the data field of a StaticArray (which is a tuple).
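For illustration (not from the thread above): the data field of an SVector is the underlying NTuple, so splatting it compiles down to plain tuple splatting:
julia> using StaticArrays
julia> v = SVector(1.0, 2.0);
julia> v.data
(1.0, 2.0)
julia> ftest(v.data...)
3.0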
So if I take my original evaluate! function, require pts to be a HybridMatrix, and add .data to the splat, I beat ALL of the single-threaded results above!
using Random, HybridArrays  # Random for seed!, HybridArrays for HybridMatrix

function evaluate5!(pts::HybridMatrix, fct)
    Random.seed!(100)
    Niter = 1_000
    Npt = size(pts, 2)  # number of points (this was a global in my original session)
    result = Matrix{Float64}(undef, Npt, Niter)
    @inbounds for j in 1:Niter
        pts[2,:] .= rand(Npt)*6 .- 3  # resample the second coordinate
        for p in axes(pts,2)
            result[p,j] = fct(pts[:,p].data...)  # splat the underlying tuple
        end
    end
    return result
end
Benchmarking gives:
julia> @benchmark evaluate5!(hybridpts,fct)
BenchmarkTools.Trial:
memory estimate: 229.03 MiB
allocs estimate: 4004
--------------
minimum time: 236.584 ms (2.57% GC)
median time: 262.411 ms (6.30% GC)
mean time: 272.907 ms (12.09% GC)
maximum time: 536.012 ms (53.52% GC)
--------------
samples: 19
evals/sample: 1
Multi-threading can shave off an additional 40 ms or so, a much smaller impact than for the tuple implementations above. So those are still a bit faster, but evaluate5! has two benefits: (1) it leaves multi-threading available upstream, where it would likely be more effective, and (2) its constructor is 2,000 times faster than ntuple, so “preparing” the original pts matrix for the call takes far less time.
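For reference, a minimal sketch of what the threaded variant could look like, assuming the inner loop over points is what gets parallelized (the exact placement of @threads in my runs may have differed, and evaluate5_threaded! is just an illustrative name):

using Random, HybridArrays, Base.Threads

function evaluate5_threaded!(pts::HybridMatrix, fct)
    Random.seed!(100)
    Niter = 1_000
    Npt = size(pts, 2)
    result = Matrix{Float64}(undef, Npt, Niter)
    for j in 1:Niter
        pts[2,:] .= rand(Npt)*6 .- 3
        @threads for p in axes(pts, 2)  # points are independent, so each thread writes its own slots
            @inbounds result[p,j] = fct(pts[:,p].data...)
        end
    end
    return result
end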
No, as far as I’m aware, parfor uses multi-processing, while @threads uses shared-memory threads; they are different things. In MATLAB you have no way of explicitly using threads; you can only reap the benefit of functions that have been implemented with multithreading in, e.g., C++ or the like.
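To make the distinction concrete in Julia terms, a minimal sketch (illustrative only):

using Distributed, Base.Threads

# Shared-memory threading: all iterations see the same arrays, nothing is copied.
out = Vector{Float64}(undef, 1_000)
@threads for i in eachindex(out)
    out[i] = sin(i)  # each iteration writes its own slot
end

# Multi-processing (conceptually closer to MATLAB's parfor): separate worker
# processes with separate memory; inputs and results are serialized between them.
addprocs(2)
squares = pmap(x -> x^2, 1:10)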
With multi-threading on (4 threads = 4 cores): 260 ms
Without: 456 ms
Since my MATLAB code had some other operations (sampling a vector of random numbers) that we know are multi-threaded by default, I also timed a loop around just the interpolant query and got the same 1.5x-2.0x speedup. So yes, griddedInterpolant does seem to be multi-threaded, as @Elrod suspected.