I managed to put together a minimal example that shows the issue. The source code for the example is here. The InterpolationKernels package is not needed to run the tests; you just have to install MayOptimize, which is a convenient way to try different loop optimization settings with the exact same code.
The following results (on a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz) show that:
- With Julia-1.5.4 and Julia-1.6.0, when calling the spline function in a loop (the first 2 blocks of tests), the function is short enough that it does not need to be explicitly inlined (the timings are the same for spline and inlined_spline). Timings are much shorter when loop vectorization is triggered (by @inbounds and by @inbounds @simd): by a factor of ~4.8 in single precision and ~2.5 in double precision, which is consistent with the fact that an AVX register holds twice as many Float32 values as Float64 values.
- With Julia-1.5.4, it is possible to achieve amazing loop vectorization when computing the weights (which amounts to calling a short function that yields a 4-tuple of values), provided the function that calls compute_weights in a loop is itself inlined and uses @inbounds or @inbounds @simd for its loop. I estimate that the best timings correspond to 21.5 Gflops in single precision and 17.9 Gflops in double precision, which is really great.
- With Julia-1.6.0, it is not possible to reach this computing power, whatever the combination of the @inline, @inbounds and @simd macros. The best timings correspond to 7.7 Gflops (in both single and double precision), which is rather disappointing compared with those obtained with Julia-1.5.4.
As you suggested, I checked with @code_warntype that the compiler (in both versions of Julia) correctly infers that the functions spline and compute_weights(spline, ...) respectively return a float T and a 4-tuple of floats NTuple{4,T}, in both single precision (T=Float32) and double precision (T=Float64).
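The same inference check can also be scripted instead of read off the @code_warntype output. The sketch below is only an illustration: the definitions of spline and compute_weights are hypothetical stand-ins (a cubic B-spline kernel and its four weights), not the code from the linked example.

```julia
using Test  # provides @inferred

# Hypothetical stand-in for `spline`: cubic B-spline kernel with support [-2, 2].
function spline(x::T) where {T<:AbstractFloat}
    a = abs(x)
    if a < 1
        return (a/2 - 1)*a*a + T(2)/3
    elseif a < 2
        u = 2 - a
        return u*u*u/6
    else
        return zero(T)
    end
end

# Hypothetical stand-in: the four interpolation weights for an offset t in [0, 1].
compute_weights(f, t::AbstractFloat) = (f(t + 1), f(t), f(t - 1), f(t - 2))

for T in (Float32, Float64)
    # Ask the compiler which return types it infers for these argument types.
    @assert Base.return_types(spline, (T,)) == [T]
    @assert Base.return_types(compute_weights, (typeof(spline), T)) == [NTuple{4,T}]
    # @inferred throws if the inferred type differs from the actual return type.
    @inferred compute_weights(spline, T(0.25))
end
```

A check like this only confirms type stability; it says nothing about whether the loop around the call is vectorized.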
My feeling is that, in Julia-1.6.0, loop vectorization has some issues with functions that return tuples of values. I hope this not-so-short example helps to show that…
P.S. I tried to explicitly retrieve the 4 weights with w1,w2,w3,w4 = compute_weights(spline,x), but this does not help with Julia-1.6.0 and (of course) makes no difference with Julia-1.5.4.
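For readers without the linked source at hand, the kind of loop discussed here might look like the following sketch. Everything in it is an assumption made for illustration: the stand-in definitions of spline and compute_weights, the function name weights!, and the choice of storing the weights in four separate vectors W1 to W4. The actual test code may be organized differently.

```julia
# Hypothetical stand-in for `spline`: cubic B-spline kernel, marked @inline.
@inline function spline(x::T) where {T<:AbstractFloat}
    a = abs(x)
    if a < 1
        return (a/2 - 1)*a*a + T(2)/3
    elseif a < 2
        u = 2 - a
        return u*u*u/6
    else
        return zero(T)
    end
end

# Hypothetical stand-in: short function returning a 4-tuple of weights.
@inline compute_weights(f, t::AbstractFloat) = (f(t + 1), f(t), f(t - 1), f(t - 2))

# Loop with explicit destructuring of the 4-tuple, as tried in the P.S.
@inline function weights!(W1, W2, W3, W4, x)
    @inbounds @simd for i in eachindex(x, W1, W2, W3, W4)
        w1, w2, w3, w4 = compute_weights(spline, x[i])
        W1[i] = w1
        W2[i] = w2
        W3[i] = w3
        W4[i] = w4
    end
    return nothing
end

x = rand(Float32, 1000)                        # offsets in [0, 1)
W1, W2, W3, W4 = ntuple(_ -> similar(x), 4)    # four output vectors
weights!(W1, W2, W3, W4, x)
```

One way to see whether such a loop was actually vectorized is to inspect the output of @code_llvm for the call and look for vector types such as <8 x float>.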
Results for Julia-1.5.4
Tests with Julia-1.5.4, T=Float32, n=1000
├─ Call spline (1000 times):
│  ├─ Debug: 864.644 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 181.772 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 179.852 ns (0 allocations: 0 bytes)
├─ Call inlined_spline (1000 times):
│  ├─ Debug: 864.383 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 181.745 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 180.986 ns (0 allocations: 0 bytes)
├─ Computation of weights with spline (1000 times):
│  ├─ Debug: 1.619 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.531 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.530 μs (0 allocations: 0 bytes)
├─ Computation of weights with inlined_spline (1000 times):
│  ├─ Debug: 1.616 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.585 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.587 μs (0 allocations: 0 bytes)
├─ Inlined computation of weights with spline (1000 times):
│  ├─ Debug: 1.296 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 448.538 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 463.217 ns (0 allocations: 0 bytes)
└─ Inlined computation of weights with inlined_spline (1000 times):
   ├─ Debug: 1.296 μs (0 allocations: 0 bytes)
   ├─ InBounds: 464.584 ns (0 allocations: 0 bytes)
   └─ Vectorize: 463.462 ns (0 allocations: 0 bytes)
Tests with Julia-1.5.4, T=Float64, n=1000
├─ Call spline (1000 times):
│  ├─ Debug: 864.390 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 354.118 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 353.341 ns (0 allocations: 0 bytes)
├─ Call inlined_spline (1000 times):
│  ├─ Debug: 864.254 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 341.552 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 353.995 ns (0 allocations: 0 bytes)
├─ Computation of weights with spline (1000 times):
│  ├─ Debug: 1.631 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.547 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.548 μs (0 allocations: 0 bytes)
├─ Computation of weights with inlined_spline (1000 times):
│  ├─ Debug: 1.702 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.605 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.621 μs (0 allocations: 0 bytes)
├─ Inlined computation of weights with spline (1000 times):
│  ├─ Debug: 1.311 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 564.941 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 560.113 ns (0 allocations: 0 bytes)
└─ Inlined computation of weights with inlined_spline (1000 times):
   ├─ Debug: 1.311 μs (0 allocations: 0 bytes)
   ├─ InBounds: 565.086 ns (0 allocations: 0 bytes)
   └─ Vectorize: 560.476 ns (0 allocations: 0 bytes)
Results for Julia-1.6.0
Tests with Julia-1.6.0, T=Float32, n=1000
├─ Call spline (1000 times):
│  ├─ Debug: 864.288 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 182.463 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 182.490 ns (0 allocations: 0 bytes)
├─ Call inlined_spline (1000 times):
│  ├─ Debug: 864.458 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 177.442 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 182.168 ns (0 allocations: 0 bytes)
├─ Computation of weights with spline (1000 times):
│  ├─ Debug: 1.488 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.301 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.300 μs (0 allocations: 0 bytes)
├─ Computation of weights with inlined_spline (1000 times):
│  ├─ Debug: 1.573 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.523 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.523 μs (0 allocations: 0 bytes)
├─ Inlined computation of weights with spline (1000 times):
│  ├─ Debug: 1.296 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.308 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.292 μs (0 allocations: 0 bytes)
└─ Inlined computation of weights with inlined_spline (1000 times):
   ├─ Debug: 1.296 μs (0 allocations: 0 bytes)
   ├─ InBounds: 1.296 μs (0 allocations: 0 bytes)
   └─ Vectorize: 1.299 μs (0 allocations: 0 bytes)
Tests with Julia-1.6.0, T=Float64, n=1000
├─ Call spline (1000 times):
│  ├─ Debug: 864.831 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 355.095 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 354.104 ns (0 allocations: 0 bytes)
├─ Call inlined_spline (1000 times):
│  ├─ Debug: 863.033 ns (0 allocations: 0 bytes)
│  ├─ InBounds: 354.991 ns (0 allocations: 0 bytes)
│  └─ Vectorize: 354.754 ns (0 allocations: 0 bytes)
├─ Computation of weights with spline (1000 times):
│  ├─ Debug: 1.505 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.315 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.316 μs (0 allocations: 0 bytes)
├─ Computation of weights with inlined_spline (1000 times):
│  ├─ Debug: 1.588 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.504 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.504 μs (0 allocations: 0 bytes)
├─ Inlined computation of weights with spline (1000 times):
│  ├─ Debug: 1.279 μs (0 allocations: 0 bytes)
│  ├─ InBounds: 1.308 μs (0 allocations: 0 bytes)
│  └─ Vectorize: 1.313 μs (0 allocations: 0 bytes)
└─ Inlined computation of weights with inlined_spline (1000 times):
   ├─ Debug: 1.312 μs (0 allocations: 0 bytes)
   ├─ InBounds: 1.325 μs (0 allocations: 0 bytes)
   └─ Vectorize: 1.309 μs (0 allocations: 0 bytes)