Performance drop with Julia 1.6.0 for InterpolationKernels

I was happy to switch from Julia 1.5.4 to Julia 1.6.0, but I experienced significant performance drops for InterpolationKernels, by up to a factor of 3.6…

Below are tables summarizing my tests with BenchmarkTools. The columns give the minimum times in nanoseconds; all functions should be inlined and are called on a vector of 1000 elements in a vectorized loop (@inbounds @simd).
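
For reference, the timing pattern is essentially the following minimal sketch (apply!, ker, src and dst are hypothetical names, not the actual benchmark code, and the kernel construction may differ from what the package actually provides):

using BenchmarkTools
using InterpolationKernels

ker = CatmullRomSpline{Float32}()   # hypothetical construction; any kernel instance
src = rand(Float32, 1000)           # 1000-element input vector
dst = similar(src)                  # preallocated output

# Call the kernel as a simple function on each element, in a vectorized loop.
function apply!(dst, ker, src)
    @inbounds @simd for i in eachindex(dst, src)
        dst[i] = ker(src[i])
    end
    return dst
end

@btime apply!($dst, $ker, $src)     # the tables report the minimum time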

I ran the benchmarks on two different Linux machines, with more or less the same conclusions. The CPUs were:

  • i7: Intel Core i7-5500U @ 2.40GHz (laptop)
  • i9: Intel Core i9-9900KF @ 3.60GHz (workstation)

I first benchmarked calling the interpolation kernels as simple functions. The results are very similar for the two versions of Julia, so I see no issue there.

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                     96          101           49           51
BSpline{1,Float64}                    185          189           99           95
BSpline{2,Float32}                    110          104           58           51
BSpline{2,Float64}                    186          188           89           91
BSpline{3,Float32}                    235          236          121          144
BSpline{3,Float64}                    459          453          231          228
BSpline{4,Float32}                    330          321          186          168
BSpline{4,Float64}                    622          616          364          331
CardinalCubicSpline{Float32}          306          307          179          175
CardinalCubicSpline{Float64}          594          595          349          341
CatmullRomSpline{Float32}             322          326          195          196
CatmullRomSpline{Float64}             628          627          382          382
CubicSpline{Float32}                  331          330          191          185
CubicSpline{Float32}                  334          330          190          184
CubicSpline{Float64}                  647          642          373          371
CubicSpline{Float64}                  647          645          372          373

Now benchmarking InterpolationKernels.compute_weights (which computes several interpolation weights at the same time) yields:

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                     48           49           29           29
BSpline{1,Float64}                     94           91           55           53
BSpline{2,Float32}                    244          702          137          409
BSpline{2,Float64}                    356          703          208          410
BSpline{3,Float32}                    490         1362          299          803
BSpline{3,Float64}                    903         1400          528          814
BSpline{4,Float32}                    752         2544          448         1615
BSpline{4,Float64}                   1147         2709          593         1629
CardinalCubicSpline{Float32}          751         2571          449         1222
CardinalCubicSpline{Float64}         1111         2580          546         1235
CatmullRomSpline{Float32}             772         2343          466         1295
CatmullRomSpline{Float64}            1133         2354          561         1312
CubicSpline{Float32}                  844         2607          502         1521
CubicSpline{Float32}                  852         2606          507         1522
CubicSpline{Float64}                 1148         2627          606         1531
CubicSpline{Float64}                 1170         2625          608         1530

Hence, except for BSpline{1,T}, the code takes from more than 2 up to 3.6 times longer to execute with Julia 1.6.0 than with 1.5.4. This is probably due to ineffective loop vectorization or inlining of functions (in spite of the @inline macro).

Benchmarking InterpolationKernels.compute_offset_and_weights (which calls InterpolationKernels.compute_weights) confirms the issue:

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                    191         1016          106          412
BSpline{1,Float64}                    324         1016          178          459
BSpline{2,Float32}                    473         1684          270          681
BSpline{2,Float64}                    849         1689          504          693
BSpline{3,Float32}                    878         2804          541         1503
BSpline{3,Float64}                   1147         2819          644         1555
BSpline{4,Float32}                   1448         4009          882         2495
BSpline{4,Float64}                   1977         4026         1238         2572
CardinalCubicSpline{Float32}         1440         3376          830         2014
CardinalCubicSpline{Float64}         1837         3396         1176         2030
CatmullRomSpline{Float32}            1446         3684          849         2067
CatmullRomSpline{Float64}            1876         3696         1199         2085
CubicSpline{Float32}                 1438         4059          853         2395
CubicSpline{Float32}                 1460         4054          857         2393
CubicSpline{Float64}                 1955         4071         1220         2400
CubicSpline{Float64}                 1955         4076         1223         2406

If someone could explain to me what I have done wrong, I would certainly learn a lot!

BTW the benchmarking code is in the test directory of the InterpolationKernels package (InterpolationKernels.jl/benchmarks.jl at master · emmt/InterpolationKernels.jl · GitHub).


I am not sure you did anything wrong, but a lot of compilation heuristics changed between 1.5 and 1.6. Sometimes this means that you can get the same result with much simpler code, but occasionally it means that if your code relied on the compiler trying inference very hard it will give up sooner.

In any case, the first thing I would do is investigate with the standard tools (@code_warntype, BenchmarkTools.@btime), narrow the problem down to specific parts with dedicated packages, and then look at the generated code. If you narrow the problem down to something specific, you are more likely to get a suggestion about it here.
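
For example, something along these lines (the kernel construction and the argument are placeholders to adapt to your case):

using BenchmarkTools
using InterpolationKernels

ker = CatmullRomSpline{Float64}()   # placeholder: any kernel instance will do
x = 0.3

# Is the return type inferred as a concrete type (e.g. an NTuple of floats)?
@code_warntype InterpolationKernels.compute_weights(ker, x)

# Time a single call, interpolating the arguments to avoid global-variable overhead:
@btime InterpolationKernels.compute_weights($ker, $x)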


Thanks for your advice. I forgot to mention that all tests (I was using the @benchmark macro) reported that no additional memory was allocated (with both Julia 1.5.4 and 1.6.0). My current understanding is that the problem may be that the offending functions return tuples of values (of type and size known in principle). I agree with you that I should reduce my example to a simpler piece of code that exhibits the same issue.

I managed to produce a minimal example that shows the issue. The source code for the example is here. The InterpolationKernels package is not needed to run the tests; you just have to install MayOptimize, which is a convenient way to try different loop optimization settings with the exact same code.

The following results (on a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz) show that:

  • With Julia-1.5.4 and Julia-1.6.0, when calling the spline function in a loop (the first two blocks of tests), the function is short enough that it does not need to be inlined (the timings are the same for spline and inlined_spline). Timings are much shorter when loop vectorization is triggered (by @inbounds and by @inbounds @simd), by a factor of ~4.8 in single precision and ~2.5 in double precision, which is consistent with the fact that twice as many Float32 values as Float64 values fit in AVX registers.

  • With Julia-1.5.4, it is possible to achieve amazing loop vectorization when computing the weights (which amounts to calling a short function that yields a 4-tuple of values), provided that the function which calls compute_weights in a loop is itself inlined and uses @inbounds or @inbounds @simd for its loop. I estimate that the best timings correspond to 21.5 Gflops in single precision and 17.9 Gflops in double precision (see the short calculation after this list), which is really great.

  • With Julia-1.6.0, it is not possible to achieve this computing power, whatever the combination of @inline, @inbounds and @simd macros. The best timings correspond to 7.7 Gflops (in both single and double precision), which is rather disappointing compared to those obtained with Julia-1.5.4.
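
For reference, those Gflops figures can be reproduced assuming roughly 10 floating-point operations per call to compute_weights (my own rough estimate; the exact operation count depends on the kernel): 1000 calls / 463 ns × 10 flops/call ≈ 21.6 Gflop/s in single precision, and 1000 calls / 560 ns × 10 flops/call ≈ 17.9 Gflop/s in double precision.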

As you suggested, I checked with @code_warntype that the compiler (with both versions of Julia) correctly infers that the functions spline and compute_weights(spline, ...) respectively return a float T and a 4-tuple of floats NTuple{4,T}, both in single precision (T=Float32) and double precision (T=Float64).

My feeling is that, in Julia 1.6.0, loop vectorization has some issues with functions that return tuples of values. I hope this not-so-short example helps to show that…

P.S. I tried to explicitly retrieve the 4 weights with w1, w2, w3, w4 = compute_weights(spline, x), but this does not help with Julia 1.6.0 and (of course) makes no difference with Julia 1.5.4.
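
To make the pattern concrete, the offending kind of loop looks roughly like the following simplified sketch of my compute_weights! function (hypothetical names and layout, not the actual benchmark code):

using InterpolationKernels: compute_weights

# Store the 4 weights computed for each x[i] into the columns of a
# preallocated 4×n matrix wgt.
@inline function compute_weights!(wgt::AbstractMatrix{T}, ker,
                                  x::AbstractVector{T}) where {T}
    @inbounds @simd for i in eachindex(x)
        w1, w2, w3, w4 = compute_weights(ker, x[i])
        wgt[1,i] = w1
        wgt[2,i] = w2
        wgt[3,i] = w3
        wgt[4,i] = w4
    end
    return wgt
end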

Results for Julia-1.5.4

Tests with Julia-1.5.4, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.644 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   181.772 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  179.852 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.383 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   181.745 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  180.986 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.619 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.531 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.530 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.616 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.585 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.587 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   448.538 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  463.217 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
     ├─ InBounds:   464.584 ns (0 allocations: 0 bytes)
     └─ Vectorize:  463.462 ns (0 allocations: 0 bytes)

Tests with Julia-1.5.4, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.390 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   354.118 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  353.341 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.254 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   341.552 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  353.995 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.631 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.547 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.548 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.702 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.605 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.621 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.311 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   564.941 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  560.113 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.311 μs (0 allocations: 0 bytes)
     ├─ InBounds:   565.086 ns (0 allocations: 0 bytes)
     └─ Vectorize:  560.476 ns (0 allocations: 0 bytes)

Results for Julia-1.6.0

Tests with Julia-1.6.0, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.288 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   182.463 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  182.490 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.458 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   177.442 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  182.168 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.488 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.301 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.300 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.573 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.523 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.523 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.308 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.292 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
     ├─ InBounds:   1.296 μs (0 allocations: 0 bytes)
     └─ Vectorize:  1.299 μs (0 allocations: 0 bytes)

Tests with Julia-1.6.0, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.831 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   355.095 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  354.104 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      863.033 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   354.991 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  354.754 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.505 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.315 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.316 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.588 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.504 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.504 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.279 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.308 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.313 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.312 μs (0 allocations: 0 bytes)
     ├─ InBounds:   1.325 μs (0 allocations: 0 bytes)
     └─ Vectorize:  1.309 μs (0 allocations: 0 bytes)

It would be great if you could make an MWE without MayOptimize, to isolate whether this is an issue with 1.6 or that package.

Sure, the code without MayOptimize is here. The results show exactly the same behavior as before.
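
To give an idea of what the standalone test does without following the link, here is a rough reconstruction (Catmull-Rom-like weights; this is a sketch in the same spirit with made-up names, not the linked file):

using BenchmarkTools

# A short function returning an NTuple{4,T}: cubic interpolation weights
# for a fractional offset t ∈ [0,1).
@inline function cubic_weights(t::T) where {T<:AbstractFloat}
    s = one(T) - t
    q = T(-1/2)*s*t
    w1 = q*s              # -t*(1 - t)^2/2
    w4 = q*t              # -t^2*(1 - t)/2
    r = w4 - w1
    w2 = s - w1 + r
    w3 = t - w4 - r
    return (w1, w2, w3, w4)
end

# Compute the weights for every element of x, storing them in a 4×n matrix.
@inline function cubic_weights!(wgt::AbstractMatrix{T}, x::AbstractVector{T}) where {T}
    @inbounds @simd for i in eachindex(x)
        w1, w2, w3, w4 = cubic_weights(x[i])
        wgt[1,i] = w1; wgt[2,i] = w2; wgt[3,i] = w3; wgt[4,i] = w4
    end
    return wgt
end

T = Float32
x = rand(T, 1000)
wgt = Array{T}(undef, 4, 1000)
@btime cubic_weights!($wgt, $x)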

Julia 1.5.4

On a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz, the command:

julia-1.5 -O3 inline-issue-solo.jl

yields:

Tests with Julia-1.5.4, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.350 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  175.171 ns (0 allocations: 0 bytes)
 │   └─ simd:      180.769 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.267 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  180.689 ns (0 allocations: 0 bytes)
 │   └─ simd:      181.230 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.613 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.531 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.531 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.646 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.583 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.594 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.297 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  464.107 ns (0 allocations: 0 bytes)
 │   └─ simd:      463.629 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.297 μs (0 allocations: 0 bytes)
     ├─ inbounds:  464.315 ns (0 allocations: 0 bytes)
     └─ simd:      463.350 ns (0 allocations: 0 bytes)

Tests with Julia-1.5.4, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.100 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  351.967 ns (0 allocations: 0 bytes)
 │   └─ simd:      350.812 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    862.933 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  351.869 ns (0 allocations: 0 bytes)
 │   └─ simd:      351.042 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.632 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.545 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.546 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.689 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.597 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.598 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.311 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  564.568 ns (0 allocations: 0 bytes)
 │   └─ simd:      561.427 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.312 μs (0 allocations: 0 bytes)
     ├─ inbounds:  566.147 ns (0 allocations: 0 bytes)
     └─ simd:      563.843 ns (0 allocations: 0 bytes)

Julia 1.6.0

On a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz, the command:

julia-1.6 -O3 inline-issue-solo.jl

yields:

Tests with Julia-1.6.0, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.300 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  180.448 ns (0 allocations: 0 bytes)
 │   └─ simd:      181.093 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.400 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  175.745 ns (0 allocations: 0 bytes)
 │   └─ simd:      180.984 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.487 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.298 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.298 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.570 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.520 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.521 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.296 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.297 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.296 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.297 μs (0 allocations: 0 bytes)
     ├─ inbounds:  1.298 μs (0 allocations: 0 bytes)
     └─ simd:      1.296 μs (0 allocations: 0 bytes)

Tests with Julia-1.6.0, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.083 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  352.774 ns (0 allocations: 0 bytes)
 │   └─ simd:      352.255 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.617 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  352.670 ns (0 allocations: 0 bytes)
 │   └─ simd:      352.368 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.493 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.309 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.310 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.580 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.503 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.456 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.309 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.308 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.306 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.308 μs (0 allocations: 0 bytes)
     ├─ inbounds:  1.308 μs (0 allocations: 0 bytes)
     └─ simd:      1.306 μs (0 allocations: 0 bytes)

I’m also seeing a performance drop of around 30% running the same benchmark code on Julia 1.6 compared to Julia 1.5. I haven’t managed to narrow it down yet, but it does seem to involve @inline and @simd blocks.

What kind of machine (OS/CPU) are you using? On all the Linux machines I have been able to run the tests on (with relatively recent processors implementing AVX2 instructions), the performance drop is a factor of 3 or more, not 30%.

In the meantime, I have also checked that the issue is not related to the stride of the elements in the destination array: after permuting its dimensions, the results are the same.

The “brute force” approach, which can be tedious but also very effective, is to do a “git bisect” on Julia to try to find the exact commit that introduced the regression. Basically, you give git two commits, one good and one bad, and then you bisect the range iteratively until you find the offending commit (How to discover a bug using git bisect).

It is also possible to automate this (Fully automated bisecting with "git bisect run" [LWN.net]), but automating based on a performance benchmark is always a bit hard. Instead, one can sometimes look at e.g. @code_llvm for the presence of vectorized (SIMD) intrinsics and use that as a marker for success/failure.
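
For instance (a rough sketch with a made-up toy function; the exact vector width to look for in the IR depends on the CPU):

using InteractiveUtils   # provides @code_llvm and code_llvm outside the REPL

# A toy loop whose LLVM IR should contain packed vector instructions
# (e.g. fadd <4 x double> or <8 x float>) when SIMD kicks in.
function axpb!(dst, src)
    @inbounds @simd for i in eachindex(dst, src)
        dst[i] = 2 * src[i] + 1
    end
    return dst
end

a = rand(Float64, 1000); b = similar(a)
@code_llvm debuginfo=:none axpb!(b, a)

# For automated bisecting, capture the IR as a string and search it:
buf = IOBuffer()
code_llvm(buf, axpb!, Tuple{Vector{Float64},Vector{Float64}}; debuginfo=:none)
vectorized = occursin("<4 x double>", String(take!(buf)))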


This is on an old Mac (1.6 GHz Intel Core i5). The benchmark I’ve run is very high-level, so a 300% slowdown in one loop might cause an overall drop of 30% in my function. I am profiling the benchmark now to narrow it down, and I was mostly joining in on this thread to get some clues as to what might be going on.

Excellent suggestion. I like this git bisect strategy! I guess that I have to clone the Julia source and build the different versions corresponding to the commits chosen by git bisect. This will take a while, but it is worth it…

I’ve tracked it down. Indeed, I’m seeing a 3x slowdown for this “trivialized” loop function.

using BenchmarkTools

# In-place element-wise addition, iterating with CartesianIndices.
function increment!(p::AbstractArray, a::AbstractArray)
    for I ∈ CartesianIndices(p)
        @inbounds p[I] += a[I]
    end
end

n = 32
@benchmark increment!(a,b) setup=(a=rand(n,n,n);b=rand(n,n,n))

This gives a 3x slowdown from Julia 1.5 to 1.6. The amount of slowdown does depend on n: the slowdown is only 1.7x when n=64 and is 4.3x when n=16. Using 2D arrays with the same total memory shows a similar slowdown, but using 1D arrays makes the slowdown vanish.


Great! I admit your function is likely to be much more common than my rather intricate example.

I am currently running git bisect and that’s really a great tool. It takes a while because building Julia is not a small task (and, from where I started, 10 trials are needed). But I have already found that versions in the range Julia-1.6.0-DEV.1157 to Julia-1.6.0-DEV.1585 were more efficient than Julia 1.5.4: they can effectively vectorize the loops inside my compute_weights! function without the need to make it explicitly inlined.


If you set JULIA_PRECOMPILE=0 in Make.user, it should build a bit faster (the REPL will be more laggy, though).

OK I’ll try it. Thanks!

Only 6 tries left :wink:

Bisecting: 53 revisions left to test after this (roughly 6 steps)

My guess is that it is https://github.com/JuliaLang/julia/pull/37829 but I haven’t verified.

And now I found performance regression on high-dimensional array iteration using CartesianIndices (no simd) · Issue #38073 · JuliaLang/julia · GitHub.

Reverting the PR seems to fix it. Sorry @emmt for making you bisect. If you want, you could finish it to confirm.


A workaround on 1.6 is to write the loop as:

function increment!(p::AbstractArray,a::AbstractArray)
    @inbounds @simd for I ∈ CartesianIndices(p)
        p[I] += a[I]
    end
end

That seems to restore SIMD-ability. It is similar to Fix performance regression in broadcasting with CartesianIndices by kimikage · Pull Request #39333 · JuliaLang/julia · GitHub.

No problem, I will see the git bisect process through to the end. I learned a lot of fancy things thanks to you and I really want to confirm this issue.

BTW, your suggestion:

really speeds things up!


A tool that automatically finds heavy performance regressions by git bisecting sounds like a great idea for a package, if anyone wants to help out.


Yes, I confirm that fixed it for Julia 1.6.

However… I have around 30 similar loops in my package. Before I go hunt them all down, can I confirm that this is the “right” way to use @inbounds @simd? It seems like I see them scattered around in different places in different code bases.

1 Like