Performance drop with Julia 1.6.0 for InterpolationKernels

I was happy to switch from Julia 1.5.4 to Julia 1.6.0, but I experienced significant performance drops for InterpolationKernels, by up to a factor of 3.6…

Below are tables summarizing my tests with BenchmarkTools. The columns give the minimum times in nanoseconds; all functions should be inlined and are called on a vector of 1000 elements in a vectorized loop (@inbounds @simd).
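
For reference, the timing pattern is essentially the following minimal sketch (apply!, ker, src and dst are hypothetical names, not the actual benchmark code, and the kernel construction may differ from what the package actually provides):

using BenchmarkTools
using InterpolationKernels

ker = CatmullRomSpline{Float32}()   # hypothetical construction; any kernel instance
src = rand(Float32, 1000)           # 1000-element input vector
dst = similar(src)                  # preallocated output

# Call the kernel as a simple function on each element, in a vectorized loop.
function apply!(dst, ker, src)
    @inbounds @simd for i in eachindex(dst, src)
        dst[i] = ker(src[i])
    end
    return dst
end

@btime apply!($dst, $ker, $src)     # the tables report the minimum time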

I ran the benchmarks on two different Linux machines, with more or less the same conclusions. The CPUs were:

  • i7: Intel Core i7-5500U @ 2.40GHz (laptop)
  • i9: Intel Core i9-9900KF @ 3.60GHz (workstation)

I first benchmarked calling the interpolation kernels as simple functions. The results are very similar for the two versions of Julia, so I see no issue there.

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                     96          101           49           51
BSpline{1,Float64}                    185          189           99           95
BSpline{2,Float32}                    110          104           58           51
BSpline{2,Float64}                    186          188           89           91
BSpline{3,Float32}                    235          236          121          144
BSpline{3,Float64}                    459          453          231          228
BSpline{4,Float32}                    330          321          186          168
BSpline{4,Float64}                    622          616          364          331
CardinalCubicSpline{Float32}          306          307          179          175
CardinalCubicSpline{Float64}          594          595          349          341
CatmullRomSpline{Float32}             322          326          195          196
CatmullRomSpline{Float64}             628          627          382          382
CubicSpline{Float32}                  331          330          191          185
CubicSpline{Float32}                  334          330          190          184
CubicSpline{Float64}                  647          642          373          371
CubicSpline{Float64}                  647          645          372          373

Now benchmarking InterpolationKernels.compute_weights (which computes several interpolation weights at the same time) yields:

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                     48           49           29           29
BSpline{1,Float64}                     94           91           55           53
BSpline{2,Float32}                    244          702          137          409
BSpline{2,Float64}                    356          703          208          410
BSpline{3,Float32}                    490         1362          299          803
BSpline{3,Float64}                    903         1400          528          814
BSpline{4,Float32}                    752         2544          448         1615
BSpline{4,Float64}                   1147         2709          593         1629
CardinalCubicSpline{Float32}          751         2571          449         1222
CardinalCubicSpline{Float64}         1111         2580          546         1235
CatmullRomSpline{Float32}             772         2343          466         1295
CatmullRomSpline{Float64}            1133         2354          561         1312
CubicSpline{Float32}                  844         2607          502         1521
CubicSpline{Float32}                  852         2606          507         1522
CubicSpline{Float64}                 1148         2627          606         1531
CubicSpline{Float64}                 1170         2625          608         1530

Hence, except for BSpline{1,T}, the code takes from more than 2 up to 3.6 times longer to execute with Julia 1.6.0 than with 1.5.4. This is probably due to ineffective loop vectorization or inlining of functions (in spite of the @inline macro).

Benchmarking InterpolationKernels.compute_offset_and_weights (which calls InterpolationKernels.compute_weights) confirms the issue:

Kernel                         i7 / 1.5.4   i7 / 1.6.0   i9 / 1.5.4   i9 / 1.6.0
BSpline{1,Float32}                    191         1016          106          412
BSpline{1,Float64}                    324         1016          178          459
BSpline{2,Float32}                    473         1684          270          681
BSpline{2,Float64}                    849         1689          504          693
BSpline{3,Float32}                    878         2804          541         1503
BSpline{3,Float64}                   1147         2819          644         1555
BSpline{4,Float32}                   1448         4009          882         2495
BSpline{4,Float64}                   1977         4026         1238         2572
CardinalCubicSpline{Float32}         1440         3376          830         2014
CardinalCubicSpline{Float64}         1837         3396         1176         2030
CatmullRomSpline{Float32}            1446         3684          849         2067
CatmullRomSpline{Float64}            1876         3696         1199         2085
CubicSpline{Float32}                 1438         4059          853         2395
CubicSpline{Float32}                 1460         4054          857         2393
CubicSpline{Float64}                 1955         4071         1220         2400
CubicSpline{Float64}                 1955         4076         1223         2406

If someone could explain to me what I have done wrong, I would certainly learn a lot!

BTW the benchmarking code is in the test directory of the InterpolationKernels package (InterpolationKernels.jl/benchmarks.jl at master · emmt/InterpolationKernels.jl · GitHub).


I am not sure you did anything wrong, but a lot of compilation heuristics changed between 1.5 and 1.6. Sometimes this means that you can get the same result with much simpler code, but occasionally it means that if your code relied on the compiler trying inference very hard it will give up sooner.

In any case, the first thing I would do is investigate with the standard tools (@code_warntype, BenchmarkTools.@btime), narrow the problem down to specific parts with dedicated packages, and then look at the generated code. If you narrow the problem down to something specific, you are more likely to get a suggestion about it here.
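
For example, something along these lines (the kernel construction and the argument are placeholders to adapt to your case):

using BenchmarkTools
using InterpolationKernels

ker = CatmullRomSpline{Float64}()   # placeholder: any kernel instance will do
x = 0.3

# Is the return type inferred as a concrete type (e.g. an NTuple of floats)?
@code_warntype InterpolationKernels.compute_weights(ker, x)

# Time a single call, interpolating the arguments to avoid global-variable overhead:
@btime InterpolationKernels.compute_weights($ker, $x)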


Thanks for your advice. I forgot to mention that all tests (I was using the @benchmark macro) reported that no additional memory was allocated (with both Julia 1.5.4 and 1.6.0). My current understanding is that the problem may be that the offending functions return tuples of values (of type and size known in principle). I agree with you that I should reduce my example to a simpler piece of code that exhibits the same issue.

I managed to produce a minimal example that shows the issue. The source code for the example is here. The InterpolationKernels package is not needed to run the tests; you just have to install MayOptimize, which is a convenient way to try different loop optimization settings with the exact same code.

The following results (on a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz) show that:

  • With Julia-1.5.4 and Julia-1.6.0, when calling the spline function in a loop (the first two blocks of tests), the function is short enough that it does not need to be inlined (the timings are the same for spline and inlined_spline). Timings are much shorter when loop vectorization is triggered (by @inbounds and by @inbounds @simd), by a factor of ~4.8 in single precision and ~2.5 in double precision, which is consistent with the fact that twice as many Float32 values as Float64 values fit in AVX registers.

  • With Julia-1.5.4, it is possible to achieve amazing loop vectorization when computing the weights (which amounts to calling a short function that yields a 4-tuple of values), provided that the function which calls compute_weights in a loop is itself inlined and uses @inbounds or @inbounds @simd for its loop. I estimate that the best timings correspond to 21.5 Gflops in single precision and 17.9 Gflops in double precision (see the short calculation after this list), which is really great.

  • With Julia-1.6.0, it is not possible to achieve this computing power, whatever the combination of @inline, @inbounds and @simd macros. The best timings correspond to 7.7 Gflops (in both single and double precision), which is rather disappointing compared to those obtained with Julia-1.5.4.
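
For reference, those Gflops figures can be reproduced assuming roughly 10 floating-point operations per call to compute_weights (my own rough estimate; the exact operation count depends on the kernel): 1000 calls / 463 ns × 10 flops/call ≈ 21.6 Gflop/s in single precision, and 1000 calls / 560 ns × 10 flops/call ≈ 17.9 Gflop/s in double precision.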

As you suggested, I checked with @code_warntype that the compiler (with both versions of Julia) correctly infers that the functions spline and compute_weights(spline, ...) respectively return a float T and a 4-tuple of floats NTuple{4,T}, both in single precision (T=Float32) and double precision (T=Float64).

My feeling is that, in Julia 1.6.0, loop vectorization has some issues with functions that return tuples of values. I hope this not-so-short example helps to show that…

P.S. I tried to explicitly retrieve the 4 weights with w1, w2, w3, w4 = compute_weights(spline, x), but this does not help with Julia 1.6.0 and (of course) makes no difference with Julia 1.5.4.
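
To make the pattern concrete, the offending kind of loop looks roughly like the following simplified sketch of my compute_weights! function (hypothetical names and layout, not the actual benchmark code):

using InterpolationKernels: compute_weights

# Store the 4 weights computed for each x[i] into the columns of a
# preallocated 4×n matrix wgt.
@inline function compute_weights!(wgt::AbstractMatrix{T}, ker,
                                  x::AbstractVector{T}) where {T}
    @inbounds @simd for i in eachindex(x)
        w1, w2, w3, w4 = compute_weights(ker, x[i])
        wgt[1,i] = w1
        wgt[2,i] = w2
        wgt[3,i] = w3
        wgt[4,i] = w4
    end
    return wgt
end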

Results for Julia-1.5.4

Tests with Julia-1.5.4, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.644 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   181.772 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  179.852 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.383 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   181.745 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  180.986 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.619 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.531 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.530 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.616 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.585 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.587 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   448.538 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  463.217 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
     ├─ InBounds:   464.584 ns (0 allocations: 0 bytes)
     └─ Vectorize:  463.462 ns (0 allocations: 0 bytes)

Tests with Julia-1.5.4, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.390 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   354.118 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  353.341 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.254 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   341.552 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  353.995 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.631 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.547 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.548 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.702 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.605 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.621 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.311 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   564.941 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  560.113 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.311 μs (0 allocations: 0 bytes)
     ├─ InBounds:   565.086 ns (0 allocations: 0 bytes)
     └─ Vectorize:  560.476 ns (0 allocations: 0 bytes)

Results for Julia-1.6.0

Tests with Julia-1.6.0, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.288 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   182.463 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  182.490 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      864.458 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   177.442 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  182.168 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.488 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.301 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.300 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.573 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.523 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.523 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.308 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.292 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.296 μs (0 allocations: 0 bytes)
     ├─ InBounds:   1.296 μs (0 allocations: 0 bytes)
     └─ Vectorize:  1.299 μs (0 allocations: 0 bytes)

Tests with Julia-1.6.0, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ Debug:      864.831 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   355.095 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  354.104 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ Debug:      863.033 ns (0 allocations: 0 bytes)
 │   ├─ InBounds:   354.991 ns (0 allocations: 0 bytes)
 │   └─ Vectorize:  354.754 ns (0 allocations: 0 bytes)
 ├─ Computation of weights with spline (1000 times):
 │   ├─ Debug:      1.505 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.315 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.316 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ Debug:      1.588 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.504 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.504 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ Debug:      1.279 μs (0 allocations: 0 bytes)
 │   ├─ InBounds:   1.308 μs (0 allocations: 0 bytes)
 │   └─ Vectorize:  1.313 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ Debug:      1.312 μs (0 allocations: 0 bytes)
     ├─ InBounds:   1.325 μs (0 allocations: 0 bytes)
     └─ Vectorize:  1.309 μs (0 allocations: 0 bytes)

It would be great if you could make an MWE without MayOptimize, to isolate whether this is an issue with 1.6 or that package.

Sure, the code without MayOptimize is here. The results show exactly the same behavior as before.
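
To give an idea of what the standalone test does without following the link, here is a rough reconstruction (Catmull-Rom-like weights; this is a sketch in the same spirit with made-up names, not the linked file):

using BenchmarkTools

# A short function returning an NTuple{4,T}: cubic interpolation weights
# for a fractional offset t ∈ [0,1).
@inline function cubic_weights(t::T) where {T<:AbstractFloat}
    s = one(T) - t
    q = T(-1/2)*s*t
    w1 = q*s              # -t*(1 - t)^2/2
    w4 = q*t              # -t^2*(1 - t)/2
    r = w4 - w1
    w2 = s - w1 + r
    w3 = t - w4 - r
    return (w1, w2, w3, w4)
end

# Compute the weights for every element of x, storing them in a 4×n matrix.
@inline function cubic_weights!(wgt::AbstractMatrix{T}, x::AbstractVector{T}) where {T}
    @inbounds @simd for i in eachindex(x)
        w1, w2, w3, w4 = cubic_weights(x[i])
        wgt[1,i] = w1; wgt[2,i] = w2; wgt[3,i] = w3; wgt[4,i] = w4
    end
    return wgt
end

T = Float32
x = rand(T, 1000)
wgt = Array{T}(undef, 4, 1000)
@btime cubic_weights!($wgt, $x)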

Julia 1.5.4

On a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz, the command:

julia-1.5 -O3 inline-issue-solo.jl

yields:

Tests with Julia-1.5.4, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.350 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  175.171 ns (0 allocations: 0 bytes)
 │   └─ simd:      180.769 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.267 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  180.689 ns (0 allocations: 0 bytes)
 │   └─ simd:      181.230 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.613 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.531 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.531 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.646 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.583 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.594 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.297 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  464.107 ns (0 allocations: 0 bytes)
 │   └─ simd:      463.629 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.297 μs (0 allocations: 0 bytes)
     ├─ inbounds:  464.315 ns (0 allocations: 0 bytes)
     └─ simd:      463.350 ns (0 allocations: 0 bytes)

Tests with Julia-1.5.4, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.100 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  351.967 ns (0 allocations: 0 bytes)
 │   └─ simd:      350.812 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    862.933 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  351.869 ns (0 allocations: 0 bytes)
 │   └─ simd:      351.042 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.632 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.545 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.546 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.689 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.597 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.598 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.311 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  564.568 ns (0 allocations: 0 bytes)
 │   └─ simd:      561.427 ns (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.312 μs (0 allocations: 0 bytes)
     ├─ inbounds:  566.147 ns (0 allocations: 0 bytes)
     └─ simd:      563.843 ns (0 allocations: 0 bytes)

Julia 1.6.0

On a Linux workstation with an Intel Core i9-9900KF @ 3.60GHz, the command:

julia-1.6 -O3 inline-issue-solo.jl

yields:

Tests with Julia-1.6.0, T=Float32, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.300 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  180.448 ns (0 allocations: 0 bytes)
 │   └─ simd:      181.093 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.400 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  175.745 ns (0 allocations: 0 bytes)
 │   └─ simd:      180.984 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.487 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.298 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.298 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.570 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.520 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.521 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.296 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.297 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.296 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.297 μs (0 allocations: 0 bytes)
     ├─ inbounds:  1.298 μs (0 allocations: 0 bytes)
     └─ simd:      1.296 μs (0 allocations: 0 bytes)

Tests with Julia-1.6.0, T=Float64, n=1000
 ├─ Call spline (1000 times):
 │   ├─ simple:    863.083 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  352.774 ns (0 allocations: 0 bytes)
 │   └─ simd:      352.255 ns (0 allocations: 0 bytes)
 ├─ Call inlined_spline (1000 times):
 │   ├─ simple:    863.617 ns (0 allocations: 0 bytes)
 │   ├─ inbounds:  352.670 ns (0 allocations: 0 bytes)
 │   └─ simd:      352.368 ns (0 allocations: 0 bytes)
 │
 ├─ Computation of weights with spline (1000 times):
 │   ├─ simple:    1.493 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.309 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.310 μs (0 allocations: 0 bytes)
 ├─ Computation of weights with inlined_spline (1000 times):
 │   ├─ simple:    1.580 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.503 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.456 μs (0 allocations: 0 bytes)
 ├─ Inlined computation of weights with spline (1000 times):
 │   ├─ simple:    1.309 μs (0 allocations: 0 bytes)
 │   ├─ inbounds:  1.308 μs (0 allocations: 0 bytes)
 │   └─ simd:      1.306 μs (0 allocations: 0 bytes)
 └─ Inlined computation of weights with inlined_spline (1000 times):
     ├─ simple:    1.308 μs (0 allocations: 0 bytes)
     ├─ inbounds:  1.308 μs (0 allocations: 0 bytes)
     └─ simd:      1.306 μs (0 allocations: 0 bytes)

I’m also seeing a performance drop of around 30% running the same benchmark code on Julia 1.6 compared to Julia 1.5. I haven’t managed to narrow it down yet, but it does seem to involve @inline and @simd blocks.

What kind of machine (OS/CPU) are you using? On all the Linux machines I have been able to run the tests on (with relatively recent processors implementing AVX2 instructions), the performance drop is a factor of 3 or more, not 30%.

In the meantime, I have also checked that the issue is not related to the stride of the elements in the destination array: after permuting its dimensions, the results are the same.

The “brute force” approach, which can be tedious but also very effective, is to do a “git bisect” on Julia to try to find the exact commit that introduced the regression. Basically, you give git two commits, one good and one bad, and then you bisect the range iteratively until you find the offending commit (How to discover a bug using git bisect).

It is also possible to automate this (Fully automated bisecting with "git bisect run" [LWN.net]), but automating based on a performance benchmark is always a bit hard. Instead, one can sometimes look at e.g. @code_llvm for the presence of vectorized (SIMD) intrinsics and use that as a marker for success/failure.
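
For instance (a rough sketch with a made-up toy function; the exact vector width to look for in the IR depends on the CPU):

using InteractiveUtils   # provides @code_llvm and code_llvm outside the REPL

# A toy loop whose LLVM IR should contain packed vector instructions
# (e.g. fadd <4 x double> or <8 x float>) when SIMD kicks in.
function axpb!(dst, src)
    @inbounds @simd for i in eachindex(dst, src)
        dst[i] = 2 * src[i] + 1
    end
    return dst
end

a = rand(Float64, 1000); b = similar(a)
@code_llvm debuginfo=:none axpb!(b, a)

# For automated bisecting, capture the IR as a string and search it:
buf = IOBuffer()
code_llvm(buf, axpb!, Tuple{Vector{Float64},Vector{Float64}}; debuginfo=:none)
vectorized = occursin("<4 x double>", String(take!(buf)))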


This is on an old Mac (1.6 GHz Intel Core i5). The benchmark I’ve run is very high-level, so a 300% slowdown in one loop might cause an overall drop of 30% in my function. I am profiling the benchmark now to narrow it down, and I was mostly joining in on this thread to get some clues as to what might be going on.

Excellent suggestion. I like this git bisect strategy! I guess that I have to clone the Julia source and build the different versions corresponding to the commits chosen by git bisect. This will take a while, but it is worth it…

I’ve tracked it down. Indeed, I’m seeing a 3x slowdown for this “trivialized” loop function.

using BenchmarkTools

# In-place element-wise addition, iterating with CartesianIndices.
function increment!(p::AbstractArray, a::AbstractArray)
    for I ∈ CartesianIndices(p)
        @inbounds p[I] += a[I]
    end
end

n = 32
@benchmark increment!(a,b) setup=(a=rand(n,n,n);b=rand(n,n,n))

This gives a 3x slowdown from Julia 1.5 to 1.6. The amount of slowdown does depend on n: the slowdown is only 1.7x when n=64 and is 4.3x when n=16. Using 2D arrays with the same total memory shows a similar slowdown, but using 1D arrays makes the slowdown vanish.


Great! I admit your function is likely to be much more common than my rather intricate example.

I am currently running git bisect and that’s really a great tool. It takes a while because building Julia is not a small task (and, from where I started, 10 trials are needed). But I have already found that versions in the range Julia-1.6.0-DEV.1157 to Julia-1.6.0-DEV.1585 were more efficient than Julia 1.5.4: they can effectively vectorize the loops inside my compute_weights! function without the need to make it explicitly inlined.


If you set JULIA_PRECOMPILE=0 in Make.user, it should build a bit faster (the REPL will be more laggy, though).

OK I’ll try it. Thanks!

Only 6 tries left :wink:

Bisecting: 53 revisions left to test after this (roughly 6 steps)

My guess is that it is https://github.com/JuliaLang/julia/pull/37829 but I haven’t verified.

And now I found performance regression on high-dimensional array iteration using CartesianIndices (no simd) · Issue #38073 · JuliaLang/julia · GitHub.

Reverting the PR seems to fix it. Sorry @emmt for making you bisect. If you want, you could finish it to confirm.


A workaround on 1.6 is to write the loop as:

function increment!(p::AbstractArray,a::AbstractArray)
    @inbounds @simd for I ∈ CartesianIndices(p)
        p[I] += a[I]
    end
end

That seems to restore SIMD-ability. It is similar to Fix performance regression in broadcasting with CartesianIndices by kimikage · Pull Request #39333 · JuliaLang/julia · GitHub.

No problem, I will see the git bisect process through to the end. I learned a lot of fancy things thanks to you and I really want to confirm this issue.

BTW, your suggestion:

really speeds things up!


A tool that automatically finds heavy performance regressions by git bisecting sounds like a great idea for a package, if anyone wants to help out.


Yes, I confirm that fixed it for Julia 1.6.

However… I have around 30 similar loops in my package. Before I go hunt them all down, can I confirm that this is the “right” way to use @inbounds @simd? It seems like I see them scattered around in different places in different code bases.

1 Like