Drop of performances with Julia 1.6.0 for InterpolationKernels

I got caught by the following error:

julia/cli/trampolines/trampolines_x86_64.S:44: Error: no such instruction: `endbr64'

and cannot build Julia for commits:

  • 2e3364e02f1dc3777926590c5484e7342bc0285d
  • 9528ac2785f214cdc188a835deed3d93e0b012c6
  • 7c17bb361e859aa834034c977ca683a78f17d506

I will start over…

Is there a way to guess which commit corresponds to a given Julia DEV version?

I don’t think it is an β€œofficial way”. The official way would be to have the performance regression properly fixed. I just noticed that this particular rewrite fixed the example you gave and offered it as a potential workaround, while the issue is being looked at more carefully.
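
For what it is worth, a Julia built from a git checkout records the commit it was built from, so one rough way to map a DEV build back to a commit is to ask the running session itself (a small sketch; the field may be empty when the build has no git metadata):

println(VERSION)                       # e.g. 1.6.0-DEV.1638
println(Base.GIT_VERSION_INFO.commit)  # full commit hash, when recorded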

This is exactly what I am doing and it does not fix the issue.

I have pushed here the reduced example I am using to track down the issue, so you can have a look at the loops.
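
For readers who do not follow the link, here is a hypothetical sketch (not the actual reduced example) of the kind of loop in question: a kernel returning a 4-tuple of weights per input, stored column-wise into a destination array.

# Hypothetical stand-in for the real kernel, which evaluates a cubic spline.
compute_weights(t::T) where {T<:AbstractFloat} = (one(T) - t, t, t*t, t*t*t)

# Store the four weights of every input column-wise in the destination array.
# e.g. compute_weights!(Array{Float32}(undef, 4, 1000), rand(Float32, 1000))
function compute_weights!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        dst[1,i] = w1
        dst[2,i] = w2
        dst[3,i] = w3
        dst[4,i] = w4
    end
    return dst
end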

The narrowest git bisect range I reached before I could no longer build Julia was the following (note that the good version was even faster than Julia-1.5.4):

  • the good:

    Tests with Julia-1.6.0-DEV.1638, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  235.385 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  235.322 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  235.706 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  621.935 ns (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  624.953 ns (0 allocations: 0 bytes)
    
  • the bad (260% slower):

    Tests with Julia-1.6.0-DEV.1651, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  234.869 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  235.224 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  235.012 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  1.636 ΞΌs (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  1.630 ΞΌs (0 allocations: 0 bytes)
    

You can post the git bisect log if you want to β€œsave your progress”.

Oops, I have cleaned everything to restart. I’ll give it another try tomorrow, it’s late here :wink:

OK, done: commit fe1253ee258674844b8c0350deb05018909e823e is the first bad commit. But looking at what has changed, I really do not see why it breaks vectorization.

Here is the output of git bisect log:

git bisect start
# good: [2e3364e02f1dc3777926590c5484e7342bc0285d] [loader]: Re-export symbols for C embedding, rename to `libjulia-internal` (#38160)
git bisect good 2e3364e02f1dc3777926590c5484e7342bc0285d
# good: [7c17bb361e859aa834034c977ca683a78f17d506] Add examples for endswith and startswith (#38255)
git bisect good 7c17bb361e859aa834034c977ca683a78f17d506
# bad: [9631a9fcee01643ccffc8e3c4a7f34b659fa2580] add __CET__ check guards to trampoline assembly (#38683)
git bisect bad 9631a9fcee01643ccffc8e3c4a7f34b659fa2580
# good: [49b8e61a80b8108ca0a23f8075a0d0508b6947c7] Fix out-of-tree compilation of loader library. (#38677)
git bisect good 49b8e61a80b8108ca0a23f8075a0d0508b6947c7
# bad: [8ffcc0ea9274203420e407d0f921cdd4c346fa22] Fix `stdlib/Makefile` rules for JLLs (#38688)
git bisect bad 8ffcc0ea9274203420e407d0f921cdd4c346fa22
# bad: [fe1253ee258674844b8c0350deb05018909e823e] fix #38664, regression in `===` codegen for Bool (#38686)
git bisect bad fe1253ee258674844b8c0350deb05018909e823e

For the record:

  • timings for first version showing the issue:
    Tests with Julia-1.6.0-DEV.1648, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  195.234 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  206.776 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  198.064 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  1.450 ΞΌs (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  1.440 ΞΌs (0 allocations: 0 bytes)
    
  • timings for the previous version:
    Tests with Julia-1.6.0-DEV.1647, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  202.012 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  201.413 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  193.534 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  501.495 ns (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  487.655 ns (0 allocations: 0 bytes)
    
Shall I report an issue now?

You can. I am pretty surprised about that bisection but stranger things have happened. Please add enough information so that someone could reproduce the slowdown to confirm the identified commit.
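
The environment details for such a report can be captured with the stock tooling (a sketch; versioninfo lives in the standard InteractiveUtils library):

using InteractiveUtils
versioninfo()   # prints the Julia version, commit, OS, CPU and LLVM details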

I was also surprised when I looked at the tiny changes implemented by this commit. But I double-checked that the previous commit has no such issue and that the issue is there with this commit. I will write a small test case to demonstrate this.

I’m not sure if eachindex(p) will fulfill your requirements, but it seems as fast as @inbounds @simd with CartesianIndices(p).
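
To make that comparison concrete, a minimal sketch of the two loop styles (the array p and the doubling operation are placeholders, not taken from the original example):

function scale_eachindex!(p)
    for i in eachindex(p)      # iterate the array's own index set
        p[i] *= 2
    end
    return p
end

function scale_cartesian!(p)
    @inbounds @simd for i in CartesianIndices(p)   # explicit multi-dimensional indices
        p[i] *= 2
    end
    return p
end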

Thanks for the thought, but no, this was a contrived example. I only use loops like this instead of broadcasting when I need a sub-range of the array.

After checking everything, I submitted issue #40276 with a shorter example.

In case it’s worth mentioning, I also experienced a significant slowdown in Julia 1.6 on loops where no tuples were involved:

using StatsBase: sample
using BenchmarkTools

n_obs = Int(1e6)
n_vars = 100
n_bins = 64
K = 3
𝑖 = collect(1:n_obs)
Ξ΄ = rand(n_obs, K)
hist = zeros(K, n_bins, n_vars);
X_bin = sample(UInt8.(1:n_bins), n_obs * n_vars);
X_bin = reshape(X_bin, n_obs, n_vars);

# Accumulate Ξ΄ into the histogram slice of the first variable,
# with the bin of each observation given by X_bin[:, 1].
function iter_1(X_bin, hist, Ξ΄, 𝑖)
    hist .= 0.0
    @inbounds for i in 𝑖
        @inbounds for k in 1:3
            hist[k, X_bin[i,1], 1] += Ξ΄[i,k]
        end
    end
end

# benchmark on a random, ordered half of the observations
𝑖_sample = sample(𝑖, Int(n_obs / 2), ordered=true)

Julia 1.5.3:

julia> @btime iter_1($X_bin, $hist, $Ξ΄, $𝑖_sample)
  1.224 ms (0 allocations: 0 bytes)

Julia 1.6.0:

julia> @btime iter_1($X_bin, $hist, $Ξ΄, $𝑖_sample)
  1.648 ms (0 allocations: 0 bytes)

Adding @simd to the loop had little to no effect on performance (from 1.22 ms to 1.19 ms on 1.5.3 and from 1.64 ms to 1.62 ms on 1.6.0).

Is this the same multi-dimensional array issue @kristoffer.carlsson pointed to previously?

I don’t think so. Looking at it a bit, it might be this change:

https://github.com/JuliaLang/julia/commit/cd134044fae2934dc5318af6da365109f114d232#diff-7aed4267c497767034ec7a7bb680d6dd2b5ca54cc26e377f174e7a63f66f2034R1064

I will verify.

From my understanding, it would be a different issue, considering that adding the @inbounds and/or @simd annotations didn’t bring any significant change in performance. For example, the following shows the same slowdown:

function iter_2(X_bin, hist, Ξ΄, 𝑖)
    hist .= 0.0
    @inbounds @simd for i in CartesianIndices(𝑖)
        @inbounds @simd for k in 1:3
            hist[k, X_bin[𝑖[i],1], 1] += Ξ΄[𝑖[i],k]
        end
    end
end

The pull request β€œadd a missing propagate_inbounds to a getindex method” by KristofferC (JuliaLang/julia#40281) should fix it.
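
For context, a minimal sketch of the mechanism that PR is about (the Wrap type is made up for illustration and is not the Base method that was fixed): without Base.@propagate_inbounds on a getindex method, an @inbounds at the call site never reaches the inner array access, so the bounds check stays inside the hot loop.

# A made-up wrapper type, only to illustrate inbounds propagation.
struct Wrap{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::Wrap) = size(w.data)

# With @propagate_inbounds, `@inbounds w[i]` at the call site also elides
# the bounds check of the inner `w.data[i]`; without the annotation, that
# inner check remains even inside @inbounds/@simd loops.
Base.@propagate_inbounds Base.getindex(w::Wrap, i::Int) = w.data[i]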

I changed the line in abstractarray.jl according to your PR (and rebuilt the two versions of Julia), but it does not change my timings, although it does not hurt either. Really, the only difference between version 1.6.0-DEV.1647 and version 1.6.0-DEV.1648 is the modification around line 2582 of src/codegen.cpp.

In case it may help to figure out the origin of the issue, I benchmarked a function that combines the computed weights into a single value (instead of storing them in some destination array), and it is as fast with either version of Julia.

function sum_prod_weights(src::Array{T,1}) where {T<:AbstractFloat}
    s = zero(T)
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        s += w1*w2*w3*w4
    end
    return s
end
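
Assuming the hypothetical compute_weights sketched earlier in the thread, such a reduction can be timed the same way as the other cases:

using BenchmarkTools
src = rand(Float32, 1000);
@btime sum_prod_weights($src)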

Sorry, I was talking about the regression reported in Drop of performances with Julia 1.6.0 for InterpolationKernels - #33 by jeremiedb.

OK, I understand…