Drop of performances with Julia 1.6.0 for InterpolationKernels

I got caught by the following error:

julia/cli/trampolines/trampolines_x86_64.S:44: Error: no such instruction: `endbr64'

and cannot build Julia for commits:

  • 2e3364e02f1dc3777926590c5484e7342bc0285d
  • 9528ac2785f214cdc188a835deed3d93e0b012c6
  • 7c17bb361e859aa834034c977ca683a78f17d506

I will start over…

Is there a way to guess which commit corresponds to a given Julia DEV version?

I don’t think it is an β€œofficial way”. The official way would be to have the performance regression properly fixed. I just noticed that this particular rewrite fixed the example you gave and offered it as a potential workaround, while the issue is being looked at more carefully.
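
For what it is worth, a Julia built from a git checkout records the commit it was built from, so one rough way to map a DEV build back to a commit is to ask the running session itself (a small sketch; the field may be empty when the build has no git metadata):

println(VERSION)                       # e.g. 1.6.0-DEV.1638
println(Base.GIT_VERSION_INFO.commit)  # full commit hash, when recorded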

This is exactly what I am doing and it does not fix the issue.

I have pushed here the reduced example I am using to track down the issue, so you can have a look at the loops.
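
For readers who do not follow the link, here is a hypothetical sketch (not the actual reduced example) of the kind of loop in question: a kernel returning a 4-tuple of weights per input, stored column-wise into a destination array.

# Hypothetical stand-in for the real kernel, which evaluates a cubic spline.
compute_weights(t::T) where {T<:AbstractFloat} = (one(T) - t, t, t*t, t*t*t)

# Store the four weights of every input column-wise in the destination array.
# e.g. compute_weights!(Array{Float32}(undef, 4, 1000), rand(Float32, 1000))
function compute_weights!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        dst[1,i] = w1
        dst[2,i] = w2
        dst[3,i] = w3
        dst[4,i] = w4
    end
    return dst
end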

The narrowest git bisect range I reached before I could no longer build Julia was the following (note that the good version was even faster than Julia-1.5.4):

  • the good:

    Tests with Julia-1.6.0-DEV.1638, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  235.385 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  235.322 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  235.706 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  621.935 ns (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  624.953 ns (0 allocations: 0 bytes)
    
  • the bad (260% slower):

    Tests with Julia-1.6.0-DEV.1651, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  234.869 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  235.224 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  235.012 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  1.636 ΞΌs (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  1.630 ΞΌs (0 allocations: 0 bytes)
    

You can post the git bisect log if you want to β€œsave your progress”.

Oops, I have cleaned everything to restart. I’ll give it another try tomorrow, it’s late here :wink:

OK, done: commit fe1253ee258674844b8c0350deb05018909e823e is the first bad commit. But looking at what has changed, I really do not see why it breaks vectorization.

Here is the output of git bisect log:

git bisect start
# good: [2e3364e02f1dc3777926590c5484e7342bc0285d] [loader]: Re-export symbols for C embedding, rename to `libjulia-internal` (#38160)
git bisect good 2e3364e02f1dc3777926590c5484e7342bc0285d
# good: [7c17bb361e859aa834034c977ca683a78f17d506] Add examples for endswith and startswith (#38255)
git bisect good 7c17bb361e859aa834034c977ca683a78f17d506
# bad: [9631a9fcee01643ccffc8e3c4a7f34b659fa2580] add __CET__ check guards to trampoline assembly (#38683)
git bisect bad 9631a9fcee01643ccffc8e3c4a7f34b659fa2580
# good: [49b8e61a80b8108ca0a23f8075a0d0508b6947c7] Fix out-of-tree compilation of loader library. (#38677)
git bisect good 49b8e61a80b8108ca0a23f8075a0d0508b6947c7
# bad: [8ffcc0ea9274203420e407d0f921cdd4c346fa22] Fix `stdlib/Makefile` rules for JLLs (#38688)
git bisect bad 8ffcc0ea9274203420e407d0f921cdd4c346fa22
# bad: [fe1253ee258674844b8c0350deb05018909e823e] fix #38664, regression in `===` codegen for Bool (#38686)
git bisect bad fe1253ee258674844b8c0350deb05018909e823e

For the record:

  • timings for first version showing the issue:
    Tests with Julia-1.6.0-DEV.1648, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  195.234 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  206.776 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  198.064 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  1.450 ΞΌs (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  1.440 ΞΌs (0 allocations: 0 bytes)
    
  • timings for the previous version:
    Tests with Julia-1.6.0-DEV.1647, T=Float32, n=1000
     β”œβ”€ Call spline (1000 times):
     β”‚   β”œβ”€ map! ───────────────────────  202.012 ns (0 allocations: 0 bytes)
     β”‚   β”œβ”€ inbounds_map! ──────────────  201.413 ns (0 allocations: 0 bytes)
     β”‚   └─ simd_map! ──────────────────  193.534 ns (0 allocations: 0 bytes)
     └─ Computation of weights with spline (1000 times):
         β”œβ”€ compute_weights! ───────────  501.495 ns (0 allocations: 0 bytes)
         └─ inlined_compute_weights! ───  487.655 ns (0 allocations: 0 bytes)
    
Shall I report an issue now?

You can. I am pretty surprised about that bisection but stranger things have happened. Please add enough information so that someone could reproduce the slowdown to confirm the identified commit.
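
The environment details for such a report can be captured with the stock tooling (a sketch; versioninfo lives in the standard InteractiveUtils library):

using InteractiveUtils
versioninfo()   # prints the Julia version, commit, OS, CPU and LLVM details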

I was also surprised when I looked at the tiny changes implemented by this commit. But I double-checked that the previous commit has no such issue and that the issue is there with this commit. I will write a small test case to demonstrate this.

I’m not sure if eachindex(p) will fulfill your requirements, but it seems as fast as @inbounds @simd with CartesianIndices(p).
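
To make that comparison concrete, a minimal sketch of the two loop styles (the array p and the doubling operation are placeholders, not taken from the original example):

function scale_eachindex!(p)
    for i in eachindex(p)      # iterate the array's own index set
        p[i] *= 2
    end
    return p
end

function scale_cartesian!(p)
    @inbounds @simd for i in CartesianIndices(p)   # explicit multi-dimensional indices
        p[i] *= 2
    end
    return p
end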

Thanks for the thought, but no, this was a contrived example. I only use loops like this instead of broadcasting when I need a sub-range of the array.

After checking everything, I submitted issue #40276 with a shorter example.

In case it’s worth mentioning, I also experienced a significant slowdown in Julia 1.6 on loops where no tuples were involved:

using StatsBase: sample
using BenchmarkTools

n_obs = Int(1e6)
n_vars = 100
n_bins = 64
K = 3
𝑖 = collect(1:n_obs)
Ξ΄ = rand(n_obs, K)
hist = zeros(K, n_bins, n_vars);
X_bin = sample(UInt8.(1:n_bins), n_obs * n_vars);
X_bin = reshape(X_bin, n_obs, n_vars);

# Accumulate Ξ΄ into the histogram slice of the first variable,
# with the bin of each observation given by X_bin[:, 1].
function iter_1(X_bin, hist, Ξ΄, 𝑖)
    hist .= 0.0
    @inbounds for i in 𝑖
        @inbounds for k in 1:3
            hist[k, X_bin[i,1], 1] += Ξ΄[i,k]
        end
    end
end

# benchmark on a random, ordered half of the observations
𝑖_sample = sample(𝑖, Int(n_obs / 2), ordered=true)

Julia 1.5.3:

julia> @btime iter_1($X_bin, $hist, $Ξ΄, $𝑖_sample)
  1.224 ms (0 allocations: 0 bytes)

Julia 1.6.0:

julia> @btime iter_1($X_bin, $hist, $Ξ΄, $𝑖_sample)
  1.648 ms (0 allocations: 0 bytes)

Adding @simd to the loop had little to no effect on performance (from 1.22 ms to 1.19 ms on 1.5.3 and from 1.64 ms to 1.62 ms on 1.6.0).

Is this the same multi-dimensional array issue @kristoffer.carlsson pointed to previously?

I don’t think so. Looking at it a bit, it might be this change:

https://github.com/JuliaLang/julia/commit/cd134044fae2934dc5318af6da365109f114d232#diff-7aed4267c497767034ec7a7bb680d6dd2b5ca54cc26e377f174e7a63f66f2034R1064

I will verify.

From my understanding, it would be a different issue, considering that adding the @inbounds and/or @simd annotations didn’t bring any significant change in performance. For example, the following shows the same slowdown:

function iter_2(X_bin, hist, Ξ΄, 𝑖)
    hist .= 0.0
    @inbounds @simd for i in CartesianIndices(𝑖)
        @inbounds @simd for k in 1:3
            hist[k, X_bin[𝑖[i],1], 1] += Ξ΄[𝑖[i],k]
        end
    end
end

The pull request β€œadd a missing propagate_inbounds to a getindex method” by KristofferC (JuliaLang/julia#40281) should fix it.
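
For context, a minimal sketch of the mechanism that PR is about (the Wrap type is made up for illustration and is not the Base method that was fixed): without Base.@propagate_inbounds on a getindex method, an @inbounds at the call site never reaches the inner array access, so the bounds check stays inside the hot loop.

# A made-up wrapper type, only to illustrate inbounds propagation.
struct Wrap{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::Wrap) = size(w.data)

# With @propagate_inbounds, `@inbounds w[i]` at the call site also elides
# the bounds check of the inner `w.data[i]`; without the annotation, that
# inner check remains even inside @inbounds/@simd loops.
Base.@propagate_inbounds Base.getindex(w::Wrap, i::Int) = w.data[i]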

I changed the line in abstractarray.jl according to your PR (and rebuilt the two versions of Julia), but it does not change my timings, although it does not hurt either. Really, the only difference between version 1.6.0-DEV.1647 and version 1.6.0-DEV.1648 is the modification around line 2582 of src/codegen.cpp.

In case it may help to figure out the origin of the issue, I benchmarked a function that combines the computed weights into a single value (instead of storing them in some destination array), and it is as fast with either version of Julia.

function sum_prod_weights(src::Array{T,1}) where {T<:AbstractFloat}
    s = zero(T)
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        s += w1*w2*w3*w4
    end
    return s
end
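
Assuming the hypothetical compute_weights sketched earlier in the thread, such a reduction can be timed the same way as the other cases:

using BenchmarkTools
src = rand(Float32, 1000);
@btime sum_prod_weights($src)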

Sorry, I was talking about the regression reported in Drop of performances with Julia 1.6.0 for InterpolationKernels - #33 by jeremiedb.

OK, I understand…