Help with optimising a simple function

Recently my colleagues asked me to translate my estimation code to Python. Making it run at any reasonable speed was a real pain and a good advertisement for Julia, but in the end the numba version came out a few times faster. This is strange, because the algorithm is very elementary; you'd expect similar LLVM output from both. The main culprit seems to be here:

function theorCovEff3(ts,k,l,ln,α)
    K(s,t) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
    function incrCov(ts,i,j,k,l,K) 
        a, b, c, d = ts[i], ts[j], ts[k], ts[l]
        K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    end

    if k > l
        k, l = l, k
    end
    N1 = h -> ln-l-h+1
    N2 = h -> (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )

    return 2/((ln-k)*(ln-l)) * ( 
          sum( N1(h)*incrCov(ts,1,h,k,l,K)^2 for h in 2:ln-l; init=0 ) 
        + sum( N2(h)*incrCov(ts,h,1,k,l,K)^2 for h in 1:ln-k )
    )
end

Example use: theorCovEff3(1:100, 10, 2, 100, 0.7). I've tried @inbounds and @fastmath; they don't help much. I don't understand why, but replacing sum with loops makes it twice as fast, though still slower than numba. Any tips?

1 Like

You have several closures. Performance Tips · The Julia Language

3 Likes

It is more complicated than that… I think this may actually be a bug? For instance, simply changing the way l, k are permuted makes the instability disappear:

N1(h,ln,l) = ln-l-h+1
N2(h,l,k,ln) = (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )
K(s,t,α) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
function incrCov(ts,i,j,k,l,K,α)
    a, b, c, d = ts[i], ts[j], ts[k], ts[l]
    K(a,b,α) + K(a+c,b+d,α) - K(a,b+d,α) - K(a+c,b,α)
end

function theorCovEff3(ts,k,l,ln,α)
    k, l = (k > l) ? (l, k) : (k, l) # Type stable
    #if k > l; k, l = l, k; end # Type unstable
    return 2/((ln-k)*(ln-l)) * ( 
          sum( N1(h,ln,l)*incrCov(ts,1,h,k,l,K,α)^2 for h in 2:ln-l) 
        + sum( N2(h,l,k,ln)*incrCov(ts,h,1,k,l,K,α)^2 for h in 1:ln-k)
    )
end

(this version has no closures, to simplify things)

I managed to make it 2x faster, but it sounds like your gap was larger than that

function theorCovEff_mikm(ts, k, l, ln, K)

    incrCov(a, b, c, d) = K(a, b) + K(a+c, b+d) - K(a, b+d) - K(a+c, b)

    k, l = minmax(k, l)

    N1 = let ln=ln, l=l;
        h -> ln-l+1-h
    end
    N2 = let ln=ln, l=l, k=k
        h -> (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )
    end

    ts1 = ts[1]
    tsk = ts[k]
    tsl = ts[l]

    s1 = sum(h -> N1(h)*incrCov(ts1, ts[h], tsk, tsl)^2, 2:ln-l)
    s2 = sum(h -> N2(h)*incrCov(ts[h], ts1, tsk, tsl)^2, 1:ln-k)

    return 2/((ln-k)*(ln-l)) * (s1 + s2)
end
julia> using BenchmarkTools

julia> @btime theorCovEff3($(1:100), $10, $2, $100, 0.7)
  51.413 μs (1133 allocations: 20.91 KiB)
0.7765443891896003

julia> @btime theorCovEff_mikm($(1:100), $10, $2, $100, (s,t) -> s^0.7 + t^0.7 - abs(s-t)^0.7)
  29.637 μs (0 allocations: 0 bytes)
0.7765443891896003

I pass the K function in externally (making it branch on a constant α every call seems silly), used the sum(f, iter) form instead, and pulled a couple of constant accesses out of the loops. I suspect the material difference here had to do with the handling of closures over l and k (I used let blocks, although I believe these are unnecessary).

I’m quite sure this has to do with closures, still. Because l and k might be swapped or might not, they must be boxed by any closure. Using an unconditional reassignment avoids this issue.

3 Likes

I removed the closures from the above (I think). But still, l and k are of the same type, I don’t see why they should be boxed.

note: the 2x improvement is entirely due to that change:

julia> @b theorCovEff3(1:100, 2, 10, 100, 0.7)
49.941 μs (1133 allocs: 20.906 KiB)

julia> @b theorCovEff3(1:100, 2, 10, 100, 0.7)
34.744 μs

where these were run with the original implementation, just by swapping these lines:

    k, l = (k > l) ? (l,k) : (k,l)
    #if k > l; k, l = l, k; end

typeof((N1(h, ln, l) * incrCov(ts, 1, h, k, l, K, α) ^ 2 for h = 2:ln - l)) = Base.Generator{UnitRange{Int64}, var"#106#109"{UnitRange{Int64}, Int64, Int64, Int64, Float64}}

The Generator still makes a closure. And the boxes still happen because it might need to access the data originally assigned to k or originally to l wherever k or l are used. The unconditional assignment fixes this by anchoring them into solid locations.
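If you want to see this directly, you can inspect the lowered code; Core.Box is the marker for a boxed capture. A sketch (not from the thread; the exact printed form may vary across Julia versions):

```julia
# Lower (but don't run) a function combining the conditional swap with a
# capturing generator, then look for Core.Box in the lowered code.
ex = Meta.lower(Main, :(function f(x, k, l)
    if k > l; k, l = l, k; end
    sum((x[k] + x[l])^2 for _ in eachindex(x))
end))

# On the affected versions, k and l are held in Core.Box cells:
occursin("Core.Box", string(ex))
```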

1 Like

Well, ok, but honestly I don't see any good reason for the lowering of one case to be any different from the other. With k, l = (k > l) ? (l,k) : (k,l) the values might still be interchanged, and they are of the same type; I don't see what having the else branch there guarantees to the compiler that isn't guaranteed in the original version.

This is an actual case where I would expect that specialization to all input variable types would actually help the compiler to know what to do.

Well, here is an MWE if anyone feels there's something to be done there:

function unstable(x,k,l)
    if k > l; k, l = l, k; end
    return sum((x[k]+x[l])^2 for _ in eachindex(x)) 
end

function stable(x,k,l)
    k, l = if k > l; (l, k); else (k, l); end
    return sum((x[k]+x[l])^2 for _ in eachindex(x)) 
end
julia> @b unstable(1:100, 2, 10)
4.478 μs (299 allocs: 7.797 KiB)

julia> @b stable(1:100, 2, 10)
1.335 ns
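Inference agrees with the benchmark here. As a sketch (Base.return_types is a diagnostic, not a guarantee; the definitions repeat the MWE above so the snippet is self-contained):

```julia
function unstable(x, k, l)
    if k > l; k, l = l, k; end          # conditional swap: captures get boxed
    return sum((x[k] + x[l])^2 for _ in eachindex(x))
end

function stable(x, k, l)
    k, l = (k > l) ? (l, k) : (k, l)    # unconditional reassignment: no boxes
    return sum((x[k] + x[l])^2 for _ in eachindex(x))
end

# stable infers a concrete Int return; the boxed captures in unstable make
# inference fall back to Any, which is where the allocations come from.
Base.return_types(stable,   (UnitRange{Int}, Int, Int))
Base.return_types(unstable, (UnitRange{Int}, Int, Int))
```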
6 Likes

Honestly, I’m surprised too. It’s not like the if <bad>; <swap>; end pattern is uncommon. And in any case it isn’t type unstable (it’s probably “binding unstable,” which is what I suspect caused the problem, but the compiler could probably work that out).

1 Like

Filed an issue here. Probably a duplicate of something known, but it doesn't hurt.

The original function I used actually was theorCovEff(ts,k,ln,K::Function). The reason I defined K in the body here is that doing so makes it twice as fast when used on its own. However, this function is mostly called inside other functions, and I checked that under those conditions both versions behave the same way. The fastest version I currently have is

function theorCovEff4(ts,k,l,ln,α)
    K(s,t) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
    incrCov(a,b,c,d) =  K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    k, l = minmax(k, l)
    ts1, tsk, tsl = ts[1], ts[k], ts[l]

    S1 = sum(h -> (ln-l-h+1) * incrCov(ts1, ts[h], tsk, tsl)^2, 2:ln-1; init=0)
    S2 = sum(h -> ((h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )) * incrCov(ts[h], ts1, tsk, tsl)^2, 1:ln-k)

    return 2/((ln-k)*(ln-l)) * (S1 + S2)
end

in Julia 1.11. Using a loop instead of sum(f,iter) makes it slightly faster. Different variants of incrCov do not matter. It is nasty that sum(iter) is twice as slow; intuitively you'd think it should be the same as a loop… The version above is still around twice as slow on my machine as the numba code, which is

@njit
def theorCovEff(ts, k, l, ln, alpha):
    K = lambda s, t: 2 * np.minimum(s, t) if np.abs(alpha - 1.0) < 1e-8 else (s**alpha + t**alpha - np.abs(s - t)**alpha)
    if k > l:
        k, l = l, k
    N1 = lambda h,k,l,ln: ln - l - h + 1
    N2 = lambda h,k,l,ln: (ln - l) if h <= l - k + 1 else (ln - k - h + 1)

    S1 = 0.
    for h in range(2, ln - l + 1):
        S1 += N1(h,k,l,ln) * (K(ts[0], ts[h-1]) + K(ts[0] + ts[k-1], ts[h-1] + ts[l-1]) - K(ts[0], ts[h-1] + ts[l-1]) - K(ts[0] + ts[k-1], ts[h-1]))**2
    S2 = 0.
    for  h in range(1, ln - k + 1):
        S2 += N2(h,k,l,ln) * (K(ts[h-1], ts[0]) + K(ts[h-1] + ts[k-1], ts[0] + ts[l-1]) - K(ts[h-1], ts[0] + ts[l-1]) - K(ts[h-1] + ts[k-1], ts[0]))**2
    return 2 / ((ln - k) * (ln - l)) * (S1 + S2)

Keep in mind lowering happens quite early, before type inference. And currently (until JuliaLowering.jl becomes a viable replacement, I guess) it’s not as well integrated with the rest of the compiler as one might hope for, as type inference is written in Julia (Compiler.jl), while the lowering is (partly?) written in flisp (except for JuliaLowering.jl).

2 Likes

There’s still a layer of Julia+C or Numba+NumPy+C between your code and the LLVM IR, so differences on this level aren’t unusual. Some years ago, NumPy’s sum was measured in a demo to be faster than Julia’s due to some extra optimizations in its C implementation; no idea if that held up, but it just goes to show that “compiles to LLVM” isn’t the whole picture.

Something you can control is the semantic differences between your numba and Julia implementations. I don't know if they translate to any significant differences in performance, but they would affect your results, even if subtly, so they're worth examining. These are the differences I could spot:

  • np.abs(alpha - 1.0) < 1e-8 versus (α ≈ 1.0)
    • ≈, aka isapprox, is not implemented like that. It uses <= instead of <, and it considers a relative tolerance (defaulting to the square root of the input type’s epsilon) in addition to an absolute tolerance (defaulting to 0). The relative tolerance isn’t compared directly; it scales the runtime values of the inputs for the comparison, which is more work.
  • if k > l: k, l = l, k versus k, l = minmax(k, l)
    • minmax compares with isless, not < / >, which corresponds to sorting order rather than IEEE 754 floating-point comparison. They differ on special values, so they do different work. k, l = (k > l) ? (l, k) : (k, l) is the strict equivalent.
  • The S1 numba loop iterates over range(2, ln - l + 1), but the Julia sum input is the longer 2:ln-1. That’s a one, not a lower-case L, and I assume it’s a typo.
  • The S1 numba loop starts at 0.0, but the Julia sum takes init=0 of type Int, not Float64. Julia’s type inference can’t make semantic switches for you; init=0.0 would be the equivalent. The Julia sum for S2 doesn’t have an explicit init at all, which branches to a different initialization strategy.
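A quick sketch of the first two points, with illustrative values of my own choosing (not from the thread):

```julia
# isapprox's default relative tolerance is sqrt(eps(Float64)) ≈ 1.49e-8,
# so near 1.0 it accepts a slightly wider window than a fixed 1e-8 check:
α = 1 + 1.2e-8
abs(α - 1.0) < 1e-8   # false: outside the fixed absolute window
α ≈ 1.0               # true: within the default relative tolerance

# minmax orders with isless, which sorts NaN last; the > ternary follows
# IEEE comparisons (NaN > 1.0 is false) and leaves the order untouched:
minmax(NaN, 1.0)                        # (1.0, NaN)
(NaN > 1.0) ? (1.0, NaN) : (NaN, 1.0)   # (NaN, 1.0)
```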

I’m not the last word on anything, but I did also check your Python lambdas and Julia closures. Possible boxing/type-stability issues for the closures aside, they look equivalent. Just curious, is there a reason the Python version didn’t have an incrCov and the Julia version didn’t have N1 and N2? Visual parallels like that, and writing the loop out in Julia, have their perks, especially when translating.

About the boxing/type-stability issues: those are fairly easy to work around once you understand the lowerer’s current limitations, but it’s also a valid choice to avoid capturing variables altogether. A couple of ways:

  1. You can just throw in extra arguments, like your N1 and N2 lambdas do, until no variables are captured.
  2. Instantiate callable struct instances that directly contain the values at the sites where a closure definition would capture variables; being immutable, those values can’t change the way captured variables can be conditionally reassigned.
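The second option might look like this; Kernel is a hypothetical name for the sake of the sketch, not code from the thread:

```julia
# A callable struct that stores α in an immutable field. Nothing is
# captured, so lowering has nothing to box.
struct Kernel{T<:Real}
    α::T
end

function (K::Kernel)(s, t)
    α = K.α
    return (α ≈ 1.0) ? 2min(s, t) : s^α + t^α - abs(s - t)^α
end

K = Kernel(0.7)
K(2.0, 3.0)   # same result as the closure version with α = 0.7
```

Kernel(0.7) can then be passed around like the K::Function argument in the original theorCovEff, and it specializes the same way a function argument does.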
3 Likes

This is the fastest I got, can you try

@fastmath function theorCovEff3(ts,k,l,ln,α)
    K(s,t) = ifelse(α ≈ 1.0, 2min(s,t) , (s^α + t^α - abs(s-t)^α))
    @inbounds @inline function incrCov(ts,i,j,k,l,K) 
        a, b, c, d = ts[i], ts[j], ts[k], ts[l]
        K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    end

    if k > l
        k, l = l, k
    end
    s1 = 0.0
    s2 = 0.0
    @simd for h in 2:ln-l
        s1 += incrCov(ts,1,h,k,l,K)^2*(ln-l-h+1)
    end
    @simd for h in 1:ln-k
        s2 += incrCov(ts,h,1,k,l,K)^2*ifelse(h <= l-k+1 , (ln-l) , (ln-k-h+1))
    end

    return 2/((ln-k)*(ln-l)) * ( s1 + s2)
end

On my computer: numba 27 μs, Julia 24 μs; without fastmath I get the same speed for both. Also, you’re doing far too many ops. Maybe you know it already, but it is equivalent to

 @fastmath @inbounds function theorCovEff3(ts,k,l,ln,α::T) where T
    if k > l
        k, l = l, k
    end
    @inline K(α, a,b,c,d) =  α ≈ 1 ? 2min(a,b) + 2min(a+c,b+d) - 2min(a,b+d) - 2min(a+c,b) : abs(a-b-d)^α + abs(a+c-b)^α - abs(a-b)^α   - abs(a+c-b-d)^α 
    s1 = T(0)
    s2 = T(0)
    c = ts[k]
    d = ts[l]
    @inbounds @simd for h in 2:ln-l
        a = ts[1]
        b = ts[h]
        i1 =  K(α, a,b,c,d)
        s1 += i1*i1*(ln-l-h+1)
    end
    @inbounds @simd for h in 1:ln-k
        a = ts[h]
        b = ts[1]
        i2 =  K(α, a,b,c,d)
        s2 += i2*i2*ifelse(h <= l-k+1 , (ln-l) , (ln-k-h+1))
    end
    return T(2)/((ln-k)*(ln-l)) * ( s1 + s2)
end

which then leads to 8 μs on my end

3 Likes