Help with optimising a simple function

Recently my colleagues asked me to translate my estimation code to Python. Making it run at any reasonable speed was truly a pain, and quite an advertisement for Julia, but in the end the numba version came out a few times faster. This is strange, because the algorithm is very elementary, so you’d expect similar LLVM output from both. The main culprit seems to be here:

function theorCovEff3(ts,k,l,ln,α)
    K(s,t) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
    function incrCov(ts,i,j,k,l,K) 
        a, b, c, d = ts[i], ts[j], ts[k], ts[l]
        K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    end

    if k > l
        k, l = l, k
    end
    N1 = h -> ln-l-h+1
    N2 = h -> (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )

    return 2/((ln-k)*(ln-l)) * ( 
          sum( N1(h)*incrCov(ts,1,h,k,l,K)^2 for h in 2:ln-l; init=0 ) 
        + sum( N2(h)*incrCov(ts,h,1,k,l,K)^2 for h in 1:ln-k )
    )
end

Example use: theorCovEff3(1:100, 10, 2, 100, 0.7). I’ve tried @inbounds and @fastmath; they don’t help much. I don’t understand why, but replacing sum with loops makes it twice as fast, yet still slower than numba. Any tips?


You have several closures. Performance Tips · The Julia Language
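For readers following along, here is a minimal sketch of the problem that section describes (essentially the manual’s abmult example, not the code from this thread): a captured variable that gets reassigned is wrapped in a Core.Box, and every use of it becomes dynamically typed.

```julia
# Slow: `r` is reassigned inside the function and captured by the closure,
# so lowering boxes it and `x * r` can no longer be inferred well.
function abmult_slow(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

# Fast: bind the final value to a new local that is never reassigned;
# the closure then captures a plain Int.
function abmult_fast(r::Int)
    r2 = r < 0 ? -r : r
    f = x -> x * r2
    return f
end
```

Both return the same results; only the inferability of the captured variable differs.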


It is more complicated than that… I think this is somewhat of a bug? For instance, by simply changing the way l,k are permuted, the instability disappears:

N1(h,ln,l) = ln-l-h+1
N2(h,l,k,ln) = (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )
K(s,t,α) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
function incrCov(ts,i,j,k,l,K,α)
    a, b, c, d = ts[i], ts[j], ts[k], ts[l]
    K(a,b,α) + K(a+c,b+d,α) - K(a,b+d,α) - K(a+c,b,α)
end

function theorCovEff3(ts,k,l,ln,α)
    k, l = (k > l) ? (l, k) : (k, l) # Type stable
    #if k > l; k, l = l, k; end # Type unstable
    return 2/((ln-k)*(ln-l)) * ( 
          sum( N1(h,ln,l)*incrCov(ts,1,h,k,l,K,α)^2 for h in 2:ln-l) 
        + sum( N2(h,l,k,ln)*incrCov(ts,h,1,k,l,K,α)^2 for h in 1:ln-k)
    )
end

(this version has no closures, to simplify things)


I managed to make it 2x faster, but it sounds like your gap was larger than that.

function theorCovEff_mikm(ts, k, l, ln, K)

    incrCov(a, b, c, d) = K(a, b) + K(a+c, b+d) - K(a, b+d) - K(a+c, b)

    k, l = minmax(k, l)

    N1 = let ln=ln, l=l;
        h -> ln-l+1-h
    end
    N2 = let ln=ln, l=l, k=k
        h -> (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )
    end

    ts1 = ts[1]
    tsk = ts[k]
    tsl = ts[l]

    s1 = sum(h -> N1(h)*incrCov(ts1, ts[h], tsk, tsl)^2, 2:ln-l)
    s2 = sum(h -> N2(h)*incrCov(ts[h], ts1, tsk, tsl)^2, 1:ln-k)

    return 2/((ln-k)*(ln-l)) * (s1 + s2)
end
julia> using BenchmarkTools

julia> @btime theorCovEff3($(1:100), $10, $2, $100, 0.7)
  51.413 μs (1133 allocations: 20.91 KiB)
0.7765443891896003

julia> @btime theorCovEff_mikm($(1:100), $10, $2, $100, (s,t) -> s^0.7 + t^0.7 - abs(s-t)^0.7)
  29.637 μs (0 allocations: 0 bytes)
0.7765443891896003

I passed the K function in externally (making it branch on a constant α every call seems silly), used the sum(f, iter) forms instead, and pulled a couple of constant accesses out. I suspect the material difference here had to do with the handling of the closures over l and k (although I used let blocks, I believe they are unnecessary).
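As an aside, the three accumulation styles in play here (explicit loop, sum(f, itr), and a generator) are semantically equivalent and agree exactly on integer inputs; a small sketch (sum_three_ways is a hypothetical helper, not from the thread):

```julia
# Sum of squares three ways. sum(f, itr) passes a plain function, while the
# generator form wraps the body in a Base.Generator, i.e. an extra closure
# for the compiler to handle.
function sum_three_ways(xs)
    s = 0
    for x in xs          # explicit loop
        s += x^2
    end
    s_f   = sum(x -> x^2, xs)      # sum(f, itr) form
    s_gen = sum(x^2 for x in xs)   # generator form
    return s, s_f, s_gen
end
```

For example, sum_three_ways(1:100) returns (338350, 338350, 338350).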

I’m quite sure this has to do with closures, still. Because l and k might be swapped or might not, they must be boxed by any closure. Using an unconditional reassignment avoids this issue.


I removed the closures from the above (I think). But still, l and k are of the same type, I don’t see why they should be boxed.

Note: the 2x improvement is entirely due to this:

julia> @b theorCovEff3(1:100, 2, 10, 100, 0.7)
49.941 μs (1133 allocs: 20.906 KiB)

julia> @b theorCovEff3(1:100, 2, 10, 100, 0.7)
34.744 μs

Both were run with the original implementation, just swapping these lines:

    k, l = (k > l) ? (l,k) : (k,l)
    #if k > l; k, l = l, k; end

typeof((N1(h, ln, l) * incrCov(ts, 1, h, k, l, K, α) ^ 2 for h = 2:ln - l)) = Base.Generator{UnitRange{Int64}, var"#106#109"{UnitRange{Int64}, Int64, Int64, Int64, Float64}}

The Generator still makes a closure. And the boxes still happen because it might need to access the data originally assigned to k or originally to l wherever k or l are used. The unconditional assignment fixes this by anchoring them into solid locations.


Well, ok, but honestly I don’t see any good reason for the lowering of one case to be any different from the other. When using k, l = (k > l) ? (l,k) : (k,l) the values still might be interchanged, and they are of the same type; I don’t see why having the else branch there guarantees anything to the compiler that is not guaranteed in the original version.

This is an actual case where I would expect specialization on all input variable types to help the compiler know what to do.

Well, here is an MWE if anyone feels there’s something to be done there:

function unstable(x,k,l)
    if k > l; k, l = l, k; end
    return sum((x[k]+x[l])^2 for _ in eachindex(x)) 
end

function stable(x,k,l)
    k, l = if k > l; (l, k); else (k, l); end
    return sum((x[k]+x[l])^2 for _ in eachindex(x)) 
end
julia> @b unstable(1:100, 2, 10)
4.478 μs (299 allocs: 7.797 KiB)

julia> @b stable(1:100, 2, 10)
1.335 ns

Honestly, I’m surprised too. It’s not like the if <bad>; <swap>; end pattern is uncommon. And in any case it isn’t type unstable (it’s probably “binding unstable,” which is what I suspect caused the problem, but the compiler could probably work that out).


Filed an issue here. Probably a duplicate of something known, but it doesn’t hurt.

The original function I used was actually theorCovEff(ts,k,ln,K::Function). The reason I defined K in the body here is that this makes it twice as fast when used on its own. However, this function is mostly used inside other functions, and I checked that under those conditions both versions behave the same way. The fastest version I currently have is

function theorCovEff4(ts,k,l,ln,α)
    K(s,t) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
    incrCov(a,b,c,d) =  K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    k, l = minmax(k, l)
    ts1, tsk, tsl = ts[1], ts[k], ts[l]

    S1 = sum(h -> (ln-l-h+1) * incrCov(ts1, ts[h], tsk, tsl)^2, 2:ln-1; init=0)
    S2 = sum(h -> ((h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )) * incrCov(ts[h], ts1, tsk, tsl)^2, 1:ln-k)

    return 2/((ln-k)*(ln-l)) * (S1 + S2)
end

in Julia 1.11. Using a loop instead of sum(f,iter) makes it slightly faster. Different variants of incrCov do not matter. It is nasty that sum(iter) is twice as slow; intuitively you’d think it should be the same as a loop… The version above is still around twice as slow on my machine as the numba code, which is

@njit
def theorCovEff(ts, k, l, ln, alpha):
    K = lambda s, t: 2 * np.minimum(s, t) if np.abs(alpha - 1.0) < 1e-8 else (s**alpha + t**alpha - np.abs(s - t)**alpha)
    if k > l:
        k, l = l, k
    N1 = lambda h,k,l,ln: ln - l - h + 1
    N2 = lambda h,k,l,ln: (ln - l) if h <= l - k + 1 else (ln - k - h + 1)

    S1 = 0.
    for h in range(2, ln - l + 1):
        S1 += N1(h,k,l,ln) * (K(ts[0], ts[h-1]) + K(ts[0] + ts[k-1], ts[h-1] + ts[l-1]) - K(ts[0], ts[h-1] + ts[l-1]) - K(ts[0] + ts[k-1], ts[h-1]))**2
    S2 = 0.
    for  h in range(1, ln - k + 1):
        S2 += N2(h,k,l,ln) * (K(ts[h-1], ts[0]) + K(ts[h-1] + ts[k-1], ts[0] + ts[l-1]) - K(ts[h-1], ts[0] + ts[l-1]) - K(ts[h-1] + ts[k-1], ts[0]))**2
    return 2 / ((ln - k) * (ln - l)) * (S1 + S2)

Keep in mind that lowering happens quite early, before type inference. And currently (until JuliaLowering.jl becomes a viable replacement, I guess) it’s not as well integrated with the rest of the compiler as one might hope, as type inference is written in Julia (Compiler.jl), while lowering is (partly?) written in flisp (JuliaLowering.jl aside).


There’s still a layer of Julia+C or Numba+NumPy+C between your code and the LLVM IR, so differences on this level aren’t unusual. Some years ago, NumPy’s sum was measured in a demo to be faster than Julia’s due to some extra optimizations in its C implementation; no idea if that held up, but it just goes to show that “compiles to LLVM” isn’t the whole picture.

Something you can control is the set of semantic differences between your Numba and Julia implementations. I don’t know if they translate into any significant difference in performance, but they would affect your results, even if subtly, so they’re worth examining. These are the differences I could spot:

  • np.abs(alpha - 1.0) < 1e-8 versus (α ≈ 1.0)
    • ≈, aka isapprox, is not implemented like that. It uses <= instead of < and considers a relative tolerance (defaulting to the square root of the input type’s epsilon) in addition to an absolute tolerance (defaulting to 0). The relative tolerance isn’t compared directly; it scales the runtime values of the inputs for the comparison, which is more work.
  • if k > l: k, l = l, k versus k, l = minmax(k, l)
    • minmax compares with isless, not < / >, which corresponds to sorting order versus comparisons of floating point values per the IEEE 754 standard. They differ on special values, so they do different work. k, l = (k > l) ? (l, k) : (k, l) is the strict equivalent.
  • S1 Numba loop iterates over range(2, ln - l + 1), but the Julia sum input is the longer 2:ln-1. That’s a one, not a lower-case L, and I assume it’s a typo.
  • S1 Numba loop starts at 0.0, but the Julia sum takes init=0 of type Int, not Float64. Julia’s type inference can’t make semantic switches for you; init=0.0 would be the equivalent. The Julia sum for S2 doesn’t have an explicit init at all, which branches to a different initialization strategy.
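To make the first two bullets above concrete, a few checks one can run (my own sketch; the values are chosen around Float64’s default tolerances):

```julia
# isapprox's default rtol for Float64 is sqrt(eps(Float64)) ≈ 1.49e-8,
# and it scales with the magnitude of the inputs, unlike a fixed 1e-8 cutoff.
@assert (1.0 + 1e-9) ≈ 1.0           # within the default rtol
@assert !((1.0 + 1e-7) ≈ 1.0)        # outside the default rtol
@assert (2.0e8 + 1.0) ≈ 2.0e8        # |Δ| = 1.0, but rtol scales with magnitude

# minmax orders by isless, which distinguishes -0.0 from 0.0; `>` does not.
@assert minmax(0.0, -0.0) === (-0.0, 0.0)
@assert ((0.0 > -0.0) ? (-0.0, 0.0) : (0.0, -0.0)) === (0.0, -0.0)
```

The last two lines show the two forms returning differently ordered tuples on a special value, which is the kind of divergence being described.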

I’m not the last word on anything, but I did also check your Python lambdas and Julia closures. Possible boxing/type-stability issues for the closures aside, they look equivalent. Just curious: is there a reason why the Python version didn’t have an incrCov and the Julia version didn’t have N1 and N2? Visual parallels like that, and writing the loop out in Julia, have their perks, especially when translating.

About the boxing/type stability issues, those are fairly easy to work around if you understand the lowerer’s current limitations, but it’s also a valid choice to avoid dealing with capturing variables. Couple ways:

  1. You can just throw in extra arguments, like your N1 and N2 lambdas, until no variables are captured.
  2. Instantiate callable struct instances to directly contain values at the sites where a closure definition would capture variables; being immutable, those values can’t change the way captured variables can be conditionally reassigned.
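To illustrate option 2, a hypothetical sketch (the name IncrCov is mine, not from the thread): the values a closure would have captured are stored in immutable, concretely typed fields, so there is nothing for the lowerer to box.

```julia
# Callable struct standing in for the incrCov closure: K, c, d are plain
# immutable fields whose concrete types are fixed at construction time.
struct IncrCov{F,T}
    K::F
    c::T
    d::T
end
(f::IncrCov)(a, b) = f.K(a, b) + f.K(a + f.c, b + f.d) - f.K(a, b + f.d) - f.K(a + f.c, b)

# Construct once per call of the outer function, then use like a closure:
# inc = IncrCov(K, ts[k], ts[l]); inc(ts[1], ts[h])
```

Since the struct is immutable, reassigning k or l afterwards cannot affect an already-constructed instance, which is exactly the guarantee the boxed closure lacks.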

This is the fastest I got, can you try

@fastmath function theorCovEff3(ts,k,l,ln,α)
    K(s,t) = ifelse(α ≈ 1.0, 2min(s,t) , (s^α + t^α - abs(s-t)^α))
    @inbounds @inline function incrCov(ts,i,j,k,l,K) 
        a, b, c, d = ts[i], ts[j], ts[k], ts[l]
        K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
    end

    if k > l
        k, l = l, k
    end
    s1 = 0.0
    s2 = 0.0
    @simd for h in 2:ln-l
        s1 += incrCov(ts,1,h,k,l,K)^2*(ln-l-h+1)
    end
    @simd for h in 1:ln-k
        s2 += incrCov(ts,h,1,k,l,K)^2*ifelse(h <= l-k+1 , (ln-l) , (ln-k-h+1))
    end

    return 2/((ln-k)*(ln-l)) * ( s1 + s2)
end

on my computer: numba: 27 μs, julia: 24 μs; without fastmath I get the same speed for both. Also, you’re doing way too many ops; maybe you know it already, but it is equivalent to

 @fastmath @inbounds function theorCovEff3(ts,k,l,ln,α::T) where T
    if k > l
        k, l = l, k
    end
    @inline K(α, a,b,c,d) =  α ≈ 1 ? 2min(a,b) + 2min(a+c,b+d) - 2min(a,b+d) - 2min(a+c,b) : abs(a-b-d)^α + abs(a+c-b)^α - abs(a-b)^α   - abs(a+c-b-d)^α 
    s1 = T(0)
    s2 = T(0)
    c = ts[k]
    d = ts[l]
    @inbounds @simd for h in 2:ln-l
        a = ts[1]
        b = ts[h]
        i1 =  K(α, a,b,c,d)
        s1 += i1*i1*(ln-l-h+1)
    end
    @inbounds @simd for h in 1:ln-k
        a = ts[h]
        b = ts[1]
        i2 =  K(α, a,b,c,d)
        s2 += i2*i2*ifelse(h <= l-k+1 , (ln-l) , (ln-k-h+1))
    end
    return T(2)/((ln-k)*(ln-l)) * ( s1 + s2)
end

which then leads to 8 μs on my end


The Python version does not have incrCov because numba cannot compile functions which take functions as arguments. It cannot compile closures either. Or the sum function. Or matrices with anything other than the so-called dtypes. In general it was a horrible experience to make it run with numba, but without it the code is 150 times slower and therefore useless (the estimation starts to take hours).

I removed N1 and N2 in Julia just to simplify things and have this potential factor out of the picture.

The -1 was a typo, yes. Other versions had -l, and it does not change much for this particular call. The other factors you suggest also do not make a noticeable difference. It was worth trying, though.

Just playing a bit here: we can get close to the fastest version using:

julia> function theorCovEff3(ts,k,l,ln,α)
           K(s,t) = ifelse((α ≈ 1.0), 2min(s,t),(s^α + t^α - abs(s-t)^α))
           function incrCov(ts,i,j,k,l,K) 
               a, b, c, d = ts[i], ts[j], ts[k], ts[l]
               K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)
           end
           k, l = ifelse(k > l, (l,k), (k,l))
       
           N1 = h -> ln-l-h+1
           N2 = h -> (h <= l-k+1) ? ( ln-l ) : ( ln-k-h+1 )
       
           return 2/((ln-k)*(ln-l)) * ( 
                 sum( N1(h)*incrCov(ts,1,h,k,l,K)^2 for h in 2:ln-l; init=0 ) 
               + sum( N2(h)*incrCov(ts,h,1,k,l,K)^2 for h in 1:ln-k )
           )
       end
theorCovEff3 (generic function with 1 method)

julia> @b theorCovEff3(1:100, 2, 10, 100, 0.7)
21.193 μs

which is the original version, just changing the k,l assignment to remove the type instability and changing the K(s,t) definition to use ifelse.

(I like the fact that the @inbounds, @simd, @fastmath, etc. decorations are most of the time unnecessary)

This is twice as fast as numba on my machine! So it seems the main bottleneck was using incrCov, and numba was faster mostly because I could not use it in the first place?


Oh strange, I read way back in the v0.47 release notes that lambdas and closures were supported. I never actually tried them in Numba because I hated writing them in Python, so I probably misunderstood what the release note actually meant.

I wouldn’t think that’s the reason. I can’t rule out the manual @inline helping, but small methods like this usually get inlined to optimize away redundancies, in which case it wouldn’t be significantly different from manually expanding the expression at call sites. The K in the 8 μs version does the job that K and incrCov did, but also omits terms that would’ve cancelled to zero within floating-point error (full explanation below); I’d expect manual cancelling to have the same performance benefits in Numba.

originals
    K(s,t) = (α ≈ 1.0) ? 2min(s,t) : (s^α + t^α - abs(s-t)^α)
    incrCov(a,b,c,d) =  K(a,b) + K(a+c,b+d) - K(a,b+d) - K(a+c,b)

1. α never changes, so you can merge the 2 methods and reduce the redundant
   α ≈ 1.0 checks by moving the branch into the incrCov part.
    (α ≈ 1.0) ?
     2min(a,b) + 2min(a+c,b+d) - 2min(a,b+d) - 2min(a+c,b) :
    (a^α + b^α - abs(a-b)^α) + ((a+c)^α + (b+d)^α - abs(a+c-b-d)^α)
    - (a^α + (b+d)^α - abs(a-b-d)^α) - ((a+c)^α + b^α - abs(a+c-b)^α)
2. K could be split into 2 separate callees to reduce the repeated subexpressions,
   but now that it's fully expanded, we can spot that the s^α + t^α terms cancel,
   which normally can't be optimized away. K was strictly a callee of incrCov to
   compute the expression of interest, so manual cancelling is worth it.
     (a^α-a^α + b^α-b^α + (a+c)^α-(a+c)^α + (b+d)^α-(b+d)^α) # zero
     - abs(a-b)^α - abs(a+c-b-d)^α + abs(a-b-d)^α + abs(a+c-b)^α
   If we wished, we could still make callees for 2min(s,t) and abs(s-t)^α.

fastest version so far
@inline K(α, a,b,c,d) =  α ≈ 1 ?
  2min(a,b) + 2min(a+c,b+d) - 2min(a,b+d) - 2min(a+c,b) :
  abs(a-b-d)^α + abs(a+c-b)^α - abs(a-b)^α - abs(a+c-b-d)^α
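A quick numerical sanity check of the cancellation above (my own sketch, arbitrary inputs): the reduced form agrees with the original K/incrCov composition up to floating-point error.

```julia
α = 0.7
Korig(s, t) = s^α + t^α - abs(s - t)^α
incrCov(a, b, c, d) = Korig(a, b) + Korig(a + c, b + d) - Korig(a, b + d) - Korig(a + c, b)
# The s^α + t^α terms cancel pairwise, leaving only the abs(...)^α terms:
Kred(a, b, c, d) = abs(a - b - d)^α + abs(a + c - b)^α - abs(a - b)^α - abs(a + c - b - d)^α
@assert isapprox(incrCov(1.0, 2.0, 3.0, 4.0), Kred(1.0, 2.0, 3.0, 4.0); atol=1e-12)
```

The reduced form does a quarter of the power calls, which is where most of the time goes.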

Generally, I’d say this optimization exercise says very little about Numba vs Julia because implementation equivalence was never attempted. For example, Numba was never given the fastmath=True option, which flouts IEEE 754 in many ways for potential optimizations. Flouting IEEE 754 isn’t strictly bad; @simd is routinely used to reassociate floating point operations in loops or higher-level functions implemented with loops, and LLVM (so both Julia and Numba) will try to do SIMD automatically. However, @fastmath is aggressive enough that it’s usually entirely up to us to validate assumptions for the specific use case and check if it even yielded any performance benefits.


I checked with and without @fastmath; it is nearly irrelevant. I was also checking @inline before; it added a few percent at most. But the last version by @yolhan_mannes was 3 times faster than my previous one. I guess most of the gain is due to the fact that merging K and incrCov made the dependence on α completely explicit. (I had tried this before, but only with K, and it didn’t help.) Maybe the compiler has problems inlining functions which depend on external parameters?

Anyway, it is somewhat disappointing that the Julia compiler cannot deal with such stuff by itself. The initial version I had was written in idiomatic Julia style (at least I think so), and it was 15 times slower than the last one. But then, it’s not like numba is any better at this.


Did you try this version? It is essentially identical to what you wrote initially, just fixing the type instability and defining K in terms of ifelse. (I think the advantage of the fastest version relative to this one is the better way to compute K, but that is not related to the language used.)
