Compilers, purity, and variance computation

Most likely you call sum and then std, i.e. two separate loops that scan the data twice, but technically the functional approach doesn’t rule out optimal performance: scanning just once… In theory the compiler (of functional languages at least) is allowed to reverse the order of the loops or do even more advanced transformations (because these are pure functions), though in practice it likely doesn’t:

I.e. these are pure functions, and a sufficiently good compiler could merge them, both for the naive loops and for the numerically stable versions, as in Base (I at least assume). I have no confidence compilers actually do that, however… and I’m not even sure it would be good: as it stands you have those two functions precompiled and ready to use, not inlined, but for the compiler to merge them it would have to inline them and interleave the loads, so they are shared rather than separate in each loop.
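Concretely, “scanning just once” would mean something like this sketch (fused_sum_std is a made-up name, not an existing function; note it has to fall back on the naive sum-of-squares formula, which matters later in the thread):

function fused_sum_std(x)          # hypothetical fused version, for illustration only
    s  = zero(eltype(x))
    s2 = zero(eltype(x))
    for xi in x                    # one pass: each element is loaded from memory once
        s  += xi
        s2 += xi * xi
    end
    n = length(x)
    return s, sqrt((s2 - s^2 / n) / (n - 1))   # naive E[x^2] - E[x]^2 form; cancellation-prone
end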

So what does the compiler do, or what should it do?

  1. It could scan up in memory for sum, but scan down for std, i.e. starting with what is most recently in the cache. If both scan in the same direction (which they likely do, and then it’s suboptimal), your data may already be out of the cache, if not L3 then at least likely L1.
  2. You could also start two threads, one scanning for sum and the other for std; both will use load instructions into the same (or different) L1 D-cache, but at least they benefit from the data prefetched into the shared L2/L3 cache. Then it would be better for both to scan in the same direction, whichever is chosen (a sketch of this option follows below)…

Note that 1. and 2. imply opposite directions, so the algorithms can’t be optimal direction-wise for both. With functional programming the order need not be defined, but I’m pretty sure imperative loop code is what’s actually used.
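A minimal sketch of option 2, assuming Julia is started with at least two threads (sum_and_std_threaded is a made-up name, and this ignores the numerical-stability questions discussed further down):

using Statistics  # for std

function sum_and_std_threaded(x)
    t = Threads.@spawn sum(x)   # one task scans the array for the sum
    sd = std(x)                 # the current task scans it for the std
    return fetch(t), sd         # both tasks only read x, so this is safe
end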

A very good compiler could even reverse the order of an imperative loop, since the function is pure, though I very much doubt it would. And I’m even less convinced Numba or other compilers would do such a trick.

I’m only trying to show that functional code can be better in theory… though in practice it likely wouldn’t be.

Maybe… or do you pay in implementation complexity by NOT using functional programming? At least in this case that doesn’t apply, since these are pure functions. If you allow overwriting, then you likely disable some possible optimizations: for the above, you want to know that the array isn’t mutated by yet another function. If it is, you’re likely left with bugs, which is why Julia doesn’t use @view by default, unlike languages such as C/C++ where view-like semantics are the default (using views is faster, yes, just potentially dangerous, and you must know what you’re doing, i.e. at least not mutate from other threads, or else be very careful…).
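For concreteness, the copy-vs-view distinction referred to above (this is just standard Julia behavior, shown to illustrate the aliasing danger):

a = [1, 2, 3, 4]
b = a[2:3]        # Julia's default: a copy
v = @view a[2:3]  # a view: aliases the same memory as a
a[2] = 99
b[1]              # still 2: the copy is unaffected
v[1]              # 99: the view sees the mutation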

What you pay for functional programming/persistent data structures (which are otherwise very nice) is more GC pressure, but that doesn’t seem too bad, and the Julia GC keeps getting better. Julia just needs to free eagerly in loops when possible, with compiler support (rather than doing it manually with Bumper.jl).

No it’s not. x + y is pure, but (x + y) + z \ne x + (y + z) for floating-point numbers (in general), so the compiler isn’t allowed to re-associate without some explicit annotation like @simd.
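A concrete instance of that inequality in Float64:

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  # false
(0.1 + 0.2) + 0.3                       # 0.6000000000000001
0.1 + (0.2 + 0.3)                       # 0.6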

(That being said, the Julia reduce function is explicitly documented to have unspecified associativity, so in principle we could do more here.)

A single-pass algorithm looks fundamentally different than the two-pass algorithm, and changes the floating-point error even more severely. So no, the compiler is not generally allowed to transform a two-pass algorithm into a single-pass algorithm. (And this kind of automatic transformation between two completely different algorithms is not something compilers are capable of in general.)

Asymptotically this does not help the memory performance, once the array is much bigger than the cache.

No, they should be computed together (or in sequence) if you want a numerically stable algorithm.

Purity has nothing to do with these issues.


The compiler is allowed to reorder when the operations are pure, as you say, and in practice that includes this case; see below for Posits, which are float-like (reordering is also allowed for the standard integers… though not for all types, e.g. not for saturated arithmetic).

Why is that exactly?

I mean, why can’t the loops be merged into one in theory (at least when the same direction is kept for both; then merging wouldn’t change the semantic meaning?!)?

If this is non-trivial, and the compiler doesn’t do it, as I thought, should Julia provide a sumstd function that returns a tuple of both, since seemingly you often want both of them calculated…? [Some functions are already multi-valued, e.g. divrem (though only basic ones that I recall, never on vectors, and it doesn’t translate to one machine instruction, though a very few CPU instructions are multi-valued, which is a bit awkward for RISC).]
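divrem already follows that multi-value pattern, and a sumstd (hypothetical, not an existing function) would presumably have the same shape:

q, r = divrem(7, 3)    # (2, 1): quotient and remainder from a single call
# s, sd = sumstd(x)    # hypothetical: the tuple-returning shape proposed above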

I think it helps up to the L3/last-level cache size, i.e. some hundreds of MB and counting, but yes, not if you’re over that (by even 1 byte? and in practice the effective size is smaller if yours isn’t the only code running), though it seems like opposite directions would help many cases:

Assuming you wouldn’t get totally wrong answers: Kahan summation already helps, so would opposite orders really be an issue in practice? And as you explained, a different order is allowed when explicitly enabled. Do you agree it’s then better to use opposite directions for the two operations in sequence? For data larger than L3 (or say half of that) you could work in chunks, or you could just run two threads concurrently in the same direction.

[Quoting your math/LaTeX gets screwed up by Discourse… does anyone know a way to fix it, or what do you use to input it? I fixed it, sort of.]

Yes, for traditional floats, but it’s not a problem for Posits, which is why they are good for parallelism and e.g. allow the other order (they look like floats to me, but maybe they aren’t/shouldn’t be described like that because of the advantages)? They subtype Real, but not IEEEFloat; possibly Julia needs an AssociativityReal abstract type…?

Holding Algebraic Rules Across Formats. Unlike IEEE 754, posit holds the associativity of addition. Moreover, computing values across multiple posit formats with different sizes is guaranteed to produce the same value.

some of the many differences:

Exception Handling. While there are 14 representations of NaNs in IEEE 754, there is no “NaN” in posits. Moreover, posit has single representations for 0 and ∞. Overall, it makes computation with posit numbers simpler than IEEE 754. In the case of exceptions (e.g., division by zero), the interrupt handler is expected to report the error and its cause to the application.

Inefficiencies of Posit Numbers. Despite all the expected benefits of posit numbers, the floating-point format has one important advantage over posit, which is the fixed bit format for the exponent and fraction. As a result, parallel decoding may be used for extracting the exponent and fraction of floating-point numbers.

There are plenty of articles on single-pass numerically stable algorithms for the variance, some of which are linked in the Wikipedia article I linked above.

Because a stable two-pass calculation of the variance uses the result of the first sum (the mean) in the second sum (the variance).
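A sketch of that data dependency (twopass_var is a made-up name; Statistics.var is the real, more careful implementation):

function twopass_var(x)
    n = length(x)
    m = sum(x) / n                              # first pass: the mean
    return sum(xi -> (xi - m)^2, x) / (n - 1)   # second pass can't start until m is known
end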

It might be nice to have a meanstd(x) algorithm in Statistics.jl that returns both.

(Statistics.jl does provide a stable single-pass (“streaming”) std algorithm that it uses for iterables that maybe only allow you to loop over them once.)
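A sketch of what such a meanstd could look like, using Welford’s single-pass update (meanstd here is a hypothetical name, not the Statistics.jl implementation, and it assumes real-valued input):

function meanstd(x)
    n = 0
    m = 0.0      # running mean
    M2 = 0.0     # running sum of squared deviations from the current mean
    for xi in x
        n += 1
        d = xi - m
        m += d / n
        M2 += d * (xi - m)   # uses the updated mean
    end
    return m, sqrt(M2 / (n - 1))
end

On x = randn(10^6) this returns approximately (0.0, 1.0), matching (mean(x), std(x)) up to rounding.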

If it fits in the cache then you don’t need to reverse the order either.

Nope, Kahan summation doesn’t help here. Even if the \sum x_k and \sum x_k^2 are exactly rounded — which is better than Kahan and requires something like Xsum.jl — you can still lose all the significant digits when you subtract them.
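To see the cancellation concretely (numbers chosen just for illustration; the naive formula is the sum-of-squares one from above):

using Statistics
x = 1e8 .+ randn(10^6)     # large mean, spread of order 1
n = length(x)
naive  = (sum(abs2, x) - sum(x)^2 / n) / (n - 1)   # E[x^2] - E[x]^2 form
stable = var(x)            # Statistics.var: numerically stable, ≈ 1.0
# `naive` can come out wildly wrong, even negative, because the two huge
# sums agree in nearly all their significant digits before the subtraction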

Nope. If you round to a posit on each operation you still lose associativity (because the order of the rounding changes). Gustafson only claims associativity for “quire” arithmetic that uses arbitrary precision.

But I don’t want to go down the rabbit-hole of debating the promises of posits here. We’ve already had long threads on this, e.g.: Posits - a new approach could sink floating point computation - #17 by simonbyrne


The oldest data is the most likely to no longer be in the cache, which is why reversing seems better when scanning backwards in the second, std phase. I was wrong when I wrote that it all needs to fit in cache (that would be ideal, yes); it should degrade gracefully if the data is no longer all in cache.

So, would it be much less numerically stable?

If the cache holds Z elements, and your array is length n, then you reduce the number of cache misses by at most a factor of (1 - Z/n) for n > 2Z by going backwards in a second pass (assuming LRU replacement; with optimal replacement you would save nothing at all). As n grows \gg Z, any benefits should become negligible. That’s why I said there could be no asymptotic benefit.
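(For a concrete instance of that bound, with numbers assumed just for illustration: if the last-level cache holds Z = 4\times 10^6 Float64 values, i.e. 32 MB, and n = 10^8, the backwards second pass avoids at most Z/n = 4% of its misses.)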

Worse, because of the way cache lines work with prefetching, there’s a good chance you’re going to get worse cache-line utilization going backwards, and this effect could easily dominate.

In fact, something like this happened when I tried it: comparing a forward pass followed by another forward pass (a simple summation loop with @simd) against a forward pass followed by a backwards pass, the latter was slower (until n becomes huge, at which point they’re the same speed):

You can see the performance (points/second summed) dropping off with n as it exceeds the various cache sizes, asymptoting to a constant rate (where it exceeds all cache sizes and is memory bound). And doing a forward followed by a backwards pass is clearly slower (though not by a huge amount, only 5–30%). It’s possible this was poor cache-line utilization (of the smaller caches), but it could also be some interaction with the compiler’s ability to SIMD-vectorize the loop.

Code:
function mysum(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)   # forward pass
        @inbounds s += x[i]
    end
    return s
end

function mysumr(x)
    s = zero(eltype(x))
    @simd for i in reverse(eachindex(x))   # backwards pass
        @inbounds s += x[i]
    end
    return s
end

using BenchmarkTools

# array lengths to benchmark; this particular range is assumed (the original
# post defined `ns` elsewhere), chosen to span the cache sizes
ns = round.(Int, exp10.(range(2, 8, length=25)))

t2 = Float64[]   # forward pass followed by forward pass
for n in ns
    x = rand(n); @show n
    push!(t2, @belapsed (mysum($x); mysum($x)))
end

ts = Float64[]   # forward pass followed by backwards pass
for n in ns
    x = rand(n); @show n
    push!(ts, @belapsed (mysum($x); mysumr($x)))
end
(This was on a 2021 Apple M1 Pro.)

