Functional programming is not capable of achieving absolute top performance

I was just playing around with the best way to compute the mean and standard deviation of a vector of values.

Obviously, `sum(x), std(x)` is the cleanest, but it isn’t the fastest, as I assume that approach needs to scan through the vector x twice. We can easily scan through it once and compute both from that single pass.

From my testing, the function ms is the fastest, as it uses tricks available for indexing (something about AVX instructions on CPUs). This trick is NOT available for the functional programming paradigm using reduce. I tried the VectorizedReduction.jl package but it didn’t work (see ms5).

It’s kinda sad that for Julia to hit top performance, we need to use “vectorized” versions of the code. This reminds me of Python/NumPy and R. Also, echoing the 1.5-language-problem talk at JuliaCon 2023 (sorry, I don’t know who the presenter was), optimised Julia code is often very alien-looking compared to “regular” Julia code, to the point where you might as well call it a descendant language. For the record, when I tested the performance of Python vs Julia (in the past), I could get Numba to hit the same speed as optimised Julia. That could be because my problem was too “toyish” to reveal the sharp edges of Numba, but writing Numba in Python is akin to writing highly optimised code in Julia: it’s kinda in the 1.5-language-problem arena.

With Julia, you can use the GPU much more easily thanks to CUDA.jl, but I think it’s still true that to get the most out of CUDA you would need to go into the CUDA C++ extensions and can’t rely on Julia alone just yet. I might be wrong, as I haven’t been watching closely, but that’s the gist I get.

Are there even faster ways to optimise this?

My code for `sum(x)` and `std(x)`:
```julia
using LoopVectorization
using VectorizedReduction: vvreduce
using VectorizedStatistics: vsum, vstd

function ms(x)
    m = 0.0
    s = 0.0
    # single pass: accumulate the sum and the sum of squares
    @turbo for i in eachindex(x)
        m += x[i]
        s += x[i]^2
    end

    n = length(x)
    m, (s - n*m^2)/(n-1)
end

function ms2(x)
    # functional version: fold over x, carrying (running sum, running sum of squares)
    m, s = reduce(x; init=(0.0, 0.0)) do (m, s), val
        (m + val, s + val^2)
    end

    n = length(x)
    m, (s - m^2/n) / (n - 1)
end

function ms3(x)
    sum(x), sum(x .^ 2)
end

function ms4(x)
    sum(x), reduce((x,y)->x+y^2, x; init=0.0)
end

function ms5(x)
    vvreduce(x; init=(0.0, 0.0)) do (m, s), val
        (m += val, s += val^2)
    end
end

function ms6(x)
    vsum(x), vstd(x)
end

using BenchmarkTools

x = rand(1_000_000);

@benchmark ms($x)
@benchmark ms2($x)
@benchmark ms3($x)
@benchmark ms4($x)
@benchmark ms5($x)
@benchmark ms6($x)


ms(x)
ms2(x)
ms3(x)
ms4(x)

@code_warntype ms(x)
@code_warntype ms2(x)
@code_warntype ms3(x)
@code_warntype ms4(x)
```

Edit: fixed a bug.

1 Like

(Note that this is probably computing a different answer than reduce, because floating-point arithmetic is not associative, and @turbo is probably re-ordering things. If you use sum then it makes an extra effort to get an accurate answer via pairwise summation.)

(Note especially that your algorithm for std is numerically unstable.)
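
For concreteness, here is a toy illustration of the pairwise-summation point (my own example, not from the original post), using Float32 so the rounding shows up at a modest size:

```julia
x32 = fill(1.0f0, 2 * 10^7)

foldl(+, x32)   # naive left-to-right accumulation: 1.6777216f7, stuck at 2^24
sum(x32)        # pairwise summation: 2.0f7, exact in this case
```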

Your fastest code is the opposite of what I think of as “vectorized”: it’s an explicit hand-coded loop (as opposed to calling some opaque library routine that somebody else wrote, maybe in another language).

But yes, squeezing out the maximum performance from an algorithm, especially the last factor of 2, often requires non-obvious tricks and expertise, in any language. (If you think optimizing this loop is hard, try optimizing a hand-coded matrix multiplication or FFT.) At least it’s possible to do this optimization in Julia.

13 Likes

Those Vectorized_ packages are using an entirely different definition of “vectorized” than NumPy and R do. Both packages reference LoopVectorization, which converts semantically elementwise operations into processor instructions on multiple elements (a “vector”) at a time; see Automatic vectorization - Wikipedia.

NumPy and R’s vectorization describes high-level functions that implicitly loop over many inputs in arrays, possibly in addition to taking a single input. To quote that Wikipedia article, “In these languages, an operation that operates on entire arrays can be called a vectorized operation, regardless of whether it is executed on a vector processor, which implements vector instructions”. This second meaning is less common because many languages don’t need to wrap another language’s loops for performance, in which case native loops and higher-order functions like broadcast are used instead. I personally never liked this meaning because arrays are often not vectors, but it is what it is.
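
To make the two meanings concrete, a toy sketch (my own example, not taken from either package):

```julia
v = rand(1_000)

# "Vectorized" in the NumPy/R sense: one high-level operation over the whole
# array, with the loop hidden inside the broadcast/library machinery.
y = v .^ 2

# "Vectorized" in the automatic-vectorization sense: an explicit loop that the
# compiler is allowed to turn into SIMD instructions that process several
# elements at a time.
function sumsq(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]^2
    end
    s
end

sumsq(v)
```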

3 Likes

Thanks, those are probably technical issues, but from a user’s perspective I just care about the precision to an acceptable degree, so they are non-issues as far as I know.

The vectorised point is also correct, but I was thinking there might be a `mean_std` or a `moments` function to compute these by wrapping the code in ms. Hence “vectorised” as an eventuality, not as it is currently implemented.

Try your algorithm with x = fill(1/3, 10^5) … the variance should be zero, but your algorithm returns a variance of -1e9, which is huge and has an impossible sign. Actually, this is because of a simple bug: you are multiplying n*m^2 when you should be dividing m^2/n (or, equivalently, you forgot a step m /= n).

Even with this fix, however, your code can easily return a negative number for the variance with x = fill(1/3, 10^6). As another example, it returns a completely wrong variance (and sometimes a negative number) for x = randn(10^6) * 1e-9 .+ 1. This is well known: you are using the “naive algorithm” from this Wikipedia page on “Algorithms for calculating variance”.
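
For reference, a minimal sketch of Welford’s single-pass algorithm from that Wikipedia page (the function name is mine, and it is written for clarity rather than speed):

```julia
function mean_var_welford(x)
    n = 0
    m  = 0.0      # running mean
    M2 = 0.0      # running sum of squared deviations from the mean
    for v in x
        n += 1
        d = v - m
        m += d / n
        M2 += d * (v - m)
    end
    m, M2 / (n - 1)   # sample mean and unbiased sample variance
end

mean_var_welford(fill(1/3, 10^6))   # variance comes out as 0.0, as it should
```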

3 Likes

Right, I didn’t know this. I only did one numerical computing course in first-year uni, and I recall things like catastrophic cancellation etc., but I don’t recall this issue, which I think must have been covered. Thanks for pointing it out.

But say I implement the non-naive algorithm; the key points of the posts still stand, no?

Your fastest version (ignoring numerical instability), i.e. the explicit loop, is probably more readable to most programmers than the version using reduce. It’s highly subjective to say that code in a functional style is more “regular” Julia code.

If your key point is that optimizing code is hard, and extracting maximum performance sometimes requires expertise and lower-level tricks, then I agree.

But functional-style programming is not bad at all in Julia — if you eliminate the @turbo, then the loop-based code ms(x) obtains the same performance as the functional reduce implementation ms2(x), which is amazingly good and occurs because Julia is able to inline the higher-order function call with compiler specialization. Moreover, on my machine, the built-in @simd macro achieves the same performance as LoopVectorization’s @turbo, about a 5x speedup in both cases. The reason you have to opt in is that it changes the results (using SIMD instructions requires re-associating the additions), which the compiler is not allowed to do unless you give it explicit permission.
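
For concreteness, the @simd variant looks like this (ms_simd is my name for it; exact timings will of course vary by machine):

```julia
function ms_simd(x)
    m = 0.0
    s = 0.0
    @inbounds @simd for i in eachindex(x)   # permission to re-associate the sums
        m += x[i]
        s += x[i]^2
    end
    n = length(x)
    m, (s - m^2/n) / (n - 1)
end

# using the x and BenchmarkTools from the first post:
@benchmark ms_simd($x)
```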

Moreover, “vectorization” in a language like Python or Matlab or R means something different — in those languages, the only way to obtain high performance is to hope someone has written a “vectorized” library routine (probably in another language) that performs the critical task of your calculation on a large array (a “vector”) all at once. If you can’t find such a library routine (and perhaps contort your code to use it), then you simply hit a wall: you can’t make it faster without dropping down to a lower-level language (C, Cython, Numba, etc.).

And even in a lower-level language, the same caveat applies: obtaining maximum performance often requires expertise and low-level tricks. (MIT has a whole course on performance optimization whose starting point is using C, and they spend the rest of the semester on strategies to optimize C code.)

PS. Unlike a low-level language, making the code fast in Julia does not necessarily require making it less generic. If you had initialized your sums to zero(eltype(x)) and zero(eltype(x))^2, respectively, then your code would work on any container type and any number type, even user-defined numeric types like Unitful.jl’s.
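
For example, a sketch of such a generic variant (ms_generic is my name for it, not benchmarked):

```julia
function ms_generic(x)
    m = zero(eltype(x))      # accumulator in the element type
    s = zero(eltype(x))^2    # accumulator in the squared type (e.g. squared units)
    @inbounds @simd for i in eachindex(x)
        m += x[i]
        s += x[i]^2
    end
    n = length(x)
    m, (s - m^2/n) / (n - 1)
end

ms_generic(rand(Float32, 10^6))   # works unchanged for a different element type
```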

PPS. Most programs, in any language, do not achieve anywhere close to the maximum possible performance, precisely because it is so hard that it is rarely worth the tradeoffs in effort, maintainability, and flexibility. If you get within a factor of 5 without trading off too much, you are doing pretty well.

24 Likes

I don’t really understand what your question has to do with functional programming?

Functional programming and functional datastructures are the discipline of programming in a world of non-overwriting memory.

This is an extremely good fit for certain settings, like flash memory / SSDs (which are genuinely non-overwriting memory for physical reasons) or some distributed systems. You may also consider it a good fit for some single-node multi-core settings (it helps prevent a tug-of-war between cores over who owns a cache line).

Many filesystems like ZFS are effectively functional programming (implemented in the C language, but that’s just cosmetics).

I’d argue that functional programming is a less good fit for day-to-day programming on “normal” hardware architectures – standard CPUs and DRAM are actually pretty OK with overwriting memory.

And you pay a lot in complexity and performance for functional programming (i.e. for avoiding overwriting memory). This is sometimes worth it, and functional programming languages can help you deal with this added complexity. But if you don’t have a very good reason for this constraint, it makes zero sense to abide by it.

1 Like

I think @xiaodai just means programming with higher-order functions like reduce and similar patterns (which are closely associated with functional programming even if they are not precisely the same thing).

I think the fundamental question being raised here is why “optimised Julia code is often very alien-looking vs ‘regular’ Julia code”, the latter being code that commonly uses high-level abstractions and is often written to target compactness, clarity, and generality rather than absolutely maximizing performance. As I said above, I agree — heavily optimizing code is hard and the resulting code often looks a bit weird.

6 Likes

4 posts were split to a new topic: Compilers, purity, and variance computation

The pattern for fast numerical and scientific code (large simulations, etc.) is to allocate big fixed-sized arrays to hold everything and then mutate their contents. The functional programming idea is the opposite: pass around immutable data structures. This makes it easier to avoid mistakes when writing possibly complicated code for parallel execution, but it will never be as efficient. However, Julia is great for writing programs in a functional style that run fast.

4 Likes

Can you elaborate on this? I mean, how do you do large simulations in Julia without allocating big fixed-sized arrays?

By “not dynamically allocating” they mean you allocate all the arrays once, before the computation starts, and pass the arrays into the functions to be mutated, rather than dynamically allocating new arrays throughout the computation. E.g. using map! instead of map.
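
A minimal sketch of that pattern (the names here are just illustrative):

```julia
state = rand(1_000)
buf   = similar(state)                  # allocated once, up front

step!(out, s) = map!(v -> 2v, out, s)   # writes into `out`; allocates nothing

for _ in 1:1_000
    step!(buf, state)                   # reuses the same buffer every iteration
end
```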

1 Like

That’s what I mean. And I would do it that way (Fortran style) in Julia, too, for high-performance computing.

1 Like

No. GC pressure is typically not the only issue with persistent data structures like HAMTs; it’s also slow read access, bad cache locality, and larger memory consumption from all the small heap objects.

Don’t get me wrong, these are very nice techniques and amazing datastructures – a mere 3x slowdown is an incredible accomplishment, if achievable.

Persistent datastructures tend to be quite a bit more complicated than their mutating counterparts – unsurprisingly, since you have to program a much weaker machine where memory can only be written once. Compare Scala’s or Clojure’s Vector with a simple array.

Worse: persistent datastructures are often nontrivial to reason about, performance-wise. It is super easy to mess up and produce accidentally quadratic code when e.g. working with singly linked lists.

Persistent datastructures are a marvel, but the requirements leading to them are pretty niche in practice. The thing I’m strenuously arguing against is to use them as default “just because”, or because of misguided “purity” considerations, when a mutating datastructure would do the job just as well.

A good example of well-considered use is Julia’s implementation of ScopedValue. We need O(1) snapshots because Task creation must be fast, so Julia now has a HAMT in Base. I already mentioned ZFS (the filesystem), which gains robustness, snapshots, performance on SSDs (most writes are sequential, from the SSD controller’s perspective), and some notion of simplicity (e.g. one should never end up with a broken/inconsistent FS due to power outages – instead one has a consistent snapshot, plus ideally a ZIL to play back).
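
For anyone unfamiliar, a minimal sketch of what ScopedValue usage looks like (assuming Julia 1.11+, where it lives in Base.ScopedValues; the ScopedValues.jl package provides it for older versions):

```julia
using Base.ScopedValues

const verbosity = ScopedValue(0)

log_stuff() = verbosity[] > 0 && println("being verbose")

with(verbosity => 1) do
    log_stuff()    # sees verbosity[] == 1, as do Tasks spawned in this scope
end
log_stuff()        # back to the default, verbosity[] == 0
```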

3 Likes

One idea that is increasingly appearing in languages like Roc is to program in a functional, non-mutating style, while the compiler uses alias analysis to determine when it is safe to mutate an array in place under the hood. That way you can (ideally) get the reasoning benefits of functional programming without the performance cost.

1 Like

Very interesting. I wonder how successful this approach is and how good the coverage is, i.e. are there lots of corner cases?

2 Likes

I’m no expert on compilers, but I would have thought that the compilers for pure, functional languages like Haskell would already be doing this.

1 Like

I don’t know whether I misunderstood you here, but: This is probably a really really terrible idea.

In-place mutating and return-new-object APIs are quite different, and letting something as brittle and unpredictable and heuristic as alias analysis decide on that is terrible!

Both kinds of APIs become the same if you have exclusive ownership / a unique_ptr / can use something like std::move in C++.

This is nice! I am not against that, and whether to call that “mutating” or “functional” is potato-potato.

So potato-potato whether you consider this Rust-style “I exclusively own the object so it’s safe to mutate” vs C++-style “I own the object so it’s safe to std::move it to reconstruct a new object that happens to be the old object post mutation”. In the end, you must provide the same guarantees to the compiler, and get the same machine code emitted.

But this should not be an “optimization”; it is a core semantic that must be part of the language’s type system and contract. Not fickle alias analysis, but a strict language guarantee that is ideally obvious and legible to the programmer.

It is especially noteworthy (albeit tautological) that all the amazing and interesting programming patterns that require persistent datastructures cannot ever permit this optimization.

1 Like