Where does Julia (and its ecosystem) provide the greatest speedup, and where does it lag behind the most (compared to, e.g., Python)?

Right. At the moment, if you want to get good scaling, you need to spawn tasks that have enough work to do to be worth spawning, and split the work up in such a way that there’s enough parallelism to get it all done faster in the end. Reducing spawn overhead makes that easier to accomplish. In the limit, if you could reduce spawn overhead to zero, you could make tasks that do tiny amounts of work and utilize all possible parallelism. Of course that’s not possible, but you get the idea.
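To make that trade-off concrete, here is a minimal sketch (not from the post above; the chunking scheme is illustrative) of how spawn overhead forces a minimum useful task size:

```julia
# Hypothetical illustration: a reduction split into tasks. Each task must
# carry enough work to amortize the cost of Threads.@spawn; cheaper spawns
# let `nchunks` grow (finer tasks) without losing to the serial version.
function threaded_sum(xs; nchunks = Threads.nthreads())
    chunk = cld(length(xs), nchunks)
    tasks = map(1:nchunks) do i
        lo = (i - 1) * chunk + 1
        hi = min(i * chunk, length(xs))
        Threads.@spawn sum(@view xs[lo:hi])
    end
    return sum(fetch, tasks)  # wait for every task and combine the results
end
```

If `nchunks` is pushed so high that each chunk holds only a few elements, spawn overhead swamps the per-task work and the parallel version loses to plain `sum`; cheaper spawns move that break-even point lower.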


We expect nothing less!


It’s also worth noting that, as far as I’m aware, no other BLAS library can profitably turn on multithreading at matrix sizes as small as Octavian does. Presumably that’s because none of them has a multithreading system with overhead as low as CheapThreads.jl’s.

E.g. https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/24#issuecomment-766185684

This might be an AMD tuning issue; I’d be interested to see a more up-to-date run on someone’s Intel CPU.
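For anyone who wants to try this locally, a sketch of the comparison (assuming Octavian’s exported `matmul!`; the size `n = 48` is an arbitrary “small matrix” choice, not taken from the linked issue):

```julia
using LinearAlgebra, Octavian, BenchmarkTools

n = 48                      # small enough that threading rarely pays off
A, B = rand(n, n), rand(n, n)
C = similar(A)

@btime mul!($C, $A, $B)              # default BLAS (e.g., OpenBLAS)
@btime Octavian.matmul!($C, $A, $B)  # Octavian with low-overhead threading
```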


Yes, I agree with this point, and I think Go’s dev team has been making clever decisions for the domain they are targeting. But is this also a superpower in the domain where Julia shines? Go has been criticised as a non-customizable language (e.g., it took about a decade to get generics). Although Go developers effectively proved that customizability does not really matter in the domains Go is used for (e.g., networking applications), I think the composability of Julia programs is deeply rooted in its customizability (the many custom array types, the broadcasting infrastructure, etc.).

I don’t think it’s entirely far-fetched to have some well-defined abstract protocol (but not just coroutines; otherwise, yes, it would fracture the ecosystem) upon which the ecosystem is built, while performance engineers can tweak or build tailor-made (nested) runtimes. Although doing this for unrestricted concurrency is very challenging (something I’m still interested in), if we restrict ourselves to parallel computations, I think it’d be much easier and actually very beneficial. In fact, I have already done something like this in FoldsThreads.jl (ref: ANN discourse thread) for a rather restricted kind of computation (a slightly extended class of divide-and-conquer algorithmic skeletons). Each library API, and even each use of each API, can have very different scheduling requirements (e.g., some want throughput, some want low latency). So I can’t help wondering if/how we can have a unified ecosystem while maximizing the customizability of the scheduling. In a way, it’s a theme common to Halide/TVM/Tiramisu/MLIR.
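As a concrete illustration of that customizability, FoldsThreads decouples *what* is computed from *how* it is scheduled via an executor argument (a sketch in the style of the FoldsThreads.jl README; the executor names are its exported types):

```julia
using Folds, FoldsThreads

xs = randn(1_000_000)

# The reduction is the same; only the executor (the "how") changes:
Folds.sum(abs2, xs, WorkStealingEx())  # work stealing, load-balancing oriented
Folds.sum(abs2, xs, DepthFirstEx())    # depth-first scheduling, latency oriented
```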

Isn’t “that’s not possible” actually the point? I think a more realistic and beneficial approach is to make parallel tasks fusible by the compiler, runtime, and libraries. That would let programmers denote the parallelism existing in a program without incurring the run-time cost. Decoupling the *how* from the *what* enables more optimizations by the compiler, runtime, and engineers: whether to exploit the parallelism in the actual compiled result can be decided lazily at much later stages. An alternative notion of task for may-happen-in-parallel parallelism (JuliaLang/julia#39773) (Cilk-like tasks) is a prerequisite for this (and other optimizations), because we can’t fuse tasks that use unrestricted concurrent communication APIs without potentially introducing deadlocks.
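A sketch of the underlying argument (the tasks and sizes are illustrative): tasks that communicate through an unbuffered channel must genuinely overlap in time, so fusing (inlining) them would deadlock, whereas a task that merely *may* happen in parallel can safely be run inline:

```julia
# These two tasks must overlap: the unbuffered channel makes each block
# until the other runs, so serializing (fusing) them would deadlock.
ch = Channel{Int}(0)
producer = Threads.@spawn put!(ch, 1)
consumer = Threads.@spawn take!(ch)
fetch(consumer)  # works only because both tasks actually run concurrently

# A task with no inter-task communication only *may* happen in parallel;
# a scheduler or compiler is free to run it inline (i.e., fuse it) instead:
t = Threads.@spawn sum(1:1_000)
total = fetch(t) + sum(1_001:2_000)
```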


Actually, I said too much 🙂 Obviously, it’s better to make the scheduler as fast as possible. I wanted to bring up that there are multiple, mostly orthogonal aspects along which we can improve Julia’s thread-based parallelism.
