Where does Julia (ecosystem) provide the greatest speedup, and where does it lag the most behind (compared to e.g. Python)?

I enjoyed reading some of your papers on CAS in the 1990s. Time flies.

1 Like

Thanks for a pointer to that video! It even mentions me!

But it isn’t really correct, at least with respect to Maxima. As I said, people sometimes think they are comparing apples to apples, and yet the programs they are comparing do quite different things, and at different times. John Lapeyre claims a time of 3.8 seconds for symengine.jl to do a task, and an astonishing 870 seconds for Maxima.

I don’t know what computer is being used, but on my MacBook Pro from 2009 (2.66 GHz, 8 GB RAM), with Maxima version 5.41 (easily downloaded as binary, source, etc., so you can run it yourself) running in SBCL Lisp,

the time is not 870 seconds. It is 0.372 seconds.

It appears Maxima is 10x faster than SymEngine (though on different machines).

Here is the example. Let f = 1+x+y+z. Let p = f^20 with all coefficients explicitly computed.
Time the multiplication of p*(p+1), again with all coefficients explicitly computed.
The script, in Maxima, is
showtime:all$
f:x+y+z+1$
p:rat(f)^20$
q:p*(p+1)$
How to make this slower? Well, you could include the display time for the result, which has 12,341 terms.
Or you could use the “expand” command, which does numerous additional computations other than polynomial multiplication, uses a very general tree expression form, and spends time simplifying already simplified subexpressions as they are generated.

The right way to do this is to use the command rat(), which converts to a canonical rational/polynomial form (the only form available in those allegedly fast systems), and does the multiplication, using the knowledge that there are no additional simplifications possible for terms like x^3y^10z^7.

So it is not that Lisp is slow; it is that Maxima is being asked to do something quite a bit more general, which for this specific case produces the same answer. The documentation for the command expand() specifically says to use ratexpand() for polynomials. I used rat(), which leaves the internal recursive form alone; ratexpand() in this example splays out the result into a sum of terms. Ratexpand() is somewhat slower: 0.611 seconds.

I would expect Reduce, asked a similar, correctly phrased question, to respond at least as fast.

I am fully in favor of finding new and previously unavailable capabilities to solve applied problems, programmed by enthusiastic Julia fans. I would take, with a considerable grain of salt, claims that (serial) Julia programs will do symbolic computation faster than comparable programs in (decently compiled) Lisp. When you think about it, why would Julia be faster? Are the Lisp systems just spinning their wheels? Were the CAS programmers and the implementers of the underlying Lisp (perhaps written in Lisp and C) just lousy programmers? Perhaps they were building into the Lisp system capabilities that were not needed and constituted a drag on the total system? (you could argue about reference counts vs garbage collection, maybe?). Anyway, is there something that Julia leaves out, by comparison?
(Oh, MPI, CUDA, … haven’t been used in Maxima, but who knows.)
RJF

6 Likes

I haven’t done it myself or seen something extensive yet. But yeah, it’d be nice to have one.

Rust actually has Rayon (inspired by Cilk), which provides task parallelism with limited concurrency support (“may happen in parallel”), just like Cilk: rayon::scope - Rust. A quick web search also suggests that D and Nim have task-parallel frameworks.
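
For comparison, here is a minimal sketch of the same structured “may happen in parallel” pattern in plain Julia (the function name and serial cutoff are arbitrary), forking with Threads.@spawn and joining with fetch:

# Divide-and-conquer sum: the two halves may run in parallel;
# fetch() joins the spawned task before the results are combined.
function psum(xs, lo = firstindex(xs), hi = lastindex(xs))
    hi - lo < 10_000 && return sum(view(xs, lo:hi))  # serial base case
    mid = (lo + hi) ÷ 2
    right = Threads.@spawn psum(xs, mid + 1, hi)     # may happen in parallel
    left = psum(xs, lo, mid)
    return left + fetch(right)
end

psum(rand(10^6))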

I wouldn’t dismiss those new “systems” languages (whatever that means) as non-contenders for becoming a good parallel platform (for computing in general). For one thing, they all seem to trend toward having an ownership model built into the compiler. That is a big relief when programming in such a system, especially for highly concurrent programs.

Of course, this kind of system tends to get in the way if you are interactively exploring things while programming. Furthermore, for scientific/technical computing, we don’t need full concurrency semantics most of the time, so we are probably OK not optimizing Julia for manual reasoning about, and assurance of, “hard” concurrent problems.

Yup, we can do this for exactly the reason DifferentialEquations.jl was successful: good higher-order function support. It should be possible to go beyond simple distributed iteration and do something like GPU-on-distributed, and also support custom containers beyond simple arrays.
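
As a rough illustration (the worker count and workload below are made up), the higher-order-function style means the same user function can be handed to a serial map or to Distributed.pmap unchanged:

using Distributed
addprocs(4)                                   # four local worker processes
@everywhere heavy(x) = sum(abs2, sin.(x .* (1:10_000)))   # toy workload

serial = map(heavy, 1:100)
distributed = pmap(heavy, 1:100)              # same function, distributed iteration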

FLoops.jl (via Transducers.OnInit) can already kind of do this. But I think we need a more straightforward syntax and a good tutorial.
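
A rough sketch of what that looks like today, assuming FLoops’ @floop/@init/@reduce syntax (@init is the user-facing spelling of Transducers.OnInit):

using FLoops

@floop for x in 1:10^6
    @init buf = zeros(3)        # initialized once per task, not once per iteration
    buf .= (x, 2x, 3x)
    @reduce s += sum(buf)
end
s                               # reduction result is available after the loop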

Yes, I personally agree with this. There are still a lot of possible improvements. Currently, people mainly worry about spawn/wait overhead. But I guess the next thing to come up will be concurrent GC compared to Go or Java. There is still a long way to go in terms of the base system and the ecosystem. But I believe we are on a good trajectory.

7 Likes

For your polynomial benchmark, SymEngine is slower than almost every system I tested (Mathematica, Maple, Nemo). Here are the timings on my 2-year old quad-core i7 laptop.

Mathematica 12.1:

In[1]:= p=Expand[(1+x+y+z)^20];                                                 
In[2]:= Timing[q=Expand[p*(p+1)];] (* time in seconds *)                                               
Out[2]= {0.694881, Null}

Maple 2020:

> p:=expand((1+x+y+z)^20):                                                     
> st:=time(): q:=expand(p*(p+1)): time()-st;
                                     0.074

Nemo (all polynomials are expanded by default):

julia> using Nemo, BenchmarkTools
julia> r, (x,y,z) = PolynomialRing(ZZ, ["x", "y", "z"]);
julia> p = (1+x+y+z)^20;
julia> @btime q=p*(p+1);
  3.658 ms (13 allocations: 316.35 KiB)

Finally, SymEngine accessed from the Julia interface:

julia> using SymEngine, BenchmarkTools
julia> @vars x y z
(x, y, z)
julia> p = expand((1+x+y+z)^20);
julia> @btime q = expand(p*(p+1));
  3.233 s (26725925 allocations: 212.47 MiB)

The Nemo timing is pretty surreal, but it’s supposed to be the state of the art, powered by recent versions of FLINT.

We’ve been using ThreadingUtilities.jl/CheapThreads.jl in Octavian.jl, and it works pretty well.

1 Like

Indeed that function is powered by FLINT. Can you quickly test the recent versions of AbstractAlgebra.jl? That’s slowly replacing FLINT throughout Nemo but it seems it hasn’t replaced this portion yet. We were looking to use that for the polynomial representation of SymbolicUtils.jl since it seems to greatly outperform MultivariatePolynomials.jl, but it would be interesting to place it along the scale you have going there. (Also, SymPy just for kicks)

How did you get the impression that FLINT is being replaced in Nemo?

Here’s my timing for AbstractAlgebra.jl on the same machine:

julia> using AbstractAlgebra, BenchmarkTools
julia> r, (x,y,z) = PolynomialRing(ZZ, ["x", "y", "z"]);
julia> p = (1+x+y+z)^20;
julia> @btime q=p*(p+1);
  176.027 ms (39495 allocations: 1.45 MiB)

For SymPy, there is memoization, so I can only time it once using @time:

julia> using SymPy
julia> (x, y, z) = symbols("x y z")
(x, y, z)
julia> p=expand((1+x+y+z)^20);
julia> @time q=expand(p*(p+1));
365.774315 seconds (9.17 k allocations: 542.630 KiB)
1 Like

Just watching AA.jl slowly begin to cover more and more of Nemo (as stated in Development roadmap for AbstractAlgebra.jl · Issue #64 · Nemocas/AbstractAlgebra.jl · GitHub). If the plan is to never include specializations in Julia, that would seem a little odd given how much is getting covered there.

Thanks! Wow, looks like we should be making use of the FLINT wrapper for now.

The plan of AbstractAlgebra was to supplement the fast C implementation coming from FLINT, not to replace it. The idea is to provide generic functionality which is not covered by FLINT, e.g. polynomial rings over polynomial rings etc. Because the implementation is generic, one gets another implementation of e.g. Z[x,...] for free, this time using BigInt.
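
For instance, here is a small sketch (ring and variable names arbitrary) of a generic construction that FLINT itself does not cover, a polynomial ring over a polynomial ring:

using AbstractAlgebra

R, t = PolynomialRing(ZZ, "t")             # Z[t] with BigInt coefficients
S, (x, y) = PolynomialRing(R, ["x", "y"])  # (Z[t])[x, y]
p = (t*x + y + 1)^3                        # works because the code is generic over the base ring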

1 Like

Good to know. It would be nice to drop a binary but it’s not the most crucial thing in the world since the jlls seem to be pretty stable.

I was concerned SymPy might be getting a bad name for slowness here because it is being called through SymPy.jl, but I verified that this code, which only uses PyCall, has essentially the same timings for n in {2, 4, 8, 16}:

julia> using PyCall

julia> sympy = PyCall.pyimport_conda("sympy", "sympy");

julia> x,y,z = sympy.symbols("x,y,z")
(PyObject x, PyObject y, PyObject z)

julia> n = 16;   # example value; the timings were checked for n in {2, 4, 8, 16}

julia> p = sympy.expand(sympy.Pow((1 + x + y + z), n));

julia> @time q = sympy.expand(sympy.Mul(p, sympy.Add(p,1)));

I don’t think PyCall introduces much overhead here, though it may, so I would expect the timings within a Python session to be similar.

Since one of the themes in this thread has been the composability of parallel Julia programs, I cannot help but comment that there is a concern about the composability of ThreadingUtilities.jl (and hence CheapThreads.jl, which depends on it). There is a discussion in: Overhead of `Threads.@threads` - #16 by tkf. @Elrod said he had some ideas for fixing it, so it may be fixed or mitigated at some point, though.

I think composability for parallel Julia programs will be a community effort. For example, APIs with data races (e.g., a function that mutates global state without holding a lock) cannot be composed with parallel functions. A more subtle example is that, if a package manipulates the scheduling aspects of tasks too eagerly, it may not work well in the context of a large application. The basis of composability is the centralized management of computation resources (= CPU + cache). Of course, I think any “hacks” are justified if they help you in the end. It is your software, after all. However, I cannot imagine a scenario in which the Julia ecosystem evolves into a composable parallel platform without trusting the core parallel task runtime.

5 Likes

I think a really key point here, though, is that having a standard way to do composable threading is very different from having an optional one that you can use if you want to. It’s only really effective if everyone uses it. That’s why I don’t credit Cilk towards C/C++: not only do you need a special compiler and language extensions, but if one library uses pthreads and another library uses Cilk, the combination is not going to scale well. Same with other systems: a composable threading system is only effective if everyone buys in.

Go’s superpower as a language seems to be the fact that they implemented an incredibly good built-in task-based threading system and absolutely everyone uses it. It’s so important that the keyword for it is go — the same as the name of the language. That and they’ve optimized the heck out of it so that it’s really, really efficient and reliable and the garbage collector is very low latency despite the threading.

Reducing spawn overhead is definitely a compiler team todo, but hasn’t lately been as high up as reducing compiler latency.

14 Likes

For what it’s worth, if I recall correctly (@chriselrod correct me if I’m wrong) in Octavian the spawn overhead was most noticeable for small matrix sizes, where every nanosecond counts when you are trying to match the GFLOPS of OpenBLAS and MKL. For larger matrix sizes, it was less of an issue.

Right. At the moment if you want to get good scaling, you need to spawn tasks that have enough work to do to make it worth spawning them and split the work up in such a way that there’s enough parallelism to get all the work done faster in the end. Reducing spawn overhead makes that easier to accomplish. In the limit, if you can reduce spawn overhead to zero, you can make tasks that do tiny amounts of work and utilize all possible parallelism. Of course that’s not possible but you get the idea.
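
A rough sketch of that tradeoff as it stands today (the chunk count is an arbitrary tuning knob, not a recommendation): spawn one task per chunk rather than per element, so each task does enough work to amortize the spawn cost:

function chunked_map!(f, out, xs; nchunks = 4 * Threads.nthreads())
    chunksize = cld(length(xs), nchunks)
    @sync for r in Iterators.partition(eachindex(xs), chunksize)
        Threads.@spawn for i in r          # one task per chunk, not per element
            out[i] = f(xs[i])
        end
    end
    return out
end

xs = rand(10^6)
chunked_map!(sin, similar(xs), xs)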

3 Likes

We expect nothing less!

8 Likes

It’s also worth noting that, as far as I’m aware, no other BLAS library is able to profitably turn on multithreading at sizes as small as Octavian does. Presumably the reason is that none of them has a multithreading system with overhead as low as that of CheapThreads.jl.

E.g. https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/24#issuecomment-766185684

This might be an AMD tuning issue, I’d be interested to see a more up to date run on someone’s Intel CPU.
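
If anyone wants to try this locally, a comparison at a deliberately small (arbitrary) size looks roughly like:

using Octavian, LinearAlgebra, BenchmarkTools

A = rand(48, 48); B = rand(48, 48); C = similar(A)
@btime mul!($C, $A, $B)               # OpenBLAS/MKL via LinearAlgebra
@btime Octavian.matmul!($C, $A, $B)   # Octavian, which can thread profitably at small sizes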

1 Like

Yes, I agree with this point, and I think Go’s dev team has been making clever decisions for the domain they are targeting. But is this also a superpower in the domain where Julia shines? Go has been criticised as a non-customizable language (e.g., it took about a decade to get generics). Although the Go developers effectively proved that customizability does not really matter in the domain Go is used for (e.g., networking applications), I think the composability of Julia programs is deeply rooted in customizability (many custom array types, the broadcasting infrastructure, etc.).

I don’t think it’s entirely far-fetched to have some well-defined abstract protocol (but not just coroutines; otherwise, yes, it’d then fracture the ecosystem) upon which the ecosystem is built, while some performance engineers can tweak or build tailor-made (nested) runtimes. Although doing this for unrestricted concurrency is very challenging (and I am still interested in it), if we restrict it to just parallel computations, I think it’d be much easier and actually very beneficial. In fact, I have kind of already done this in FoldsThreads.jl (ref: ANN discourse thread) for a rather restricted kind of computation (a slightly extended class of divide-and-conquer algorithmic skeletons). Each library API, and even each use of each API, can have very different requirements in terms of scheduling (e.g., some want throughput, some want low latency). So, I can’t help wondering if/how we can have a unified ecosystem while maximizing the customizability of the scheduling. In a way, it’s a theme common to Halide/TVM/Tiramisu/MLIR.
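
Concretely, the kind of scheduling customizability I mean looks roughly like this (the exact executor names may differ; see the FoldsThreads.jl docs): the algorithm stays the same and only the executor argument changes:

using Folds, FoldsThreads

xs = rand(10^6)
Folds.sum(sin, xs)                    # default threaded executor
Folds.sum(sin, xs, WorkStealingEx())  # work-stealing scheduler, better for load imbalance
Folds.sum(sin, xs, TaskPoolEx())      # task-pool executor, aimed at latency-sensitive settings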

Isn’t “that’s not possible” actually the point? I think a more realistic and beneficial approach is to make parallel tasks fusible by the compiler, runtime, and libraries. It’d let programmers denote the parallelism that exists in the program without incurring the run-time cost. Decoupling how and what enables more optimizations by the compiler, runtime, and engineers. Whether the parallelism is exploited in the actual compiled result can then be decided lazily at much later stages. An alternative notion of task for may-happen-in-parallel parallelism (JuliaLang/julia#39773) (Cilk-like tasks) is a prerequisite for this (and other optimizations), because we can’t fuse tasks if they try to use unrestricted concurrent communication APIs without potentially introducing deadlocks.

7 Likes

Actually, I said too much :slight_smile: Obviously, it’s better to make the scheduler as fast as possible. I wanted to bring up that there are multiple, mostly orthogonal, aspects in which we can improve Julia’s thread-based parallelism.

4 Likes