Where does Julia (ecosystem) provide the greatest speedup, and where does it lag the most behind (compared to e.g. Python)?

I might be missing something, but the last time I tried to measure Julia’s thread overhead compared to C’s, it was atrociously slow. Both the interface and the implementation appear to be in flux (there’s @threads and @spawn which, if I understand correctly, don’t use the same scheduler), and “core” functionality (such as parallelizing broadcast, or reductions) is split between several packages that are not easy for beginners to find. I haven’t found an equivalent to OpenMP’s model of private and shared variables. There have been attempts at composing threading with BLAS and FFTW, but I don’t think it’s been resolved. These are hard problems (both implementation and APIs), Julia has set itself ambitious goals, and things are improving fast, but I wouldn’t call the situation “great”.
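
The closest I’ve come to OpenMP-style private variables is hand-rolling per-task accumulators. A Base-only sketch (my own illustration, not any standard API; `threaded_sum` is a made-up name):

```julia
using Base.Threads

# Sketch: emulate OpenMP's private/shared model by hand. Each task gets its
# own "private" partial sum over a chunk of the input; the final combine is
# the only "shared" step.
function threaded_sum(xs)
    chunks = Iterators.partition(xs, cld(length(xs), nthreads()))
    tasks = map(chunks) do chunk
        @spawn sum(chunk)          # private accumulator per task
    end
    return sum(fetch, tasks)       # shared combine at the end
end

threaded_sum(1:1_000_000)          # same result as sum(1:1_000_000)
```

It works, but it’s all manual bookkeeping compared to a `private(...)` clause.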

Regarding distributed computing, essentially any language that can call into MPI can do distributed. I have looked at Julia’s distributed functionality, but I’m not sure I find it more convenient than MPI, and I trust it less to work in an optimized way on supercomputers, so I just used MPI.
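
For concreteness, the built-in Distributed version of a simple reduction looks something like this (a stdlib-only sketch of the functionality I mean):

```julia
using Distributed

# Sketch: the stdlib Distributed counterpart of an MPI-style reduction.
# addprocs starts local worker processes; @distributed with a reducer
# combines per-worker partial results (roughly the role of MPI_Reduce).
addprocs(2)

total = @distributed (+) for i in 1:100
    i^2                            # each worker computes a chunk of terms
end

rmprocs(workers())                 # shut the workers down again
total                              # == sum(i^2 for i in 1:100)
```

It is certainly terser than MPI, but how the process placement behaves on a supercomputer’s launcher is exactly the part I don’t trust without testing.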

3 Likes

The @spawn overhead is high, but the composability of task-based parallelism is really useful if used correctly (i.e. at a high enough granularity that the overhead is negligible). If you do need a lower spawn cost, then you can essentially disable the scheduler using tools like:

which are still built on Julia’s threading but sidestep the scheduler.
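
For example, a divide-and-conquer sketch with a serial cutoff keeps the number of spawns small relative to the work (`psum` and `cutoff` are my own names for illustration):

```julia
using Base.Threads

# Sketch: amortize @spawn overhead with a serial cutoff, so tasks are only
# created when there is enough work to pay for the scheduler round-trip.
function psum(f, xs; cutoff = 10_000)
    length(xs) <= cutoff && return sum(f, xs)   # small: stay serial
    mid = length(xs) ÷ 2
    left  = @spawn psum(f, view(xs, 1:mid); cutoff)        # one half in a task
    right = psum(f, view(xs, mid+1:length(xs)); cutoff)    # other half inline
    return fetch(left) + right
end

psum(abs2, collect(1:100_000); cutoff = 1_000)
```

With a sensible cutoff, the per-spawn cost is a rounding error next to the work each task does.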

4 Likes

I don’t know why Reduce doesn’t seem to be fast, but I don’t seem to be the only one who has noticed it. For example, in John’s talk at 11:20 you see something similar to what I always see:

Maybe everyone is just using the wrong version of Lisp? Or not tuning the installation correctly? Maybe. I just use the one provided by chakravala/Reduce.jl (a symbolic parser generator for Julia expressions using the REDUCE algebra term rewriter). But I couldn’t figure out how to get and ship a fast version of it, so I turned to something that seems to benchmark really well: William Hart’s AbstractAlgebra.jl

That as a tool combined with a rewrite system seems to be pretty good FWICT. And according to their benchmarks, the Nemo stack benchmarks pretty well:

https://nemocas.org/benchmarks.html

So I’ve found it to be a good foundation. But we still need to do a bit more in-depth benchmarks of our own.

Since we’re pulling this towards applications, my plan is to benchmark it on applications. Simplification of large sparse Jacobians arising from PDE discretizations, generation time for simplified stability regions of explicit Runge-Kutta methods, that kind of stuff. Integration benchmarks, not unit benchmarks.

1 Like

I enjoyed reading some of your papers on CAS in the 1990s. Time flies.

1 Like

Thanks for a pointer to that video! It even mentions me!

But it isn’t really correct, at least with respect to Maxima. As I said, people sometimes think they are comparing apples to apples, and yet the programs they are comparing do quite different things, and at different times. John Lapeyre reports a time of 3.8 seconds for SymEngine.jl on a task, and an astonishing 870 seconds for Maxima.

I don’t know what computer is being used, but on my MacBook Pro from 2009 (2.66 GHz, 8 GB RAM), with Maxima version 5.41 (easily downloaded as binary, source, etc., so you can run it yourself) on SBCL Lisp, the time is not 870 seconds. It is 0.372 seconds.

It appears Maxima is about 10x faster than SymEngine (though on different machines).

Here is the example. Let f = 1+x+y+z. Let p= f^20 with all coefficients explicitly computed.
Time the multiplication of p*(p+1), again with all coefficients explicitly computed.
The script, in Maxima, is
showtime:all$
f:x+y+z+1$
p:rat(f^20)$
q:p*(p+1)$
How to make this slower? Well, you could include the display time for the result, which has 12,341 terms.
Or you could use the “expand” command, which does numerous additional computations other than polynomial multiplication, uses a very general tree expression form, and spends time simplifying already simplified subexpressions as they are generated.

The right way to do this is to use the command rat(), which converts to a canonical rational/polynomial form (the only form available in those allegedly fast systems), and does the multiplication, using the knowledge that there are no additional simplifications possible for terms like x^3y^10z^7.

So it is not that Lisp is slow; it is that Maxima is being asked to do something quite a bit more general, which for this specific case yields the same answer. The documentation for the command expand() specifically says to use ratexpand() for polynomials… I used rat(), which leaves the internal recursive form alone; ratexpand() in this example splays out the result into a sum of terms. Ratexpand() is somewhat slower… 0.611 seconds.

I would expect Reduce, asked a similarly correctly phrased question, to respond at least as fast.

I am fully in favor of finding new and previously unavailable capabilities to solve applied problems, programmed by enthusiastic Julia fans. I would take, with a considerable grain of salt, claims that (serial) Julia programs will do symbolic computation faster than comparable programs in (decently compiled) Lisp. When you think about it, why would Julia be faster? Are the Lisp systems just spinning their wheels? Were the CAS programmers and the implementers of the underlying Lisp (perhaps written in Lisp and C) just lousy programmers? Perhaps they were building into the Lisp system capabilities that were not needed and constituted a drag on the total system? (you could argue about reference counts vs garbage collection, maybe?). Anyway, is there something that Julia leaves out, by comparison?
(Oh, MPI, CUDA, … haven’t been used in Maxima, but who knows.)
RJF

6 Likes

I haven’t done it myself or seen something extensive yet. But yeah, it’d be nice to have one.

Rust actually has Rayon (inspired by Cilk), which has task parallelism with limited concurrency support (“may happen in parallel”), just like Cilk: rayon::scope. A quick web search also suggests that D and Nim have task-parallel frameworks.

I wouldn’t dismiss those new “systems” languages (whatever that means) as non-contenders for becoming a good parallel platform (for computing in general). For one thing, they all seem to trend toward having an ownership model built into the compiler. Programming in such a system is a big relief, especially for highly concurrent programs.

Of course, this kind of system tends to get in the way if you are interactively exploring things while programming. Furthermore, for scientific/technical computing, we don’t need full concurrency semantics most of the time and so we are probably OK to not optimize Julia for manual reasoning/assurance of “hard” concurrent problems.

Yup, we can do this for exactly the reason DifferentialEquations.jl was successful: good higher-order function support. It should be possible to go beyond simple distributed iteration and do things like GPU-on-distributed, and also custom containers beyond simple arrays.

FLoops.jl (via Transducers.OnInit) can already kind of do this. But I think we need a more straightforward syntax and a good tutorial.
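
As a Base-only sketch of the pattern OnInit provides (per-task state initialized once, rather than shared or reallocated per iteration; the function and variable names here are my own):

```julia
using Base.Threads

# Sketch of the per-task-state pattern that Transducers.OnInit / FLoops
# provide: each task allocates its own scratch accumulator once, instead of
# sharing one (a data race) or allocating on every iteration.
function row_means(A)
    m, n = size(A)
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do cols
        @spawn begin
            acc = zeros(m)             # task-local state, initialized once
            for j in cols
                acc .+= view(A, :, j)  # accumulate this task's columns
            end
            acc
        end
    end
    return sum(fetch, tasks) ./ n      # combine the per-task accumulators
end

row_means(ones(3, 8))                  # == [1.0, 1.0, 1.0]
```

A dedicated syntax would let users write just the loop body and declare the initializer, instead of spelling out the chunking and combining every time.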

Yes, I personally agree with this. There are still a lot of possible improvements. Currently, people mainly worry about spawn/wait overhead. But I guess the next thing to come up will be concurrent GC compared to Go or Java. There is still a long way to go in terms of the base system and the ecosystem. But I believe we are on a good trajectory.

7 Likes

For your polynomial benchmark, SymEngine is slower than almost every system I tested (Mathematica, Maple, Nemo). Here are the timings on my two-year-old quad-core i7 laptop.

Mathematica 12.1:

In[1]:= p=Expand[(1+x+y+z)^20];                                                 
In[2]:= Timing[q=Expand[p*(p+1)];] (* time in seconds *)                                               
Out[2]= {0.694881, Null}

Maple 2020:

> p:=expand((1+x+y+z)^20):                                                     
> st:=time(): q:=expand(p*(p+1)): time()-st;
                                     0.074

Nemo (all polynomials are expanded by default):

julia> using Nemo, BenchmarkTools
julia> r, (x,y,z) = PolynomialRing(ZZ, ["x", "y", "z"]);
julia> p = (1+x+y+z)^20;
julia> @btime q=p*(p+1);
  3.658 ms (13 allocations: 316.35 KiB)

Finally, SymEngine accessed from the Julia interface:

julia> using SymEngine, BenchmarkTools
julia> @vars x y z
(x, y, z)
julia> p = expand((1+x+y+z)^20);
julia> @btime q = expand(p*(p+1));
  3.233 s (26725925 allocations: 212.47 MiB)

The Nemo timing is pretty surreal, but it’s supposed to be the state of the art, powered by recent versions of FLINT.

We’ve been using ThreadingUtilities.jl/CheapThreads.jl in Octavian.jl, and it works pretty well.

1 Like

Indeed, that function is powered by FLINT. Can you quickly test the recent versions of AbstractAlgebra.jl? It is slowly replacing FLINT throughout Nemo, but it seems it hasn’t replaced this portion yet. We were looking to use it for the polynomial representation of SymbolicUtils.jl, since it seems to greatly outperform MultivariatePolynomials.jl, but it would be interesting to place it along the scale you have going there. (Also SymPy, just for kicks.)

How did you get the impression that FLINT is being replaced in Nemo?

Here’s my timing for AbstractAlgebra.jl on the same machine:

julia> using AbstractAlgebra, BenchmarkTools
julia> r, (x,y,z) = PolynomialRing(ZZ, ["x", "y", "z"]);
julia> p = (1+x+y+z)^20;
julia> @btime q=p*(p+1);
  176.027 ms (39495 allocations: 1.45 MiB)

For SymPy, there is memoization, so I can only time it once using @time:

julia> using SymPy
julia> (x, y, z) = symbols("x y z")
(x, y, z)
julia> p=expand((1+x+y+z)^20);
julia> @time q=expand(p*(p+1));
365.774315 seconds (9.17 k allocations: 542.630 KiB)
1 Like

Just from watching AA.jl slowly begin to cover more and more of Nemo (as stated in the development roadmap, Nemocas/AbstractAlgebra.jl issue #64). If the plan is to never include specializations in Julia, that would seem a little odd given how much is getting covered there.

Thanks! Wow, looks like we should be making use of the FLINT wrapper for now.

The plan of AbstractAlgebra was to supplement the fast C implementation coming from FLINT, not to replace it. The idea is to provide generic functionality which is not covered by FLINT, e.g. polynomial rings over polynomial rings etc. Because the implementation is generic, one gets another implementation of e.g. Z[x,...] for free, this time using BigInt.

1 Like

Good to know. It would be nice to drop a binary but it’s not the most crucial thing in the world since the jlls seem to be pretty stable.

I was concerned SymPy might be getting a slow name here since it is being called through SymPy.jl, but I tested that this code, which only utilizes PyCall, has essentially the same timings over n in {2, 4, 8, 16}:

julia> using PyCall

julia> sympy = PyCall.pyimport_conda("sympy", "sympy");

julia> x,y,z = sympy.symbols("x,y,z")
(PyObject x, PyObject y, PyObject z)

julia> p = sympy.expand(sympy.Pow((1 + x + y + z), n));

julia> @time q = sympy.expand(sympy.Mul(p, sympy.Add(p,1)));

I don’t think PyCall introduces much overhead here (though it may), so I would expect the timings within a pure Python session to be similar.

Since one of the themes in this thread has been the composability of parallel Julia programs, I cannot help but comment that there is a concern about the composability of ThreadingUtilities.jl (and so CheapThreads.jl, which depends on it). There is a discussion in: Overhead of `Threads.@threads` - #16 by tkf. @Elrod said he had some ideas for fixing it, so it may be fixed or mitigated at some point, though.

I think composability for parallel Julia programs will be a community effort. For example, APIs with data races (e.g., a function that mutates global state without holding a lock) cannot be composed with parallel functions. A more subtle example is that, if a package manipulates the scheduling aspects of tasks too eagerly, it may not work well in the context of a large application. The basis of composability is the centralized management of computation resources (= CPU + cache). Of course, I think any “hacks” will be justified if they help you in the end. It is your software, after all. However, I cannot imagine a scenario in which the Julia ecosystem evolves into a composable parallel platform without trusting the core parallel task runtime.
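
To make the data-race example concrete, here is a minimal sketch (the `COUNTER`/`record!` names are hypothetical, purely for illustration):

```julia
using Base.Threads

# Sketch of the point above: an API that mutates global state is only safe
# to call from parallel code if the mutation is protected by a lock.
const COUNTER = Ref(0)
const COUNTER_LOCK = ReentrantLock()

function record!()
    lock(COUNTER_LOCK) do             # without this lock: a data race
        COUNTER[] += 1
    end
end

@sync for _ in 1:1000
    @spawn record!()
end
COUNTER[]                              # == 1000 on any thread count
```

Drop the lock and the API silently stops composing: callers that parallelize their own code get lost increments with no error.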

5 Likes

I think a really key point here, though, is that having a standard way to do composable threading is very different from having an optional one that you can use if you want to. It’s only really effective if everyone uses it. That’s why I don’t credit Cilk towards C/C++: not only do you need a special compiler and language extensions, but if one library uses pthreads and another library uses Cilk, the combination is not going to scale well. Same with other systems — a composable threading system is only effective if everyone buys in.

Go’s superpower as a language seems to be the fact that they implemented an incredibly good built-in task-based threading system and absolutely everyone uses it. It’s so important that the keyword for it is go — the same as the name of the language. That and they’ve optimized the heck out of it so that it’s really, really efficient and reliable and the garbage collector is very low latency despite the threading.

Reducing spawn overhead is definitely on the compiler team’s todo list, but lately it hasn’t been as high a priority as reducing compiler latency.

14 Likes

For what it’s worth, if I recall correctly (@chriselrod correct me if I’m wrong) in Octavian the spawn overhead was most noticeable for small matrix sizes, where every nanosecond counts when you are trying to match the GFLOPS of OpenBLAS and MKL. For larger matrix sizes, it was less of an issue.

Right. At the moment if you want to get good scaling, you need to spawn tasks that have enough work to do to make it worth spawning them and split the work up in such a way that there’s enough parallelism to get all the work done faster in the end. Reducing spawn overhead makes that easier to accomplish. In the limit, if you can reduce spawn overhead to zero, you can make tasks that do tiny amounts of work and utilize all possible parallelism. Of course that’s not possible but you get the idea.

3 Likes

We expect nothing less!

8 Likes