Where does the Julia ecosystem provide the greatest speedup, and where does it lag furthest behind (compared to e.g. Python)?

A. First on where Julia is ahead:

[EDIT: What are the greatest Julia (public) package success-stories, with speed-up factors? Feel free to also mention your proprietary non-public success…]

At best the speed is comparable to e.g. Fortran, C++, and Rust, and in some cases even faster than all of them. But what I have in mind are packages that are much faster on an algorithmic level: packages that could in theory be replicated in other languages, but just haven’t been yet.

What I know of: for differential equations, Julia can be way ahead. I’m just not sure under what conditions.

I saw, e.g., that Turing.jl claims it “is fast”; does that mean much faster?

JuMP is state-of-the-art, but some of the (proprietary) backends run at the same speed from other languages, while others are written in Julia.

B. On slowness, e.g.:

I recently heard about DataFrames.jl being 100 times slower than Pandas for joins (possibly it was only inner joins). I believe that’s fixed on the main branch; should a new tag be made? Some other speed fixes are in the pipeline as PRs. How close are we to comparable speed for the features already there (and are we far behind feature-wise, or does the functionality just live in other packages to use alongside)?
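(For concreteness, the operation in question, written against the documented DataFrames.jl API; a minimal sketch, not the benchmark itself:)

```julia
using DataFrames

# build two million-row tables and join them on the key column
left  = DataFrame(id = 1:10^6, x = rand(10^6))
right = DataFrame(id = 1:10^6, y = rand(10^6))
joined = innerjoin(left, right, on = :id)
```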

3 Likes

I think even if the question can be answered objectively, the situation changes too quickly for a thorough answer to be useful to future readers.

Most of the time it goes like this: some (advanced) user finds something slow/incomplete → files an issue → someone implements/fixes it.

On the second part of the question, I would optimistically cite other people’s view that it simply takes time to make a feature-complete package with good performance almost everywhere. But the observation has been that Julia code is relatively easy to improve, and improving it is less disruptive for developers (i.e. there is no need to write a new C/Fortran backend that does the same thing as the original Julia code just to improve performance).

8 Likes

Which one? The former is just for my curiosity, e.g. knowing Julia can be up to “5,850x” faster, which I find interesting: torchdiffeq (Python) vs DifferentialEquations.jl (Julia) ODE Benchmarks (Neural ODE Solvers) · GitHub

I’m basically fishing for such success stories, 10x faster or more. E.g. CSV.jl is 20x faster (maybe not always); something to get people excited about Julia. I know the CSV.jl speedup factor might go away/be replicated; I think the other case might be more elusive (as is the much greater functionality). I know there’s GitHub - SciML/diffeqpy: Solving differential equations in Python using DifferentialEquations.jl and the SciML Scientific Machine Learning organization, and that’s OK by me; I do not need to convert people fully to Julia. I can also help them by pointing out that the two can be used together, helping both communities.

The other question wasn’t so much for future readers; it was about knowing where most help is needed at present.

1 Like

… greatest speedup … (compared to e.g. Python)?

My bet for the medium and long term:

2 Likes

When I talk about Julia, I talk about three levels of performance improvements:

  1. Avoiding traditional dynamic language overheads. (see Why Are Languages Like Python and MATLAB So Much Slower than Julia? - #9 by mbauman)
  2. Implementing better algorithms. Using a productive high-level language that is also fast (see point 1) can enable rapid development of new and smarter algorithms.
  3. Exploiting parallelism and accelerators like GPUs.

These speedups are, of course, multiplicative. The ecosystems that see the greatest speedups will be those that are still stuck on level 1. Differential equations and CSV reading are both good examples of areas where level 1 has been “solved” in most languages, but Julia’s ecosystem is head and shoulders above in levels 2 and 3.
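A minimal illustration of level 1 (my sketch, not a benchmark): a plain scalar loop like the one below compiles to the same kind of tight machine code you’d get from C, with no per-element boxing or dispatch.

```julia
# level 1 in miniature: type inference removes all dynamic overhead,
# so the loop body is a bare floating-point add
function mysum(xs::Vector{Float64})
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

mysum(rand(10^6))
```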

18 Likes

Where Julia tends to do the best are cases like:

  1. Higher-order functions. Being able to compile the user’s function into the library helps a ton. DifferentialEquations.jl, for example, exploits this all over the place; the only way to beat it is to statically compile against the library, which isn’t possible with standard dynamic-language function calls into C/Fortran. In addition, this is a case where the function call sits in the host library’s hot loop, which means you can be bound by that cost. Differential equations might be the worst-case scenario for many dynamic programming languages. (See the sketch after this list.)
  2. Scalarized arithmetic. While high-level libraries can do fine with “vectorization” and linear algebra, they can fall apart when the code is fairly scalarized, for example nonlinear operations like those in a differential equation or a nonlinear solve. This is the vast majority of the reason the benchmarks against torchdiffeq look so big: PyTorch optimizes vector operations a few orders of magnitude better than it optimizes scalar operations.
  3. Dynamic code representation and code generation. It can be really hard to do metaprogramming and codegen in a language that isn’t built for a compiler to easily understand. Because Julia’s compiler has to do type inference, a lot of simplicity is baked into the language’s structure to allow such compiler analysis to work. Zygote.jl, for example, exploits this; otherwise, source-to-source transformation on the language IR itself would be extremely difficult.
  4. Task-based multithreading. It’s an easy way to get parallelism out of cases that are very difficult to parallelize by hand. The only real competition here is other task-parallel languages like Go, or someone using Cilk. IIRC some of the CSV and JSON readers are fast in serial but really break records when parallelism is involved. SymbolicUtils.jl, and thus Symbolics.jl, also make heavy use of task-based parallelism, since symbolic computations involve a lot of recursive “parallelize all tasks in this tree” work.
  5. GPU codegen. Of course, CUDA.jl also exploits this. This doesn’t help the ML frameworks so much, because if you are matmul-dominated then :man_shrugging: nothing really matters, but fusing x.*y.*z inside a PDE from 2 kernel launches down to 1 makes it much easier to keep the GPU saturated. So I’d put BifurcationKit.jl as an example of a really good library exploiting this kind of feature.
  6. Recursive multiple dispatch. You pass DiffEq a Jacobian type that is a BlockBandedMatrix, and inside DiffEq the special overloads for a BBM get used, because that’s the matrix it happens to be. This is a powerful feature because it means that many of the best features of DifferentialEquations.jl, such as ComponentArrays.jl, are simply outside extensions implemented through the type system. This doesn’t directly give speed, but it makes it easier to get the optimal algorithm by exploiting community resources outside the development of a single package.
  7. Just better algorithms. Being able to iterate in the same language that you deploy in is still underrated.
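
As a sketch of point 1: in the standard documented DifferentialEquations.jl usage below, the user-supplied right-hand side is an ordinary Julia function, and the solver specializes on it, so the call inlines into the integrator’s inner loop.

```julia
using OrdinaryDiffEq

# the RHS is a plain Julia function; the solver compiles against its type,
# so this call has no dynamic-dispatch cost in the hot loop
f(u, p, t) = 1.01 * u
prob = ODEProblem(f, 0.5, (0.0, 1.0))
sol = solve(prob, Tsit5())
```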

[If you can’t tell from this list, I choose my research projects by first finding a competitive advantage and then exploiting it :wink:]

27 Likes

Here’s my 30x success story: Julia's applicable context is getting narrower over time? - #5 by Satvik , as someone with years of experience with numpy and ~0 experience with Julia at the time.

Currently I find that Julia really shines anywhere you have a memory bottleneck. It’s also worth noting that profiling is significantly easier in Julia, because it’s “Julia all the way down” – so even in cases where I could achieve the same speedup in Python, it was much easier to find that speedup in Julia.
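For example (a minimal sketch with the standard-library profiler):

```julia
using Profile

work(n) = sum(sqrt, rand(n))

# sample the workload; every frame in the report is a Julia stack frame,
# so the hot spot is easy to see ("Julia all the way down")
@profile for _ in 1:100
    work(10^6)
end
Profile.print()
```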

14 Likes

That’s a good and under-appreciated point. Julia’s structs are just as efficient as C structs. This is a noticeable contrast from most OO languages where objects are typically at least somewhat bloated.
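A minimal sketch of what that means in practice:

```julia
# an immutable struct of plain bits carries no object header or pointers
struct Vec3
    x::Float64
    y::Float64
    z::Float64
end

isbitstype(Vec3)               # true
sizeof(Vec3)                   # 24 bytes, same as the equivalent C struct
v = Vector{Vec3}(undef, 10^6)  # one flat, contiguous 24 MB buffer, no boxing
```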

11 Likes

I’ll keep pointing people to your work and answers like that one.

I realize Julia’s advantage is being a better dynamic language, but if I had to play devil’s advocate, people will say “I can get the same speed in C++, or D, or Nim, or Rust”, and they would be right in theory for the (better) statically-typed languages; they all have the same performance ceiling. You don’t have to sell me on Julia, and I share your view on C++; I know you wouldn’t have done what you did with it.

I’m thinking of outsiders: numbers sell the language, or should I say the ecosystem/individual packages (and “Pandas 100x faster” doesn’t… I just know it’s possible to fix, and probably already is). While I really want to hear about the packages, such as BifurcationKit.jl, I’m curious whether those of you posting also have some numbers. I realize maybe I wasn’t specific enough in asking for the “greatest speedup”.

E.g. I found this intriguing new algorithm, see sequencer.org (there are also pictures there), that might even help me at work. I found it through GitHub - turingtest37/SequencerJ.jl: Julia-language port of the Sequencer algorithm, originally developed in python (https://github.com/dalya/Sequencer). The Sequencer finds trends in 1-dimensional data sets and has been used by its original authors for data analysis in astrophysics, seismology, image processing, etc. Contributions are welcome! but it’s a reimplementation; the original is in Python. We are not going to sell the language by pointing to reimplementations unless they are faster.

I noticed this now (while looking up why PyDSTool.jl is no longer recommended), and hadn’t realized, with up to 200 packages…, how extensive this work is:

is ComponentArrays.jl now part of DifferentialEquations.jl?

Nope, but it composes so well and people use it with DiffEq so often that it is essentially a feature at this point.
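A minimal sketch of that composition (standard usage of both packages; the toy dynamics here are my own illustration, not anything from DiffEq itself):

```julia
using ComponentArrays, OrdinaryDiffEq

# a toy harmonic oscillator whose state has named components; DiffEq
# accepts it because a ComponentArray is just another AbstractArray
u0 = ComponentArray(pos = [0.0, 1.0], vel = [1.0, 0.0])
f(u, p, t) = ComponentArray(pos = u.vel, vel = -u.pos)
prob = ODEProblem(f, u0, (0.0, 1.0))
sol = solve(prob, Tsit5())
sol[end].pos  # the named access survives the solve
```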

We ended up knocking down our dependencies by a lot soon after:

https://juliahub.com/ui/Packages/DifferentialEquations/UQdwS/6.16.0?t=1

Removing all Python dependencies and such. It has helped the load time of DiffEq and helped the dependency maintenance. Now we’re down to only about 10-20 core packages of DiffEq, leading to 120 dependencies through all of the solvers. Try 1.6-rc1, do ]add DifferentialEquations, and watch the precompile list fly by :laughing:
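Concretely (a REPL transcript sketch; the prompts are illustrative):

```julia
# press ] at the julia> prompt to enter Pkg mode
pkg> add DifferentialEquations
# Pkg on 1.6 then precompiles the dependencies in parallel,
# printing each package in the progress list as it finishes
```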

1 Like

The multithreaded precompile is so satisfying to watch

9 Likes

It’s the next generation of ASMR.

10 Likes

None of these languages offers the composable task-based multithreading model that Julia, Go, and Cilk do. Yes, you can hand-code something with optimal performance in C++ if you work at it enough, but it’s really hard to do. The kind of parallelism that Symbolics.jl uses, for example, would be very complex in any of those languages and would involve setting up, managing, and passing around thread pools and other such tricky business. Even then, very few thread-pool implementations have sufficiently low overhead to exploit that kind of granular parallelism efficiently enough to provide speedups.

In Julia (like Go), on the other hand, you just identify parts of the computation that could be concurrent, put @spawn (or go) in front of them, and let the scheduler do its thing. The very low-overhead built-in task scheduling and work-stealing implementation gives great scaling in many cases without any more effort. With some additional compiler work on reducing task overhead, this will get even faster without users having to change their code.
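A minimal sketch of what that looks like (the classic recursive-Fibonacci toy, not a serious workload; requires Julia started with more than one thread):

```julia
using Base.Threads: @spawn

# spawn one branch of the recursion as a task and let the scheduler
# balance the work across threads
function fib(n::Int)
    n <= 1 && return n
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end

fib(20)
```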

13 Likes

To clarify the discussion: Cilk might sound like some obscure language, and according to its page the Cilks are languages, but in reality they are extensions to C/C++:

What matters is whether it’s used (or whether some library uses it; that library’s users might still write standard C++ in their own code).

An argument against could be that it’s then no longer standard C++ (for the whole application), but that might not matter, and it isn’t really a fair criticism, as Julia isn’t standardized either.

Another language made for parallelism is Chapel (it used to be Cray Chapel; Wikipedia still has the outdated logo). Is there anything important in that language (or any other you can think of) that you think makes parallelism easier than in Julia?

More importantly, are we in practice far behind some other (parallel) code, or ahead?

It seems to me the obvious comparison for speed would be Common Lisp systems, where a function, method, etc. can be compiled immediately and in place. By adding (optional) declarations restricting types (like fixnum, in-line …), setting different values for the compilation parameters (debugging, speed, space), using timing and profiling tools, and “disassembling” code (that is, looking at the assembler that is generated), you have the kind of control (if you have the patience) to get your “inner loop” code to look good. There are basically plenty of opportunities for compiler jockeys to identify optimizations.

If Julia advocates are claiming great speedups by contrast with an interpreted Python system (is SymPy mostly interpreted?), it’s maybe not so impressive.

About FOSS development, I have certain reservations. I think that the context of computer algebra systems is different from that of some other projects: “deeper” in some sense. This means that adding relatively inexperienced volunteer programmers, with limited exposure to advanced mathematics and computing education, can have a positive effect on the volunteers; less often does it have a positive effect on the project. Leadership is critical here, and I suspect quite difficult.

Regarding exposure to past projects, especially retrospective reviews of computer algebra systems, I re-read my paper, originally written in 1982, revised slightly in 2001, about Macsyma. I don’t know if it was fed into the hopper of project input, but here it is (free)

It is (gasp!) 40 years old, but may still be informative. I also wrote a review of Mathematica, in case you care…

I look forward to good things written in Julia, and wish you the best of success.
Regards
Richard Fateman

9 Likes

MixedModels.jl is faster than R’s lme4.
Avalon.jl is supposed to be even faster than Flux.jl.
Yota.jl is faster than Zygote.jl and JAX.
Grassmann.jl is a very fast differential geometric algebra package.
Soss.jl is a very fast probabilistic programming package.

But Julia is slower than R’s data.table for some operations, for example joining data frames.
And Julia lacks many important packages, such as ones for meta-analysis and multiple imputation.

2 Likes

In C++, I used to enjoy Intel TBB (especially the task system) or parallel runtimes like StarPU or PaRSEC. Both approaches offer very good performance (affinity handling, scalable allocators, …).

I have not made any experiments, and I wonder about the relative performance of Julia’s task-based parallelism and the minimal task granularity that can be achieved. Maybe @tkf has made such a comparison?

If this kind of fine-grained parallelism is very efficient in Julia, I think it opens up a huge set of potentially best-in-class scientific libraries that are nowadays (painfully) developed in low-level languages (C). The first example that comes to my mind is a direct sparse solver like MUMPS (http://mumps.enseeiht.fr/), where you have a combination of complex maths, low-level task parallelism, and a strong need for high-performance tiny/small matrix operations where @Elrod’s tools could really shine (https://oatao.univ-toulouse.fr/10506/1/weisbecker.pdf).
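For the tiny/small-matrix part, a hedged sketch of the kind of kernel I mean, here using StaticArrays.jl (not one of @Elrod’s packages, just an illustration of the small-matrix idea):

```julia
using StaticArrays

# fixed-size matrices are stack-allocated and the solve unrolls completely,
# which is what the small dense kernels inside a sparse direct solver need
A = @SMatrix rand(4, 4)
b = @SVector rand(4)
x = A \ b  # allocation-free small dense solve
```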

A second field where fine-grained parallelism is mixed with high-level math is mathematical optimisation solvers (simplex, branch-and-bound, interior point).

IIRC Chapel doesn’t have a ton of multithreading features; it’s mostly about distributed programming. They do distributed better than Base.Distributed, that’s for sure. In fact, I would say that we have a lot to improve in distributed computing, which is why I didn’t add it to the list above. But a lot of their parallel iteration strategies are being added to the language through packages like:

So I’d say Julia is in a great spot with multithreading, but only okay for distributed. But “okay” for distributed already puts it into a small class of languages that even attempt to do something decent there.
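For reference, a minimal sketch of the Base.Distributed model in question (standard library only):

```julia
using Distributed
addprocs(4)  # four local worker processes

@everywhere work(x) = sum(abs2, rand(x))  # define the function on every worker
results = pmap(work, fill(10^6, 32))      # farm the calls out across workers
```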

Python is a common target and gets benchmarked against because many users want to know, but the real meaty benchmarks are against C, C++, or Fortran libraries. Take for example SciMLBenchmarks. Sure, there is one benchmark against Python:

https://benchmarks.sciml.ai/html/MultiLanguage/wrapper_packages.html

But the vast majority of the benchmarks are directly against the C++ or Fortran differential equation solvers, like:

https://benchmarks.sciml.ai/html/NonStiffODE/Pleiades_wpd.html

https://benchmarks.sciml.ai/html/StiffODE/Hires.html

Similarly, Symbolics.jl’s benchmark targets are things like SymEngine and other C++ libraries, not SymPy. We plan to time against REDUCE as well, though from our past encounters with it we expect to outperform it fairly easily. SymPy is a performance punching bag, but including it is a modern requirement: >90% of the scientists I know have only ever used MATLAB or Python, and the Python users among them have only ever used SymPy, so that is the only baseline many people know. Given the circumstances, SymPy has to be in any CAS benchmark.

8 Likes