Blog post: Rust vs Julia in scientific computing

Yes, unfortunately. Do you know of any library trying to fix that? As far as I understand, handlers and restarts should not be difficult to implement via dynamic binding … Julia seems to have task_local_storage which might be similar?

Having attempted to write a moderate-sized scientific library in both Julia and Rust for a field I know very well (20+ years of experience developing software in the field), allow me to say that both languages have their challenges. Rust’s promise of “write once” code is very appealing. I’d like to write something once and move on. Robust multi-threading is great, and I learned a lot about what not to do from the Rust compiler. I like compilers that tell me what I’m doing wrong. However, many of the libraries I need just aren’t available. The infrastructure for scientific computing in Rust is insufficient at this time.
On the other hand, getting Julia to perform optimally can be a real challenge. Memory allocation/deallocation is my nemesis. Compared, let’s say, to Java’s memory management, Julia’s garbage-collected memory management is plodding. More often than not, my multi-threaded code gets bogged down in garbage collection. (Yes, I’ve spent a lot of time reducing memory allocation, but some of it, and sometimes a lot, is inevitable.)
In the end, I gave up on Rust because of the challenges of implementing dynamic dispatch behavior. Rust enums aren’t an adequate replacement. Also, there just don’t seem to be any decent cross-platform GUI toolkits.
But I keep on using Julia because it is a great way to try new ideas and algorithms in a dynamic environment with tabular data and plotting tools. Multiple dispatch is eye-opening. I love that each release hacks away at the TTFX problem, a real problem for my library, which has many dependencies, implements many algorithms, and requires a lot of data.

I totally agree. It is growing, but it is not comparable to the current ecosystem of Julia.

But remember that people did say, and sometimes still say, that Julia’s ecosystem is lagging behind Python’s. It is a “chicken and egg” problem.

I know what you mean. Iced looks very promising for the future and GTK bindings are very good, but Rust still has a long way to go regarding GUI.

If you want something functional but not “beautiful on each platform”, which is often the case in scientific computing, then I would highly recommend using egui.

I did write a GUI for an Ising simulation with egui:

I don’t think that the GUI situation is any better for Julia though.

Just a quick clarification. I also wish this were possible. In fact, I was just attempting to write a custom allocator a few days ago in pure Julia using Arrays. Given that we can pre-allocate + @view very efficiently, I thought that maybe we could also reshape and reinterpret a memory pool. But that is sadly not the case: I believe reshape allocates, and you can’t write into a reinterpreted array.
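For illustration, here is a minimal sketch of the pre-allocate + @view pattern I mean (the names are made up; this shows the idea, not a working allocator):

const POOL = Vector{Float64}(undef, 10^6)  # one big upfront allocation

function with_scratch(f, n, offset=0)
    scratch = @view POOL[offset+1:offset+n]  # a view into the pool, no new allocation
    return f(scratch)
end

with_scratch(100) do s
    s .= 1.0  # write in place
    sum(s)
end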

Also, in my case, I think I never write type-unstable code (maybe because I try not to define many types, if at all), but I do find the workflow of prototyping with allocating code and refactoring later a little bit annoying. So I was trying to come up with a useful tool to make that easier.
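To make that workflow concrete, here is a toy version of it (hypothetical function names):

addone(x) = x .+ 1  # the quick allocating prototype

function addone!(y, x)  # the non-allocating version you refactor to later
    @. y = x + 1  # writes into a pre-allocated buffer
    return y
end

x = rand(100)
y = similar(x)
addone(x)  # allocates a new array on every call
addone!(y, x)  # reuses y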

Another thing we can hope for is better compiler support for automatic stack allocation of MVectors, which are currently quite restrictive in how they can be used and end up on the heap anyway (they have to live within inlined functions, or you run into other problems related to GC.@preserve). If I remember correctly, this is solvable with a proper compiler optimization pass, but adding a feature like that to the compiler is very hard.
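For example (a sketch; whether the MVector actually stays off the heap depends on inlining and the compiler’s escape analysis):

using StaticArrays

function norm2_stack()
    v = MVector{3,Float64}(1.0, 2.0, 3.0)  # mutable, fixed size
    v[1] = 10.0  # mutation is the whole point of MVector
    return sum(abs2, v)  # if v escaped the function, it would be heap allocated
end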

This is extremely interesting. Can you provide any more information, like rough timelines, objectives, features, etc.?

Julia offers functions like zeros, ones and fill. But these have the overhead of overwriting the memory first.

This notion is too simplistic given how modern operating systems actually work. I would like to call your attention to calloc.

Here is the description from man 3 calloc on a GNU/Linux system.

The calloc() function allocates memory for an array of nmemb elements of size bytes each and returns a pointer to the allocated memory. The memory is set to zero.

In Julia, you can access this via Libc.calloc. However, I have made it even easier to use via ArrayAllocators.jl:

julia> using ArrayAllocators

julia> @time A = Vector{UInt8}(undef, 1024^3);
  0.008795 seconds (2 allocations: 1.000 GiB, 99.15% gc time)

julia> @time fill!(A, 0)
  0.215130 seconds (136 allocations: 5.688 KiB, 2.52% compilation time)

julia> @time sum(A)
  0.186580 seconds
0x0000000000000000

julia> @time sum(A)
  0.192126 seconds
0x0000000000000000

julia> @time B = Vector{UInt8}(calloc, 1024^3);
  0.007696 seconds (5 allocations: 1.000 GiB, 99.10% gc time)

julia> @time sum(B)
  0.244811 seconds
0x0000000000000000

julia> @time sum(B)
  0.140942 seconds
0x0000000000000000

julia> @time C = zeros(UInt8, 1024^3);
  0.328151 seconds (2 allocations: 1.000 GiB, 2.14% gc time)

julia> @time sum(C)
  0.218502 seconds
0x0000000000000000

julia> @time sum(C)
  0.143196 seconds
0x0000000000000000

Observations:

  1. The array creation times for A and B are about the same. The creation of C takes significantly longer.
  2. fill! takes a considerable amount of time. You are correct that zeros is just an undef-based memory allocation followed by fill!(A, 0).
  3. The initial sum(A) is faster than sum(B), but the difference is smaller than the time needed for fill!(A, 0).
  4. Subsequent calls to sum(A), sum(B), and sum(C) are roughly equivalent in time.

The situation here is more complicated than your explanation suggests. There is a hint in the GNU libc documentation of why this might be.

You could define calloc as follows:

void *
calloc (size_t count, size_t eltsize)
{
  void *value = reallocarray (0, count, eltsize);
  if (value != 0)
    memset (value, 0, count * eltsize);
  return value;
}

But in general, it is not guaranteed that calloc calls reallocarray and memset internally. For example, if the calloc implementation knows for other reasons that the new memory block is zero, it need not zero out the block again with memset. Also, if an application provides its own reallocarray outside the C library, calloc might not use that redefinition. See Replacing malloc.
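For those curious, the basic pattern can be sketched in plain Julia (a sketch of the idea, not the actual internals of ArrayAllocators.jl):

p = Libc.calloc(1024^3, sizeof(UInt8))  # zeroed memory straight from libc
p == C_NULL && error("allocation failed")
B = unsafe_wrap(Array, Ptr{UInt8}(p), 1024^3; own=true)  # own=true: Julia calls free when B is collected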

I am confused: you mean that you decided to settle on Rust?

Multithreaded GC is coming in v1.10; I wonder if it could improve scaling with multithreaded allocations.

This is a user problem, but a developer opportunity. If SciPy could do it, I bet some Rust crates can.

FWIW, my benchmarks were on Julia master.

StrideArraysCore.jl exists to make some of that easier.
You can reinterpret, setindex!, getindex, reshape, etc.
It also provides an @gc_preserve macro, which needs work. This macro should GC.@preserve all arguments to a function call and try to take PtrArray views of them. This can help when using MArrays, as it technically prevents them from escaping (and thus from being heap allocated).
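A minimal usage sketch (assuming the callee is generic, since the macro substitutes PtrArray views for the arguments):

using StrideArraysCore, StaticArrays

mydot(a, b) = sum(a[i] * b[i] for i in eachindex(a))  # generic, no ::MVector annotation

a = MVector(1.0, 2.0, 3.0)
b = MVector(4.0, 5.0, 6.0)
@gc_preserve mydot(a, b)  # preserves the MArrays and passes PtrArray views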

I almost never intentionally write type-unstable code, but I write only a tiny fraction of the Julia code that I run, and Julia’s compiler likes to give up (without telling you), which is why I can end up with things like https://github.com/PumasAI/SimpleChains.jl/blob/f028d69679d47f11d35e7f311abdf0d1d3bfab9c/src/SimpleChains.jl#L94-L111
With Julia, I’m faced with an endless fight against the language and ecosystem, or giving in and embracing mediocrity.
I mostly prefer keeping my opinions to myself (which is why I deleted an earlier comment), as they’re not constructive. There is a long list of better things to do to try and move things forward in a positive direction, than to spread negativity or rant online.

Looks like it doesn’t scale well despite doing more work. I’m surprised at how much more: 2.41s * 52.7 → 127s for only ~36x the garbage, but there’s probably math explaining that, given all the garbage lives in the same heap. Am I assuming correctly that you have a 36-core machine to share those threads?

Do you think escape analysis doing eager frees could replicate the performance of the manual frees in your benchmark? Not sure if that’s worth doing everywhere, but it seems very worth it in multithreading. This also reminded me to read up on RAII again; I still don’t know what happens there when heap-allocated data is returned from a scope.

By the way, there is a talk comparing Julia and Rust at the upcoming JuliaCon.

How good is Rust’s macro system compared with Julia’s? It’s subtle to use macros in a nested manner in Julia, according to a GitHub issue:

Since I don’t know Rust, I’m wondering if Rust handles this kind of situation better.

Moving on from this weird case: more generally, is it as easy to define DSLs in Rust as it is in Julia’s JuMP package?

No, not yet.

Actually, today I use Julia for algorithm development and reproducible data analysis (thanks, DrWatson) and then, once developed, I rewrite the algorithms in Java. Yes, this is a non-standard choice, but Java has come a long way over the past 20+ years, and it is a language I’ve become very comfortable with (including for GUI development).

Can’t wait to find out how the multithreaded GC will perform. Time to download a pre-release version.

juliaup now supports alpha versions

[quote=“Benny, post:139, topic:101711”]
Looks like it doesn’t scale well despite doing more work. I’m surprised at how much more: 2.41s * 52.7 → 127s for only ~36x the garbage,
[/quote]

I’m not sure. I thought it might be because of the generational assumption being violated in multithreaded contexts, but enabling GC.enable_logging(true) only ever reports incremental collections (which is good).

I’m on a different computer now than earlier (10980XE instead of 7980XE), but they’re basically the exact same CPU.

Note that these are both actually 18-core CPUs, so we have twice as much work per physical core in the multithreaded case. The mallocs coming in at less than 2x the single-threaded time means they’re getting really good multithreaded scaling.
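Concretely, with the MiMalloc numbers below, the perfect-scaling bound at twice the work per core is about twice the single-threaded time:

2 * 2.128713  # ≈ 4.26 s; the measured 36-thread MiMalloc time below is 4.21 s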

Baseline is similar on the 10980XE, except (surprisingly) it is a bit slower:

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
 21.603470 seconds (30.00 M allocations: 71.526 GiB, 11.78% gc time)
1.3620400542987349e10

julia> @time foo(LibcMalloc(), X, f, g, h, 30_000_000)
  3.164538 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @time foo(MiMalloc(), X, f, g, h, 30_000_000)
  2.128713 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @time foo(JeMalloc(), X, f, g, h, 30_000_000)
  1.976689 seconds (1 allocation: 16 bytes)
1.3620400542987349e10

julia> @show Threads.nthreads();
Threads.nthreads() = 36

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
222.812451 seconds (1.08 G allocations: 2.515 TiB, 59.32% gc time)
4.903344195475447e11

julia> @time foo_threaded(LibcMalloc(), X, f, g, h, 30_000_000)
  8.182727 seconds (222 allocations: 20.703 KiB)
4.903344195475447e11

julia> @time foo_threaded(MiMalloc(), X, f, g, h, 30_000_000)
  4.208087 seconds (222 allocations: 20.703 KiB)
4.903344195475447e11

julia> @time foo_threaded(JeMalloc(), X, f, g, h, 30_000_000)
  4.512129 seconds (223 allocations: 20.734 KiB)
4.903344195475447e11

julia> versioninfo()
Julia Version 1.11.0-DEV.142
Commit d1be33d4bc (2023-07-22 20:20 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 53 on 36 virtual cores

Now, enabling GC logging…

julia> GC.enable_logging(true);

julia> @time foo(GarbageCollector(), X, f, g, h, 30_000_000)
# huge wall of GC: pauses that look just like the below:
GC: pause 1.55ms. collected 45.875200MB. incr 
GC: pause 1.44ms. collected 45.875200MB. incr 
GC: pause 1.53ms. collected 45.875200MB. incr 
GC: pause 1.53ms. collected 45.875200MB. incr 
 22.042343 seconds (30.00 M allocations: 71.526 GiB, 12.45% gc time)
1.3620400542987349e10

julia> @time foo_threaded(GarbageCollector(), X, f, g, h, 30_000_000)
# the end contained single threaded GCs
# when we were down to 1 task, but the 
# bulk contained collections like:
GC: pause 67.41ms. collected 1397.212160MB. incr 
GC: pause 70.31ms. collected 1454.510080MB. incr 
GC: pause 73.16ms. collected 1324.771840MB. incr 
GC: pause 69.26ms. collected 1434.995200MB. incr 
GC: pause 70.86ms. collected 1469.299200MB. incr 
226.461490 seconds (1.08 G allocations: 2.515 TiB, 59.97% gc time)
4.903344195475447e11

They were all incremental (“incr”); none of the collections during these runs were full.
For the multithreaded case, my computer was at only 40% average utilization (according to btop).

Yes, I think that would let us replicate the performance of manual frees. We may even be able to do better in some specialized circumstances like this benchmark, by having fewer checks on a reuse fast path (one implementation of that could get! from task-local storage under the hood, and use a weak reference to allow the memory to be reclaimed).
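A rough sketch of what that task-local fast path could look like (entirely hypothetical, not an existing API):

function task_scratch(::Type{T}, n) where {T}
    # each task caches one buffer; get! is the reuse fast path
    buf = get!(task_local_storage(), :scratch) do
        Vector{T}(undef, n)
    end::Vector{T}
    length(buf) < n && resize!(buf, n)
    # a real implementation could store a WeakRef instead, so the GC
    # could still reclaim the buffer under memory pressure
    return view(buf, 1:n)
end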

It depends. Worst case scenario, it gets copied. In those cases, you can/should generally move the memory out, which means the destination takes ownership.

More commonly, (Named) Return Value Optimization [i.e. (N)RVO] should apply. When this optimization applies, instead of the callee both allocating and filling, the caller actually does the allocation and passes in a reference to the callee, which then fills it.

Very well explained!

And the central point is: Rust is a systems programming language, whereas Julia is application- and domain-oriented. These are fundamentally different objectives.

The whole comparison of @Mo8it is apples to oranges :man_shrugging:

@Mo8it Your post is getting more and more popular. It has been linked twice on Hacker News.
https://news.ycombinator.com/from?site=mo8it.com
