Did Modular just reinvent Julia?

Fair point!

Now my time is consumed by this thread. I’m doing it wrong :wink:

Seriously: I find this discussion, not Mojo, quite interesting. Thanks, go on!

5 Likes

Totally agree: wanting “script-friendly” behavior from small binaries is self-contradictory.
(If you want to interact with the script, go back to the file + REPL.) So it would be reasonable to expect a stricter mode for compilation.

To stay on topic: I don’t know whether Modular just reinvented Julia, but it has certainly triggered a lot of feedback and reactions aiming to “reinvent”(*) Julia :upside_down_face:

(*) kidding

4 Likes

Thinking about it some more, there might be another angle from which to look at this.

Julia’s vision for solving the two-language problem was to have a single language that is both dynamic and highly productive, and also fast. It was reasonable to believe that people would embrace that vision, and hence embrace the Julia runtime.

Mojo’s take on the problem is different.
The assumption is that people like their current language (mainly Python).
The issue is the gap between the coding skills needed to use the host language (the preferred one) and those needed for the languages used to generate the accelerated code.
For Python, that means C / C++ / Cython skills.

So, in my opinion, Mojo will also be developed in that direction as a second step: generating functions / packages that accelerate Python, with support for any accelerator reachable through MLIR.
What we see now is the first step, embracing Python within Mojo. In the next step we’ll see Python embracing Mojo, with many packages rewritten in Mojo.
Think, for instance, of NumPy, SciPy, etc. having built-in support for all accelerators from a single codebase.

Thinking strategically, this is something Julia could achieve as well. Maybe even faster?

1 Like

The classic PyTorch (i.e. PyTorch 1) doesn’t go through the paths in this diagram.

2 Likes

Neither does the “current” PyTorch 2.0. It turns out things are a little more complicated though.

As of 2.0, PyTorch comes with a “go fast button” in the form of torch.compile. This can interface with a number of different backends, but the default one on GPU uses GitHub - openai/triton: Development repository for the Triton language and compiler to generate kernels. As it happens, Triton recently switched to using MLIR internally for its compiler, and to my knowledge has seen some notable performance uplifts post-switch. In turn, PyTorch’s compiled mode can improve performance pretty significantly over the old, non-compiled eager-mode path.

There are more places where PyTorch 1 and 2 plug into a compiler stack which uses MLIR (e.g. the various forms of XLA integration), but I feel this default GPU backend is the most salient one because it mirrors what one might want to do in Julia. For example, Tullio.jl’s GPU support currently involves generating KernelAbstractions.jl kernels. To my knowledge nobody has benchmarked Triton and KernelAbstractions side-by-side, but having seen benchmarks of both against similar baselines (e.g. cuBLAS) I wouldn’t be surprised if the former can deliver better performance given the same amount of time to write a kernel. How much of that is due to using MLIR? Hard to say. Is being dismissive of this tech still a good idea? Probably not.
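To make the Julia side concrete, here is a minimal sketch of the Tullio.jl route (the package names are real; the hedged part is the claim about backends, since which kernel Tullio generates depends on what else is loaded, e.g. LoopVectorization.jl for CPU or CUDA.jl + KernelAbstractions.jl for GPU):

```julia
using Tullio   # einsum-style macro that generates the loop nest / kernel

A = rand(Float32, 256, 256)
B = rand(Float32, 256, 256)

# C[i,j] := sum over k of A[i,k] * B[k,j]; Tullio writes the loops for us.
# Loading LoopVectorization vectorizes this on CPU; loading CUDA +
# KernelAbstractions is supposed to turn it into a GPU kernel instead.
@tullio C[i, j] := A[i, k] * B[k, j]

C ≈ A * B   # true
```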

7 Likes

It does use MLIR, as mentioned above.

This isn’t even my take; it comes from conversations with Flux devs. If you want to recreate an entire ML compiler ecosystem on par with MLIR from scratch, be my guest, but that seems harder than borrowing existing compiler infrastructure.

3 Likes

Agreed! Couldn’t have said it better myself.

I disagree about correctness being provided only by unit tests. A versatile but strict type system is such a luxury. Compared to Rust, I have to write so many more unit tests in Julia and hit so many more runtime errors. Rust-analyzer can immediately tell me if I messed up a function call. The monadic types can immediately tell me if some operation has a high chance of failure. Enumerations are really nice for modeling data, too.

I feel like once I have the correct design idea in Rust, there’s so much less mental overhead as compared to Julia.

4 Likes

Other kinds of tools exist even in languages that don’t have a strict type system, e.g.

2 Likes

In some regards, yes; in others, I would argue there is more overhead. For example, there is the added mental overhead of explicit casts like as i32 in Rust, as opposed to Julia’s automatic promotion.
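To illustrate the contrast, a small sketch of Julia’s promotion rules doing that bookkeeping automatically (the halve function is a hypothetical toy, just for illustration):

```julia
# Mixed-width arithmetic promotes automatically; no cast at the call site:
x = Int32(3)
y = 2.5
x + y              # 5.5; the Int32 is promoted to Float64

# The same applies to user code: one generic method covers all Real types.
halve(a::Real) = a / 2
halve(x)           # 1.5, no convert/cast needed
halve(y)           # 1.25
```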

Rust does not strike the right balance of mental overhead for me, but I do appreciate its compiler hints a lot.

1 Like

I was curious and also got access to the Mojo playground.

One of the things I still do not understand is how programmers should decide where to use def/class and where to use fn/struct. For sure, if you want performance, you’d go for fn/struct; otherwise, if you don’t care, you use the good old def/class. But how should someone structure the data and logic inside a software package?

The main problem here, in my opinion, is that with def/class you not only sacrifice performance, but also increase architectural complexity and reduce interoperability inside your own application (e.g. I don’t see how class instances would work in an fn context, or def calls inside fn). This effectively means you introduce the two-language problem within a single language, and on top of that you’ll have to deal with all the other two-language-infected Python constructs that you might want or need to use in your Mojo code.

The other thing is how the package system and distribution will work and interoperate with existing Python packages and Mojo (hello pip, conda, anaconda, miniconda, pyenv, poetry, cockroach, pipenv, deadsnakes, hatch, flit, pdm, along with easy_install, setuptools, wheel, … and all the other cool stuff coming 2023+).

Again, as many others pointed out, it’s way too early to draw conclusions. We’ll see…

18 Likes

So, does this mean that the real problem is that it isn’t easy enough to attach C / C++ / Rust code to the existing Python language? To the extent that they want to write a new language where the Rust-like part is embedded in the Python code?

2 Likes

There are multiple aspects to two-language problems. The two-language problem I’m most familiar with (even if it’s some time back now) is extending Matlab with “mex” files written in C. The pain points of that can roughly be split into four parts:

  1. A different language being used for the extension.
  2. Boilerplate code to interface input/output data between the languages.
  3. Complications with C tool chains and/or distribution of compiled artifacts.
  4. Tooling like debugger and profiler not working past the language border.

How significant each of these is will vary from person to person, but for me the existence of two languages was always the minor problem. The major problem was that connecting them was so far from seamless.

4 Likes

I have often made the case that, particularly from the perspective of someone inexperienced in lower-level programming, idiomatic Julia[1] and performance-optimised Julia read or feel like different languages.

The key difference between the “two-language” problem I present and the one referred to in Julia’s raison d’être is precisely that the transition between the two “languages” of Julia is seamless. I can write part of my code in nice-looking Julia, and I can write performance-critical parts of my code using esoteric mystical incantations. Then I can debug, profile, differentiate, transform, or do anything I want to either part of the code without needing to worry about whether one tool will work for both “languages”.

Julia’s range from high-level niceties to low-level spellcasting is perfect for me, a scientific programmer who wants things to run fast and look nice. Rust may be better for some, Python for others; their ranges are different. With two languages you get a wider range but more pain points; with Julia I get a range that is wide enough and comes painlessly.
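To make the two “languages” concrete, a hedged sketch of the same computation in both styles (the exact speed gap depends on the machine and the Julia version):

```julia
# High-level "language": short and idiomatic
sumsq(xs) = sum(abs2, xs)

# Low-level "language": an explicit loop with bounds checks
# removed and SIMD hinted, but still plain Julia
function sumsq_fast(xs::AbstractVector{T}) where {T<:Real}
    s = zero(T)
    @inbounds @simd for i in eachindex(xs)
        s += abs2(xs[i])
    end
    return s
end

xs = rand(10^6)
sumsq(xs) ≈ sumsq_fast(xs)   # true; the same debugger/profiler works on both
```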


  1. Particularly idiomatic functional-style Julia, although this is improving (as am I as a programmer). ↩︎

26 Likes

That can be true, but on a scale different from interpreted languages. For instance, in this thread a highly optimized version obtains “only” a 6x speedup relative to the most natural implementation. Not 20x, not 200x. The code without the optimization is still useful, and that optimization was a last resort. With other languages one simply cannot proceed without rewriting the critical parts, because the code becomes too slow even for testing.

22 Likes

As others have said, it is extremely early to judge how things will turn out for Mojo, but one thing I hope is that this whole affair ends up raising the bar on the Julia side.

I think the 1.9 release is a game-changer with regard to TTFP (time to first plot), a major pain point that is now nearly gone for a lot of packages (and could be gone for Makie as well, or so I’ve read).

It’d be good to know what other improvements are being planned… maybe there is a roadmap document that I’m missing? Not only would it keep our hopes high for Julia; a better-known roadmap would also allow us to stop worrying about some present inconveniences.

For example, I’ve been worried about how to get stack-allocated arrays in Julia, and how not having them could impact performance. Then I stumbled upon StrideArrays.jl, and I’ve read a comment (I believe by @Mason) saying that with improved escape analysis in the compiler some variables could become automatically stack-allocated in the future, and that StrideArrays will then work much better (I found it quite hard to use because of the need to bypass the GC, etc.).
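(As an aside, for small arrays of fixed size there is already a way to get stack allocation today; a minimal sketch using StaticArrays.jl:)

```julia
using StaticArrays

# SVector/SMatrix are immutable, fixed-size isbits types, so the
# compiler can keep them on the stack (or in registers), with no GC pressure.
v = SVector(1.0, 2.0, 3.0)
M = @SMatrix rand(3, 3)

M * v   # allocation-free linear algebra for small sizes
```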

I’ve also read that the runtime (including the GC) will become modularized, maybe allowing PackageCompiler to select which modules to include in binaries (I’m not sure I got this right, though)… For me, this could mean that I can just wait until this improvement comes out and stop worrying about whether my code is compatible with StaticCompiler or not, as I really like the idea of being able to produce small binaries and shared libraries, and also to target WebAssembly (as recently demonstrated by StaticCompiler).
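(For reference, a minimal sketch of what a small standalone binary looks like with StaticCompiler today, adapted from its README; the API may well change:)

```julia
using StaticCompiler, StaticTools

# c"..." from StaticTools gives a C-style string that needs no Julia
# runtime, which is what makes a small standalone executable possible.
hello() = println(c"Hello, world!")

compile_executable(hello, (), "./")   # writes ./hello
```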

Anyway, those are my 2c; having a roadmap would be nice.

8 Likes

I’ve seen people claim that, but mostly I find that performant code is simple and idiomatic. For optimal performance, a straight loop with @views and @inbounds or @turbo looks simple enough to me.
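For instance, a hedged sketch of what that looks like in practice (@turbo comes from LoopVectorization.jl; the toy coldot function is just for illustration):

```julia
using LoopVectorization   # provides @turbo

# Dot product of two column slices, written as a plain loop:
function coldot(A, B, j)
    a = @views A[:, j]
    b = @views B[:, j]
    s = zero(promote_type(eltype(A), eltype(B)))
    @turbo for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

A, B = rand(1000, 10), rand(1000, 10)
coldot(A, B, 3) ≈ @views A[:, 3]' * B[:, 3]   # true
```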

10 Likes

I somewhat buy it. A lot of Julia code starts out as chains of higher-level, vectorized operations. I’m including fused in-place broadcasts and functions like map! here too. Oftentimes people go from those straight to nested loops, as can be seen in help threads on this very forum. I personally don’t feel the difference in mental model etc is insurmountably large, but it does exist.

Given this is a topic about Mojo, however, I think another interesting angle to explore here is performance portability and how it trades off with various programming paradigms. The status quo in Julia is that loopy code plus all the associated go-fast macros (@simd, @turbo, etc.) are CPU-only, while chaining vectorized functions may be slower but works on GPUs and other accelerators. Yes, you can technically write cross-device kernels, but that’s a completely different paradigm, and one which lacks many of the creature comforts people have come to expect from the aforementioned ones (when was the last time you had to worry about thread synchronization using broadcasting or @turbo?).
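To make the portability trade-off concrete, a hedged sketch (CUDA.jl is assumed for the GPU half; any other GPU array package would illustrate the same point):

```julia
using CUDA   # assumption: an NVIDIA GPU and CUDA.jl are available

# A fused broadcast: one definition, device-agnostic.
axpy!(y, a, x) = (y .= a .* x .+ y)

x, y = rand(Float32, 10^6), rand(Float32, 10^6)
axpy!(y, 2f0, x)                 # runs on the CPU

xd, yd = CuArray(x), CuArray(y)
axpy!(yd, 2f0, xd)               # the very same code runs on the GPU

# The loopy, go-fast-macro version is CPU-only:
# @turbo for i in eachindex(x, y); y[i] = 2f0 * x[i] + y[i]; end
```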

I’m still not sure how I feel about the specifics of Mojo’s programming model, and nobody knows how well it’ll work on non-CPUs, but it’s really nice to see someone trying to find more common ground between these disparate programming paradigms. If they can pull this off, I think it’s less a question of whether Julia-the-language can do the same (e.g. one could imagine porting this tutorial 1-1) than of what is required behind the surface-level syntax to make it work (going beyond @tturbo, if you will).

6 Likes

I’ve been looking into MLIR and LLVM recently, and it looks like MLIR was meant to solve exactly this problem. For example, see the talk by Harsh Menon at https://youtu.be/VFexAjUoTZI.

Basically, as I understand it, MLIR makes it easy to define code transformations and constraints. Given some input code, you map a bunch of transformations over it, such as loop unrolling. Since the transformations are just a set, you can define different sets for different targets, as Harsh shows in the video: he shows generated code for both a CPU target and a GPU target.

4 Likes