Did Modular just reinvent Julia?

Could you give more details on this “something”? What would the Mojo solution look like?

Note that my question was not about the maximum performance you can get from Mojo, but about the possibility of writing simple code and having it automatically optimized.

So far I have seen these Mojo examples, and this Julia example. Some observations based on that:

  • In Mojo I can write optimized code. For example, starting with the non-optimized

    def matmul_untyped(C, A, B):
        for m in range(C.rows):
            for n in range(C.cols):
                for k in range(A.cols):
                    C[m, n] += A[m, k] * B[k, n]
    

    I can use SIMD by replacing the last line with

    C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,n)).reduce_add()
    

    I still need to handle the scalars that don’t fit the SIMD width. To solve this problem I can wrap this line into a small dot function and call it with vectorize[nelts, dot](A.cols).

    I can parallelize my code by rewriting it as a function that operates on single rows and calling it with parallelize.

    I can implement some manual tiling in my for loops.

  • In Julia I can write the non-optimized A[m,k]*B[k,n] and get all these optimizations automatically by adding a @turbo call in front of the main for loop (see the sketch after this list).

    The @turbo macro from LoopVectorization will basically rewrite my code to apply the same optimizations as in the Mojo example. (The tiling optimization is not currently implemented in LoopVectorization but it could be. And it seems the LoopVectorization version without tiling is faster than the Mojo version with tiling for some reason).
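
For concreteness, here is a minimal sketch of that @turbo version (the function name and loop bounds are mine, and C is overwritten rather than accumulated into):

using LoopVectorization

function matmul_turbo!(C, A, B)
    # The same naive triple loop; @turbo rewrites it with SIMD, unrolling,
    # and register tiling, and handles the remainder iterations that don't
    # fill a full SIMD vector.
    @turbo for m in axes(C, 1), n in axes(C, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end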

I’d love to see an example of how comptime (or another Mojo/Zig feature) can be used to do this kind of automatic optimization of simple code, the way LoopVectorization does it in Julia.

(The successor of LoopVectorization being written in C++ is interesting, but a bit orthogonal to what is possible in Mojo and Julia.)

2 Likes

In my opinion, the main thing I’m looking for in a next-gen language that solves the two-language problem is a compiler that ensures that type-stable code has top-notch performance.

For example, I think it is too hard to write differentiable code using arrays that doesn’t allocate in the forward and backward passes. Some macro like @propagate_inbounds should exist that triggers automatic preallocation in functions. With that, performant coding in Julia would be as easy as in PyTorch.

IMO, this is the main thing holding Julia back. More so than not being able to compile to an executable.

I do like the fact that you can define two types of functions: one like Julia’s, and one that errors when it isn’t type-stable. Julia should also add this. These functions should also be compilable to an executable.
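
The closest existing approximation in Julia is probably Test.@inferred, which errors at a call site when the return type is not inferred concretely; a minimal sketch (a per-call check, not the language-level guarantee I mean):

using Test

unstable(x) = x > 0 ? x : 0.0        # returns Int or Float64 depending on x
stable(x)   = x > 0 ? float(x) : 0.0 # always returns Float64

@inferred stable(1)    # passes: inferred return type is Float64
@inferred unstable(1)  # throws: inferred return type is Union{Float64, Int64}
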
But other than that, it doesn’t seem that Mojo is improving on anything else in a significant way.

6 Likes

Related article: fast.ai - Mojo may be the biggest programming language advance in decades

3 Likes

I don’t really know what the significance is, but Mojo apparently has a similar relationship to MLIR as Julia has to LLVM. If someone can explain the ramifications of this, I’d be interested to hear.

6 Likes

You can use cache-oblivious tiling instead: Julia matrix-multiplication performance - #12 by stevengj
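
The idea, in a minimal sketch (the cutoff and names are illustrative, not taken from that post): recursively split the largest of the three dimensions until the subproblem fits in cache, which tiles every level of the memory hierarchy without knowing any cache sizes.

function cache_oblivious_mul!(C, A, B, cutoff=64)
    # Accumulates A*B into C, so zero C first with fill!(C, 0).
    m, n = size(C); k = size(A, 2)
    if max(m, n, k) <= cutoff
        # Base case: small enough to be cache-resident; plain loops.
        @inbounds for j in 1:n, l in 1:k, i in 1:m
            C[i, j] += A[i, l] * B[l, j]
        end
    elseif m >= n && m >= k   # split the rows of C and A
        h = m ÷ 2
        cache_oblivious_mul!(@view(C[1:h, :]), @view(A[1:h, :]), B, cutoff)
        cache_oblivious_mul!(@view(C[h+1:end, :]), @view(A[h+1:end, :]), B, cutoff)
    elseif n >= k             # split the columns of C and B
        h = n ÷ 2
        cache_oblivious_mul!(@view(C[:, 1:h]), A, @view(B[:, 1:h]), cutoff)
        cache_oblivious_mul!(@view(C[:, h+1:end]), A, @view(B[:, h+1:end]), cutoff)
    else                      # split the shared inner dimension
        h = k ÷ 2
        cache_oblivious_mul!(C, @view(A[:, 1:h]), @view(B[1:h, :]), cutoff)
        cache_oblivious_mul!(C, @view(A[:, h+1:end]), @view(B[h+1:end, :]), cutoff)
    end
    return C
end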

4 Likes

MLIR is a layer over LLVM. It’s literally hosted on the LLVM website: https://mlir.llvm.org/

It’s worth understanding Modular in the context of Chris Lattner’s life:

  • He created LLVM in grad school with his advisor.
  • This work led to the creation of Clang after he joined Apple.
  • Later he worked at Apple to build Swift.
  • He went to Google to work on ML compilers. His group built Swift for TensorFlow and also MLIR.
  • He left Google and moved around a bit.
  • His current startup is now creating Modular, which is Python++ that uses MLIR.
37 Likes

That’s interesting. But does it give Mojo some advantage over Julia? Can (or should) Julia ‘transition’ to MLIR? Is Julia ‘stuck with a lesser engine’? Or is this overblown, and Julia can reap the benefits of not being shackled by the Python legacy?

(I must admit I’m seeing this a bit in a competitive light, and feeling that if Mojo delivers on its promises, we will all be stuck with Python and its awkward syntax forever. I would never be able to recruit colleagues and customers to use Julia over Python :cry:)

25 Likes

Let’s split this into two points:

  1. MLIR is meant to simplify optimizations that are hard to do in LLVM (e.g. LLVM doesn’t have constructs for nd-array optimizations built in). Perhaps it’s useful to Julia, but Julia could always adopt it if it wanted to, since it’s just an additional stage between user code and LLVM IR.
  2. Mojo’s pitch is not that MLIR is important (it’s really an implementation detail, not a value add), but that all your existing Python code just works, and some of it gets a lot faster. That’s a pitch that an infinite amount of work on Julia as a language won’t rival, since Julia is not Python. But it’s a pitch that still needs to be tested in practice. The contrast between Julia and Mojo is really between two answers to a single question: “to get dramatically better experiences for scientific computing, do you need a new language or not?” Mojo is a bet that the answer is no; Julia is a bet that the answer is yes.
59 Likes

We’ve seen that people love Python so much they will bear any development hardship to keep using it. If Mojo delivers even some of its promises, the Python world will be in nirvana.

8 Likes

It looks like MLIR is a bit more than the target IR for Mojo. It seems to fill the role that Julia’s core plays when it calls into C and LLVM, but it also seems to be more integrated into the language. Maybe a wider class of devs is meant to interact with it:

@register_passable("trivial")
struct Int:
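    # The value field is a raw MLIR scalar type; Int is just a thin wrapper over it.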
    var value: __mlir_type.`!pop.scalar<index>`

    fn __init__(value: __mlir_type.`!pop.scalar<index>`) -> Self:
        return Self {value: value}

Mojo really does look like a new language to me. They do carry a lot of Python baggage. But it doesn’t look to me like stuff bolted onto Python, nor a DSL that shares Python syntax. The features (as far as I can tell from their “manual”) are integrated into a coherently designed language.

It looks curiously like a cross between a dynamic language and an explicit statically-typed language. Chris L repeats several times that they don’t want to rely on a smart compiler to optimize facile dynamic code. Rather they allow you to be specific (and provide syntax and features to require you to be specific) so that optimization is closer to guaranteed.

By “a cross between” I don’t mean a blend, but two approaches, almost two languages, that live side by side and interact well: def and fn, struct and class. It’s not clear how it will turn out, but it’s … interesting.

6 Likes

Does this mean that it seems to trade off performance and genericity, so you cannot have full performance without specifying concrete types?

1 Like

Most people “use” one language or the other for the packages.

Currently I think that the head start Python has in ML is what keeps it unbeatable.

But there are dozens of other languages in which people are happily developing interesting things, and hardly any of them will go away.

4 Likes

Indeed, but if one’s goal is to get colleagues and customers to use a particular language, then it doesn’t help much that the language survives or even thrives. Unless Julia makes major strides into the mainstream and wins massive mindshare, I’ll be forever stuck with Python/MATLAB/C++ at work.

I realize that it was always a long shot to hope that Julia would gain that much on Python, but developments like this make it even less likely. That’s why a project like Mojo makes me sad, not excited. Not more Python, aaargh.

I guess I’ll go read the Mojo pitch; maybe it can make me dislike Python less.

26 Likes

This is the second time that Chris Lattner has (sort of) bet against Julia: the first time was when he led the TF2 rewrite in Swift. It would have been nice to see Google throw some resources at Julia instead of reinventing AD in a language with poor support outside macOS and no scientific computing community.

The amount of progress the Julia stewards make with so few resources (in comparison) is proof of how well the underlying design of the language works. Most of the obstacles to making Julia a truly great development experience seem to be a matter of time and some moderate investment.

It saddens me that so much effort gets invested into Python: it’s fundamentally not well cut out for its niche. I also don’t think it is the pinnacle of programming language design, and I suppose even the Python community would have to agree, given how many projects exist to redefine its runtime or tack new semantics on top of it.

EDIT: I guess I sound a bit bitter, but I’d like to see Julia succeed and expand to new niches. I suppose this kind of effort to improve Python is still a net positive.

51 Likes

Related: We are a bit overwhelmed at the moment, but sure, this is something we'd eventua... | Hacker News

5 Likes

LoopVectorization does register tiling, not cache tiling.
Ideally, it’d do both. For 128x128 matrices like in their example, you don’t need cache tiling because the matrices already fit in cache.

Register tiling is essential for good performance on problems like these, and an optimization LoopVectorization is good at. For 128x128 matrices, LV should hit >=90% of theoretical peak flops, and of course handle most combinations of transposing the arrays fairly well.

For a few combinations, an optimization related to cache tiling is essential even at these small sizes, so performance will probably drop closer to 50% of the theoretical peak.

julia> using LoopVectorization

julia> function AmulB!(C,A,B)
           @turbo for n = indices((C,B),2), m = indices((C,A),1)
               Cmn = zero(eltype(C))
               for k = indices((A,B),(2,1))
                   Cmn += A[m,k]*B[k,n]
               end
               C[m,n]=Cmn
           end
       end
AmulB! (generic function with 1 method)
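
With @belapsed from BenchmarkTools, and 128x128 Float32 matrices set up roughly as follows:

julia> using BenchmarkTools

julia> M = K = N = 128;

julia> A = rand(Float32, M, K); B = rand(Float32, K, N);

julia> C0 = A * B; C1 = similar(C0);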

julia> AmulB!(C1,A,B); C1 ≈ C0
true

julia> 2e-9*M*K*N/@belapsed(AmulB!($C1,$A,$B)) # 233 GFLOPS
233.71804301794273

Using Float32 and 128x128 matrices, I get 233 GFLOPS on my 10980XE.

julia> 4*32*2 # 4 GHz * 32 flop/instr * 2 instr/clock
256

julia> 233/256
0.91015625

This is about 91% of the theoretical peak for this CPU.
Most other combinations drop performance slightly for various reasons:

julia> 2e-9*M*K*N/@belapsed(AmulB!($C1,$A',$B')) # scatter stores
207.97857886646506

julia> 2e-9*M*K*N/@belapsed(AmulB!($C1,$A,$B')) # traverses B too quickly
209.19221945137159

julia> 2e-9*M*K*N/@belapsed(AmulB!($C1,$A',$B)) # needs packing
106.37072354239051

Transposing C1 = A*B is equivalent to C1' = B'*A', so performance is again equivalent.

julia> 2e-9*M*K*N/@belapsed(AmulB!($C1',$A',$B'))
233.91355752607217

Note that this is all single threaded. Using @tturbo instead and starting Julia with the default -t4, so that I use 4 threads as in their example:
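
Here AmulBt! is the same kernel with @tturbo in place of @turbo, roughly:

julia> function AmulBt!(C,A,B)
           @tturbo for n = indices((C,B),2), m = indices((C,A),1)
               Cmn = zero(eltype(C))
               for k = indices((A,B),(2,1))
                   Cmn += A[m,k]*B[k,n]
               end
               C[m,n]=Cmn
           end
       end
AmulBt! (generic function with 1 method)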

julia> 2e-9*M*K*N/@belapsed(AmulBt!($C1',$A',$B')) # 852 GFLOPS!
852.3028332559221

My 10980XE is probably twice as fast as the CPU they benchmarked on (AVX512 w/ 2 FMA/core), so keep that in mind. >100 GFLOPS/core should still be expected.

8 Likes

I’m old enough to remember when Swift for TensorFlow was going to be the biggest programming language advance for ML in decades. Swift for TensorFlow (TensorFlow Meets) - YouTube

15 Likes

Mojo isn’t just a syntax or PL step forward, it is a massive step forward in compiler architecture.
-Chris

But can Julia readily tap into this improvement?

MLIR is an intermediate representation that sits at a higher level than LLVM IR. Being an IR, it wouldn’t be all that difficult to add an MLIR layer to Julia - that’s the whole point of IRs. No changes to the language itself would be required. An MLIR layer could definitely offer more opportunities for optimization, but of course it would take some development effort. One could imagine that if the amount of money being poured into Mojo went into developing an MLIR layer (and optimizations) for Julia instead, Julia would be out in front performance-wise before Mojo reaches maturity.

5 Likes