Did Modular just reinvent Julia?

Just kind of thinking out loud here… what if Julia did add an MLIR layer? Would that mean that Julia could then pretty easily tap into the Mojo compiler infrastructure?

2 Likes

@vchuravy did add this pull request to MLIR.jl. I think some reactions to the Mojo announcement are happening? Core developers would need to confirm that though, I’m just speculating :grimacing:

edit: Just saw that it was mentioned in the Slack.

I don’t know. I’ve seen several things that point that direction. Here is Chris L on HN:

The reason Mojo is way way faster than Python is because it gives programmers control over static behavior and makes it super easy to adopt incrementally where it makes sense. The key payoff of this is that the compilation process is quite simple, there are no JITs required, you get predictable and controllable performance, and you still get dynamism where you ask for it.

Mojo doesn’t aim to magically make Python code faster with no work (though it does accidentally do that a little bit), Mojo gives you control so you can care about performance where you want to. 80/20 rule and all that.

Looks like maybe Python with a hopefully-more-ergonomic Rust built in that you can apply judiciously. And no word on how to address Python’s leaky abstractions (e.g. sys._getframe). I suppose you either run that code slowly by calling out to (single-threaded) Python, or reimplement it. I’m not sure what Mojo is. But it’s not Julia with Python syntax.

Re: MLIR. My understanding or guess has been that Julia compiler people are interested in the idea of using MLIR. But there are many pressing problems and few devs. What’s the path to solving problems with MLIR, and how long will it take?

2 Likes

Judging by this thread, we do not have performance comparisons with Julia, do we?

No, you can sign up to their waitlist and be granted access to the “playground”. More generally, they are still in the alpha stage; things like memory management and lots of other features are not implemented yet, so it will be a while before we can do any meaningful performance comparisons.

2 Likes

The language claims to be a perfect fit for AI. Why not scientific computation? Is it just a marketing strategy, or is there something inherent in its architecture, like 0-based indexing, math notation, etc.?

5 Likes

That was not a reaction to Mojo, but a continuation of research efforts at MIT that started several years ago.

4 Likes

I also don’t see anything about auto-differentiation. Is the language AD-friendly in any way?

3 Likes

Well summarized!

Let’s be honest, as long as the performance is there, IRs are not important for non-specialist programmers. And one key takeaway is the “supersetness” of Python, with performance. That’s why I see it more as a competitor to Numba (even though one could argue that Python is a competitor to Julia, in which case, by transitivity, maybe …), in that people want better performance with the same code.
As the syntax is a superset of Python’s, if it works in Python, it should work in Mojo.

The level of performance gain that has been showcased is quite impressive, but at the expense of a syntax that looks (at first glance) quite cumbersome to me. Maybe I’ll change my mind later, but for now this is quite far from Python’s or Julia’s simplicity.

So I guess the two use cases are something like:

  1. People (likely experts) in ML who want to push performance to hardware limits, tapping into the “superset” part
  2. “Layman” programmers who want a quick & easy way to improve performance, by “just” changing the interpreter / compiler

Julia, on the other hand, is a better performance / complexity compromise, but at the expense of a new syntax.

It’s a bit like:
performance, simplicity, familiar syntax : choose two.

EDIT:
This may be why Julia should (IMHO) grow beyond its ML niche - by allowing e.g. small binary sizes :innocent: - to mitigate the risk that this niche could disappear.

15 Likes

Totally agree. Small binaries and DLLs are important and key for Julia to be adopted in an enterprise environment. It would be great if Julia could be backed by a big company. The JuliaCon video of Jeremy Howard in the link I posted above is a good discussion about Julia. I would really love to see Julia succeed; however, I see the very existence of projects like Mojo as a serious threat to Julia…

12 Likes

Be careful what you wish for. Big companies are notorious for dropping projects (open source or internal) on a whim, at which point the decision is treated as a signal on the merits of the project by a lot of others, regardless of the actual potential or quality of the project.

I would rather have a more organic growth path for Julia.

I don’t. Competition is always healthy, and Julia has learned and continues to learn a lot from other languages.

Also, in this particular case, I don’t see a way to try out Mojo on my own computer. There is a “playground”, but no source code or executable. And, importantly, no license, so we do not know if it will be open source or not. I am much more convinced by announcements like this one — “hey, we created this new language, get the source here”.

34 Likes

I did look at that after somebody plugged it on r/Python, but optimised Mojo code is nothing like what you’d have in normal Python.

I looked at the example in Modular Docs - Matrix multiplication in Mojo; the version that’s essentially the same code as the original Python code only gave a 9x speedup.

For the 300x speedup it’s basically the same as rewriting the code in another language - a strictly typed Python dialect. And for the 500x speedup one needs to define new structs and vectorise the code manually (OK, vectorisation is something Pythonistas are used to, but it doesn’t help with code readability).

So Mojo is cool if you need to work in Python, but the effort to optimise is considerable. It seems to solve the two-language-problem by creating a low-level version of Python.

A quick 1000x1000 matrix-multiplication benchmark (I didn’t replicate the Mojo benchmark, I had this on my drive already):

import numpy as np

import time

N = 1000
A = np.random.rand(N, N)
B = np.random.rand(N, N)

def matrixmult_i(A, B):
    N = np.shape(A)[1]
    C = np.zeros(np.shape(A))
    for i in range(0, N):          # i (rows of C) in the outer loop
        for k in range(0, N):
            for j in range(0, N):  # j (columns of C) in the innermost loop
                C[i, j] = C[i, j] + A[i, k] * B[k, j]
    return C

# numpy's built-in matrix multiplication, for the "numpy" timing below
tic = time.perf_counter()
A @ B
toc = time.perf_counter()
print(f"numpy: {toc - tic:0.4f} seconds")

tic = time.perf_counter()
matrixmult_i(A, B)
toc = time.perf_counter()
print(f"loop i outer: {toc - tic:0.4f} seconds")
# (the 'loop j outer' timing below comes from an analogous variant with j in the outer loop)

Results:

$ python python.py
numpy: 0.0160 seconds
loop i outer: 313.3334 seconds
loop j outer: 317.3559 seconds
The same benchmark in Julia:

N = 1000
A = randn(N, N)
B = randn(N, N)

using BenchmarkTools

print("A*B: ")
@btime A*B

##
function matrixmult_j(m1, m2)
    n = size(m1, 1)
    m = size(m2, 2)
    l = size(m1, 2)

    m3 = zeros(m, n)

    for j in 1:n
        for k in 1:l
            for i in 1:m
                m3[i, j] = m3[i, j] + m1[i, k] * m2[k, j]
            end
        end
    end
    return m3
end


print("loop j outer: ")
@btime matrixmult_j(A,B)

The optimised code looks like this - find the difference :wink: :

using LoopVectorization # This, together with @avx, is basically what -O3 does in the Fortran case

function matrixmult_j_avx(m1, m2)
    n = size(m1, 1)
    m = size(m2, 2)
    l = size(m1, 2)

    m3 = zeros(m, n)

    @avx for j in 1:n
        for k in 1:l
            for i in 1:m
                m3[i, j] = m3[i, j] + m1[i, k] * m2[k, j]
            end
        end
    end
    return m3
end



print("loop j outer - optimised: ")
@btime matrixmult_j_avx(A,B)

So close-to-maximum speed is reached without any low-level interference in the code (four extra characters in the loop, plus loading the optimisation module).

For reference: the optimal version operates in-place on pre-allocated matrices, which is also achievable without any low-level acrobatics:

## Preallocate result matrix
m3 = zeros(N,N)

## ! is just a naming convention for in-place functions
## pass the pre-allocated result matrix as the first argument (a convention; Julia doesn't actually care about it):

function matrixmult_j_avx_inplace!(m3, m1, m2)
    n = size(m1, 1)
    m = size(m2, 2)
    l = size(m1, 2)
    @avx for j in 1:n
        for k in 1:l
            for i in 1:m
                m3[i, j] = m3[i, j] + m1[i, k] * m2[k, j]
            end
        end
    end
    return m3
end

print("loop j outer - optimised, inplace: ")
@btime matrixmult_j_avx_inplace!(m3,A,B)

So Julia stays a high-level language, even when you optimise to the limit.

Results:

$ julia julia.jl 
A*B:   9.765 ms (2 allocations: 7.63 MiB)
loop i outer:   2.009 s (2 allocations: 7.63 MiB)
loop j outer:   698.976 ms (2 allocations: 7.63 MiB)
loop j outer - inbounds:   169.944 ms (2 allocations: 7.63 MiB)
loop j outer - optimised:   34.891 ms (2 allocations: 7.63 MiB)
loop j outer - optimised, inplace:   34.356 ms (0 allocations: 0 bytes)

So with Julia, without any low-level shenanigans, the speedup over Python seems to be a lot better than anything that can be achieved in Mojo with considerable rewriting.

For completeness:

>> Matlab
A*B: Elapsed time is 0.012897 seconds.
loop i outer: Elapsed time is 1.631886 seconds.
loop j outer: Elapsed time is 1.188649 seconds.
$ ./fortranO0 #(unoptimised)
 matmul:    3.14330012E-02
 loop i outer:    2.54097891    
 loop j outer:    2.42137980   
$ ./fortranO3
 matmul:    2.84690000E-02
 loop i outer:   0.868086994    
 loop j outer:    8.28260183E-02

Note: Fortran is slow because I use gfortran and standard libraries. Should be better with another compiler. But hey: on a standard Linux installation, Julia is faster than Fortran!

Edit: I just tried the same benchmark on my old machine and there the LoopVectorization in Julia isn’t quite as efficient. But that may be SIMD progress between i7-4 and i7-11?

Edit2: I am by no means a low-level programmer any more (I was one back when Fortran 90 was the thing everybody was moving towards), so there may be more performance to be had, but my point here is that we do NOT need to resort to low-level tricks; we can use normal high-level code to achieve high performance.

21 Likes

I agree that the Mojo examples seem quite cumbersome, but using @avx (or @turbo) is not really fair. That’s a macro that rewrites the code, doing the cumbersome stuff for you. Such a macro could be written in Mojo (in principle), as far as I could see.
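
To illustrate the point (a small example of my own, not from the Mojo docs): @turbo is an ordinary Julia macro, so the source-to-source rewrite it performs can be inspected directly, while the loop you wrote stays exactly as written.

using LoopVectorization

x = rand(1000); y = similar(x)

# Show the rewritten (vectorized/unrolled) code that @turbo generates from the plain loop.
@macroexpand @turbo for i in eachindex(x, y)
    y[i] = 2 * x[i]
end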

7 Likes

The optimized one used a Python decorator and some functions that seem to be doing the SIMD for you under the hood. I wouldn’t say that using our version of that is wrong.

1 Like

I’m not saying it is wrong, but it doesn’t illustrate a fundamental difference between the languages, only that one convenient macro was written for Julia which currently has no equivalent there.

I find Julia syntax much nicer than Python’s, with or without Mojo, but I guess that in both languages one could (will?) reach the point where not even that macro is needed, if the loop is written with provably in-bounds accesses.
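
For what it’s worth, here is a rough sketch (my own, with a hypothetical function name) of what you can already write today: iterating over axes() and asserting in-bounds access by hand, so the compiler’s own vectorizer can do the work without a rewriting macro. The hope above is that the "provably in-bounds" part would eventually not even need the annotations.

# C is assumed to be pre-zeroed, e.g. C = zeros(size(A, 1), size(B, 2))
function matmul_plain!(C, A, B)
    for j in axes(B, 2), k in axes(A, 2)
        @simd for i in axes(A, 1)
            @inbounds C[i, j] += A[i, k] * B[k, j]
        end
    end
    return C
end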

2 Likes

That’s the thing, I disagree that they don’t have fancy macros:

# Autotune the tile size used in the matmul.
@adaptive
fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            for n in range(y, y + tile_y):
                @parameter
                fn dot[nelts : Int](k : Int):
                    C[m,n] += (A.load[nelts](m,k+x) * B.load_tr[nelts](k+x,y)).reduce_add()
                vectorize_unroll[nelts, tile_x // nelts, dot](tile_x)

        # Instead of hardcoding to tile_size = 4, search for the fastest
        # tile size by evaluating this function as tile size varies.
        alias tile_size = autotune(1, 2, 4, 8, 16, 32, 64)
        tile[calc_tile, nelts * tile_size, nelts*tile_size](A.cols, C.cols)
      
    parallelize[calc_row](C.rows)

I see a vectorize_unroll that does something like what @avx does. It also has the @adaptive decorator that seems to autotune the loop unrolling. That looks like an equivalent to me.

1 Like

That’s certainly much more complicated, syntax-wise, than @turbo. But is that a fundamental limitation of their macro capabilities? I guess it is not.

1 Like

I’d also say that, macros or not, the base code for the macros has no resemblance to the original naive code anymore. It’s a complete rewrite in a statically typed version of Python.

I wouldn’t want to do that on a larger code base. Adding an @avx now and then (without having to change the original code at all) is what makes it so powerful in Julia.

6 Likes

I think the main limitation with anything Python-based is that you’ll have to go through the whole code and statically type all variables. Otherwise the compiler won’t be able to compile it.

If you use duck-typed code with variable types at run-time, then that won’t work.

In Julia we get the benefit of multiple dispatch, which means I don’t need to statically type, unless I want to, as long as I follow some simple rules to allow the compiler to infer the types (and compile a new instance of the function as needed).
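
A minimal sketch of what I mean (my own toy example): one untyped, generic definition, and Julia compiles a fresh specialized method for each concrete argument type it actually encounters, as long as the code is type-inferable.

# No type annotations anywhere; the compiler infers everything at the call site.
function mysum(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x
    end
    return s
end

mysum(rand(10))            # compiles and runs a Vector{Float64} specialization
mysum(1:10)                # compiles and runs a UnitRange{Int} specialization
mysum(rand(Float32, 10))   # yet another specialization, still fully typed under the hood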

Does anybody know Python internals well enough to say if that’s a hard restriction?

I only have limited experience with Numba and PyPy and while promising, I had loads of problems when it came to real applications.

Anyhow: for my application, even if Python were the same speed as Julia, I wouldn’t want to go back to it. The SciML framework is just too powerful and unequaled in Python. And I much prefer Julia syntax, and - most importantly - 1-based indexing and not OO.

10 Likes

I’m not sure about that. So far I could not find an example of Mojo code-rewriting of the type done by LoopVectorization. I found references that Mojo’s metaprogramming is similar to Zig’s comptime, and as far as I can see this is a form of metaprogramming that is quite limited compared to macros as in Julia.

I see that comptime enables many optimizations that can also be done with macros, such as implementing generic types, and with reflection you can have code that automatically compiles to different implementations depending on the features of the actual type. But I don’t see how you can do the sort of code manipulation (treating several lines of code like data) required for something like LoopVectorization.
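
As a toy illustration of "code as data" (my own example, nothing to do with LoopVectorization’s actual internals): a Julia macro receives whole expressions as trees that it can walk and rewrite before anything is compiled.

# A macro that prints the AST of the loop body it receives and then emits the loop unchanged.
macro show_loop_body(ex)
    ex.head === :for || error("expected a for loop")
    println("loop body as data: ", ex.args[2])   # runs at macro-expansion time
    return esc(ex)                               # hand the (possibly rewritten) loop back
end

@show_loop_body for i in 1:3
    println(i)
end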

Maybe Mojo can do something like that at the MLIR level, but I’m not sure how comparable that is.

3 Likes