Could you give more details on this “something”? What would the Mojo solution look like?
Note that my question was not about the maximum performance you can get from Mojo, but about the possibility of writing simple code and having it automatically optimized.
In Mojo I can write optimized code. For example, starting with the non-optimized

```mojo
def matmul_untyped(C, A, B):
    for m in range(C.rows):
        for n in range(C.cols):
            for k in range(A.cols):
                C[m, n] += A[m, k] * B[k, n]
```
I can use SIMD by replacing the last line with

```mojo
C[m, n] += (A.load[nelts](m, k + x) * B.load_tr[nelts](k + x, n)).reduce_add()
```
I still need to handle the leftover scalars that don’t fit into the SIMD width. To solve this problem, I can wrap this line in a small `dot` function and call it with Mojo’s `vectorize`.
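Roughly, following the matmul example from the Mojo documentation (the `Matrix` type, `load`/`load_tr` helpers, and the exact `vectorize` signature are assumptions here, and have changed between Mojo versions):

```mojo
fn matmul_vectorized(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for n in range(C.cols):
            @parameter
            fn dot[width: Int](k: Int):
                # One SIMD-width chunk of the dot product.
                C[m, n] += (A.load[width](m, k) * B.load_tr[width](k, n)).reduce_add()
            # vectorize calls dot with the full SIMD width, then with
            # smaller widths to handle the scalar remainder.
            vectorize[dot, nelts](A.cols)
```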
I can parallelize my code by rewriting it as a function that operates on a single row and calling it with Mojo’s `parallelize`.
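A minimal sketch of that step; `parallelize` comes from Mojo’s `algorithm` module, and the exact signature is an assumption that may differ between Mojo versions:

```mojo
fn matmul_parallelized(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        # Same vectorized inner loops as before, for a single row m.
        for n in range(C.cols):
            @parameter
            fn dot[width: Int](k: Int):
                C[m, n] += (A.load[width](m, k) * B.load_tr[width](k, n)).reduce_add()
            vectorize[dot, nelts](A.cols)
    # Run calc_row for all rows across the available cores.
    parallelize[calc_row](C.rows)
```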
I can implement some manual tiling in my for loops.
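For illustration, a sketch of what such manual tiling could look like, with hypothetical tile sizes (the point is to iterate over blocks of `n` and `k` so that the touched parts of `B` stay in cache):

```mojo
alias tile_n = 64  # hypothetical tile sizes; would need tuning
alias tile_k = 64

fn matmul_tiled(C: Matrix, A: Matrix, B: Matrix):
    for n0 in range(0, C.cols, tile_n):
        for k0 in range(0, A.cols, tile_k):
            for m in range(C.rows):
                for n in range(n0, min(n0 + tile_n, C.cols)):
                    for k in range(k0, min(k0 + tile_k, A.cols)):
                        C[m, n] += A[m, k] * B[k, n]
```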
In Julia I can write the non-optimized triple loop whose inner statement is `C[m,n] += A[m,k] * B[k,n]` and get all of these optimizations automatically by adding a `@turbo` call in front of the main `for` loop.
The `@turbo` macro from LoopVectorization will basically rewrite my code to apply the same optimizations as in the Mojo example. (Tiling is not currently implemented in LoopVectorization, though it could be. And, for some reason, the LoopVectorization version without tiling seems to be faster than the Mojo version with tiling.)
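Concretely, this is close to the canonical matmul example from the LoopVectorization documentation (function name is mine):

```julia
using LoopVectorization

function matmul!(C, A, B)
    # The plain, readable triple loop; @turbo vectorizes, unrolls,
    # and reorders it automatically.
    @turbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end
```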
I’d love to see an example of how `comptime` (or another Mojo/Zig feature) can be used to do the kind of automatic optimization of simple code that LoopVectorization does in Julia.
(The successor of LoopVectorization being written in C++ is interesting, but a bit orthogonal to what is possible in Mojo and Julia.)