Hello all! Super excited to use Julia. It won out in a lengthy evaluation of mine against Rust, “modern” C++ (>= C++23), modern Java (>= Java 24), Mojo, etc. as the basis of my startup’s backend. Thank you to all who have made Julia and its community what they are!
One curiosity on my mind is whether Julia might benefit from its primitive datatypes being SIMD by default, as in Mojo (e.g. Int64 actually being Vec{1,Int64}). Perhaps loop autovectorization + SIMD.jl render irrelevant any purported benefits of SIMD-by-default datatypes. But since I’m by no means a Julia expert yet, I figured I’d ask!
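To make the question concrete, here’s a rough sketch of what I have in mind, expressed with Julia’s existing VecElement (the Int64x1 alias and the two helpers are just names I made up for illustration):

# Hypothetical illustration: Int64 is a plain scalar primitive in Julia,
# while a homogeneous tuple of VecElements is the existing opt-in way to get
# an LLVM vector type, so NTuple{1,VecElement{Int64}} is roughly the
# "1-wide SIMD" integer that Mojo uses by default.
const Int64x1 = NTuple{1,VecElement{Int64}}

as_vec(x::Int64)      = (VecElement(x),)   # wrap the scalar as a 1-wide vector
as_scalar(v::Int64x1) = v[1].value         # unwrap it back to a scalar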
I’m not sure I understand what you’re asking. What does it mean for Int to be a 1-element SIMD vector by default rather than just a scalar primitive value? That is, what would that enable, or make easier?
It’s worth noting that you don’t even need to use SIMD.jl or @simd or even a for loop to get vectorization in many cases. Often just having elements in a tuple or array is enough. For example, adding 4 Ints in a tuple on my machine generates <4 x i64> LLVM IR and a pair of .2d ARM vector instructions.
julia> f(x) = (x[1]+x[2])+(x[3]+x[4])
f (generic function with 1 method)
julia> @code_llvm debuginfo=:none f((1,2,3,4))
; Function Signature: f(NTuple{4, Int64})
define i64 @julia_f_1604(ptr nocapture noundef nonnull readonly align 8 dereferenceable(32) %"x::Tuple") local_unnamed_addr #0 {
top:
%0 = load <4 x i64>, ptr %"x::Tuple", align 8
%1 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %0)
ret i64 %1
}
Or, to take the example from julia#55118, even without the explicit VecElement:
julia> f(a, b) = a .* b
f (generic function with 1 method)
julia> a = Float32.(sqrt.(((1:4)...,)))
(1.0f0, 1.4142135f0, 1.7320508f0, 2.0f0)
julia> @code_llvm debuginfo=:none f(a,a)
; Function Signature: f(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_f_1783(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
%0 = load <4 x float>, ptr %"a::Tuple", align 4
%1 = load <4 x float>, ptr %"b::Tuple", align 4
%2 = fmul <4 x float> %0, %1
store <4 x float> %2, ptr %sret_return, align 4
ret void
}
There’s nothing special about broadcasting; primitive recursion (or explicit indexing) also gets this treatment:
julia> g(a, b) = (a[1]*b[1], g(a[2:end], b[2:end])...)
g (generic function with 1 method)
julia> g(::Tuple{}, ::Tuple{}) = ()
g (generic function with 2 methods)
julia> @code_llvm debuginfo=:none g(a,a)
; Function Signature: g(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_g_2583(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
%0 = load <4 x float>, ptr %"a::Tuple", align 4
%1 = load <4 x float>, ptr %"b::Tuple", align 4
%2 = fmul <4 x float> %0, %1
store <4 x float> %2, ptr %sret_return, align 4
ret void
}
It can be fiddly in some cases, though; that’s when you’d want to reach for a tuple of VecElements of a given width. I’m not a compiler person, but I’d wager that the difficulties preventing non-VecElement code from SIMD’ing would be exactly the same as the difficulties in coalescing multiple 1-element SIMD-able types into an N-way SIMD.
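For reference, a minimal sketch of that explicit route (Vec4f and vmul are names I’m making up here): a homogeneous tuple of VecElements is passed and returned as an LLVM vector type, so the elementwise multiply below should lower to a single <4 x float> fmul, much like the tuple examples above.

# Made-up names, minimal sketch of explicit VecElement usage.
const Vec4f = NTuple{4,VecElement{Float32}}

# Homogeneous VecElement tuples map to LLVM <4 x float>, so this should
# compile to vector instructions without relying on auto-vectorization.
vmul(a::Vec4f, b::Vec4f) = ntuple(i -> VecElement(a[i].value * b[i].value), Val(4))

# usage: vmul(map(VecElement, (1f0, 2f0, 3f0, 4f0)), map(VecElement, (5f0, 6f0, 7f0, 8f0)))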
Looks like a scalar value really is a SIMD vector with 1 semantic element. So what happens when it’s a field in a struct? Does it just take up 4x the space?
Auto-vectorization is unreliable.
I made a comparison between ForwardDiff.jl and a C++ forward-mode AD library here, and the C++ library’s more efficient memory management plus SIMD were the major reasons the Julia implementation performed worse:
Mojo takes a no-autovectorization approach: it disables those compiler passes. You instead need to be explicit, e.g. using x_simd = x.vectorize[1,simd_width]() instead of x, or using functional tools that map over ranges and evaluate with SIMD.
For peak performance, you need expertise + control. Mojo is currently focused on implementing algorithms like matrix multiply and flash attention with state of the art performance, so this focus makes sense.
For most code that most people write, it isn’t worth the time/effort to manually vectorize, tile, and thread, even if they have the expertise to know how to optimize it.
So relying on auto-vectorization is reasonable for a lot of code, IMO. Just not things in the matmul class, where Mojo is still focused.
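To illustrate the style I mean by “relying on auto-vectorization” (a sketch, not code from any benchmark): a plain loop like the one below is the sort of thing the compiler usually vectorizes on its own, with @inbounds and @simd just removing the bounds checks and reassociation restrictions that most commonly block it.

# Sketch: the "just let the compiler auto-vectorize" style.
# @inbounds drops bounds checks; @simd allows reassociating the
# floating-point reduction. Both commonly unblock vectorization.
function dot_auto(a::Vector{Float32}, b::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end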
All views here are my own and don’t express anyone else’s opinions/intentions/etc.; they are merely my perspective.
Great discussion! Love the thoughtfulness and engagement of the community here.
In my mind, that’s what Mojo’s “SIMD by default” approach intends to overcome. But I haven’t dissected Mojo internals to really know.
By this I assume you’re implying that loop autovectorization would not actually be easier with SIMD-by-default primitives. I think that’s quite likely. My experience is primarily on the JVM, where it’s very easy to get off the autovectorization happy path. I assume any compiler — JVM, LLVM, or otherwise — is going to have this challenge.
On the JVM, SIMD code is very explicit, very verbose (not as much as hand-rolled assembly of course, but nothing as breezy as the multiple-dispatch-enabled SIMD.jl API), and looks very different from scalar operations. What I’m gathering from these posts is that both the ergonomics and the performance of SIMD code in Julia are already good, and that SIMD-by-default datatypes wouldn’t affect this either way.
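For comparison, this is the kind of SIMD.jl snippet I have in mind (an untested sketch on my part): the vectors are explicit, but the arithmetic still reads like scalar code because the operators are just dispatched methods on Vec.

using SIMD  # https://github.com/eschnett/SIMD.jl

# Untested sketch: explicit 4-wide vectors with ordinary-looking arithmetic,
# thanks to multiple dispatch on Vec{4,Float32}.
a = Vec{4,Float32}((1f0, 2f0, 3f0, 4f0))
b = Vec{4,Float32}((5f0, 6f0, 7f0, 8f0))
c = a * b + b        # elementwise multiply-add
s = sum(c)           # horizontal reduction to a scalar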
That said, maybe my assessment is off, given this information:
I guess arithmetically the strategy is: LLVM - Vectorization = LLM + Mojo Vectorization :-).
Being serious, given that there are enough examples of such Mojo code, this strategy can have a lot of synergy with the rise of LLMs as co-developers.
Modular has released many of their kernels as open source lately; one of them is supposedly the world’s fastest GEMM. That is a great start for such “material” data.
Thanks @tecosaur! It’s called Onton (onton.com). We’re helping people make purchase decisions they feel great about. We’ve started in home decor and furniture, but are expanding to other verticals this year. We have a multimodal search engine and some image generation tools. Recently we built a neurosymbolic self-learning agent that overcomes the LLM hallucination problem, and that’s going to become the basis of our search engine as time goes on.
Fun side fact (since I suspect Julians are more likely than many to be philosophically inclined) — the name “Onton” came from a term I introduced in an unpublished paper a long time ago. My cofounder liked it so we used it when we rebranded.
We take what exists to be all and only metaphysical or ontological simples beyond which distinction is impossible. We term these “ontons”. Ontons may be the fields composing the fundamental particles of standard quantum mechanics … However, given the unending nature of the refinements made to empirical inquiry over the centuries and even over the past few decades, it is likely that those particles we consider fundamental are not truly fundamental after all. Thus, we define ontons as being whatever the most fundamental units of being actually are. Whether or not we have observed them directly, we cannot know.
That sounds super cool and I have bookmarked Onton now. How do you avoid hallucinations? I didn’t know that was possible (albeit I am not an AI engineer).