Datatypes SIMD by default? (as in Mojo)

Hello all! Super excited to use Julia. It won out in a lengthy evaluation of mine against Rust, “modern” C++ (>= C++23), modern Java (>= Java 24), Mojo, etc. as the basis of my startup’s backend. Thank you to all who have made Julia and its community what they are!

One curiosity on my mind is whether Julia might benefit from its primitive datatypes being SIMD by default, as in Mojo (e.g. Int64 actually being Vec{1,Int64}). Perhaps loop autovectorization + SIMD.jl render irrelevant any purported benefits of SIMD-by-default datatypes. But since I’m by no means a Julia expert yet, I figured I’d ask!
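For concreteness, here’s a rough sketch of the Mojo model in Julia-flavored code. This Vec wrapper is hypothetical, purely to illustrate the idea that a scalar would just be the width-1 case of a general vector type:

# Hypothetical sketch, not real Base Julia: in Mojo's model every scalar
# is the width-1 special case of a general SIMD vector type.
struct Vec{N,T}
    data::NTuple{N,VecElement{T}}
end

const ScalarInt64 = Vec{1,Int64}  # "Int64" would really be a 1-wide vector

# scalars and wider vectors would then share one elementwise definition
Base.:+(a::Vec{N,T}, b::Vec{N,T}) where {N,T} =
    Vec{N,T}(ntuple(i -> VecElement(a.data[i].value + b.data[i].value), Val(N)))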

13 Likes

That’s cool! Best of luck with your startup, what sort of problem area are you planning on operating in?

For now you’ll likely continue to need to use SIMD.jl and friends, but there is some work on the backburner to make SIMD with Base types work better: Add support for intrinsics for `NTuple{VecElement}` by oscardssmith · Pull Request #55118 · JuliaLang/julia · GitHub
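A minimal SIMD.jl sketch of the explicit style that implies, with illustrative values:

julia> using SIMD

julia> v = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0));  # one <4 x double> value

julia> w = v + v;  # a single elementwise vector add

julia> sum(w)  # horizontal reduction
20.0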

3 Likes

I’m not sure I understand what you’re asking. What does it mean for Int to be a 1-element SIMD vector by default rather than just a scalar primitive value? That is, what does that enable or make easier?

1 Like

It’s worth noting that you don’t even need to use SIMD.jl or @simd or even a for loop to get vectorization in many cases. Often just having elements in a tuple or array is enough. For example, adding 4 Ints in a tuple on my machine generates <4 x i64> LLVM IR and a pair of .2d ARM vector instructions.

julia> f(x) = (x[1]+x[2])+(x[3]+x[4])
f (generic function with 1 method)

julia> @code_llvm debuginfo=:none f((1,2,3,4))
; Function Signature: f(NTuple{4, Int64})
define i64 @julia_f_1604(ptr nocapture noundef nonnull readonly align 8 dereferenceable(32) %"x::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x i64>, ptr %"x::Tuple", align 8
  %1 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %0)
  ret i64 %1
}

Or to take the example from julia#55118, even without the explicit VecElement:

julia> f(a, b) = a .* b
f (generic function with 1 method)

julia> a = Float32.(sqrt.(((1:4)...,)))
(1.0f0, 1.4142135f0, 1.7320508f0, 2.0f0)

julia> @code_llvm debuginfo=:none f(a,a)
; Function Signature: f(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_f_1783(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x float>, ptr %"a::Tuple", align 4
  %1 = load <4 x float>, ptr %"b::Tuple", align 4
  %2 = fmul <4 x float> %0, %1
  store <4 x float> %2, ptr %sret_return, align 4
  ret void
}

There’s nothing special about broadcasting; primitive recursion (or explicit indexing) also gets this treatment:

julia> g(a, b) = (a[1]*b[1], g(a[2:end], b[2:end])...)
g (generic function with 1 method)

julia> g(::Tuple{}, ::Tuple{}) = ()
g (generic function with 2 methods)

julia> @code_llvm debuginfo=:none g(a,a)
; Function Signature: g(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_g_2583(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x float>, ptr %"a::Tuple", align 4
  %1 = load <4 x float>, ptr %"b::Tuple", align 4
  %2 = fmul <4 x float> %0, %1
  store <4 x float> %2, ptr %sret_return, align 4
  ret void
}

It can be fiddly in some cases, though; that’s when you’d want to reach for a VecElement of a given width. I’m not a compiler person, but I’d wager the difficulties that prevent a non-VecElement tuple from SIMD’ing would be exactly the same as the difficulties in coalescing multiple 1-element SIMD-able types into an N-way SIMD.
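For completeness, here’s what explicitly reaching for VecElement looks like with Base.llvmcall; this vmul helper is just an illustration of the pattern (SIMD.jl wraps this sort of thing for you):

const Vec4 = NTuple{4,VecElement{Float64}}

# hand-written 4-wide multiply; with two arguments (%0 and %1) the entry
# block implicitly takes %2, so the first instruction is %3
vmul(a::Vec4, b::Vec4) = Base.llvmcall(
    """
    %3 = fmul <4 x double> %0, %1
    ret <4 x double> %3
    """,
    Vec4, Tuple{Vec4,Vec4}, a, b)

a = map(VecElement, (1.0, 2.0, 3.0, 4.0))
vmul(a, a)  # compiles to a single fmul <4 x double>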

8 Likes

I’m a little confused. When I do this

julia> x = (1,2,3,4);

julia> @code_llvm debuginfo=:none sum(x)
; Function Signature: sum(NTuple{4, Int64})
define i64 @julia_sum_4537(ptr nocapture noundef nonnull readonly align 8 dereferenceable(32) %"a::Tuple") #0 {
top:
  %0 = load <4 x i64>, ptr %"a::Tuple", align 8
  %1 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %0)
  ret i64 %1
}

I get what you got. But if I use a vector instead of a tuple, I do not.
What’s happening?

julia> y = [1,2,3,4];

julia> @code_llvm sum(y)
; Function Signature: sum(Array{Int64, 1})
;  @ reducedim.jl:979 within `sum`
define i64 @julia_sum_4474(ptr noundef nonnull align 8 dereferenceable(24) %"a::Array") #0 {
top:
; ┌ @ reducedim.jl:979 within `#sum#734`
; │┌ @ reducedim.jl:983 within `_sum`
; ││┌ @ reducedim.jl:983 within `#_sum#736`
; │││┌ @ reducedim.jl:984 within `_sum`
; ││││┌ @ reducedim.jl:984 within `#_sum#737`
; │││││┌ @ reducedim.jl:326 within `mapreduce`
; ││││││┌ @ reducedim.jl:326 within `#mapreduce#727`
; │││││││┌ @ reducedim.jl:334 within `_mapreduce_dim`
; ││││││││┌ @ reduce.jl:418 within `_mapreduce`
; │││││││││┌ @ indices.jl:536 within `LinearIndices`
; ││││││││││┌ @ abstractarray.jl:98 within `axes`
; │││││││││││┌ @ array.jl:194 within `size`
              %"a::Array.size_ptr" = getelementptr inbounds i8, ptr %"a::Array", i64 16
              %"a::Array.size.0.copyload" = load i64, ptr %"a::Array.size_ptr", align 8
; │││││││││└└└
; │││││││││ @ reduce.jl:420 within `_mapreduce`
           switch i64 %"a::Array.size.0.copyload", label %L28 [
    i64 0, label %L104
    i64 1, label %L23
  ]

L23:                                              ; preds = %top
; │││││││││ @ reduce.jl:423 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data = load ptr, ptr %"a::Array", align 8
            %0 = load i64, ptr %memoryref_data, align 8
; │││││││││└
; │││││││││ @ reduce.jl:424 within `_mapreduce`
           br label %L104

L28:                                              ; preds = %top
; │││││││││ @ reduce.jl:425 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %1 = icmp sgt i64 %"a::Array.size.0.copyload", 15
; │││││││││└
           br i1 %1, label %L93, label %L44

L44:                                              ; preds = %L28
; │││││││││ @ reduce.jl:427 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data10 = load ptr, ptr %"a::Array", align 8
            %2 = load i64, ptr %memoryref_data10, align 8
; │││││││││└
; │││││││││ @ reduce.jl:428 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data27 = getelementptr inbounds i64, ptr %memoryref_data10, i64 1
            %3 = load i64, ptr %memoryref_data27, align 8
; │││││││││└
; │││││││││ @ reduce.jl:429 within `_mapreduce`
; │││││││││┌ @ reduce.jl:19 within `add_sum`
; ││││││││││┌ @ int.jl:87 within `+`
             %4 = add i64 %3, %2
; │││││││││└└
; │││││││││ @ reduce.jl:430 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %.not5455 = icmp sgt i64 %"a::Array.size.0.copyload", 2
; │││││││││└
           br i1 %.not5455, label %L71, label %L104

L71:                                              ; preds = %L71, %L44
           %value_phi2957 = phi i64 [ %5, %L71 ], [ 2, %L44 ]
           %value_phi2856 = phi i64 [ %7, %L71 ], [ %4, %L44 ]
; │││││││││ @ reduce.jl:431 within `_mapreduce`
; │││││││││┌ @ int.jl:87 within `+`
            %5 = add nuw nsw i64 %value_phi2957, 1
; │││││││││└
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data41 = getelementptr inbounds i64, ptr %memoryref_data10, i64 %value_phi2957
            %6 = load i64, ptr %memoryref_data41, align 8
; │││││││││└
; │││││││││ @ reduce.jl:432 within `_mapreduce`
; │││││││││┌ @ reduce.jl:19 within `add_sum`
; ││││││││││┌ @ int.jl:87 within `+`
             %7 = add i64 %6, %value_phi2856
; │││││││││└└
; │││││││││ @ reduce.jl:430 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %exitcond.not = icmp eq i64 %5, %"a::Array.size.0.copyload"
; │││││││││└
           br i1 %exitcond.not, label %L104, label %L71

L93:                                              ; preds = %L28
; │││││││││ @ reduce.jl:436 within `_mapreduce`
; │││││││││┌ @ reduce.jl:269 within `mapreduce_impl`
            %8 = call i64 @j_mapreduce_impl_4477(ptr nonnull %"a::Array", i64 signext 1, i64 signext %"a::Array.size.0.copyload", i64 signext 1024)
; │││││││││└
           br label %L104

L104:                                             ; preds = %L93, %L71, %L44, %L23, %top
           %value_phi = phi i64 [ %0, %L23 ], [ %8, %L93 ], [ 0, %top ], [ %4, %L44 ], [ %7, %L71 ]
; └└└└└└└└└
  ret i64 %value_phi
}

It’s hidden down in lower-level calls, like this line from the IR above:

  %8 = call i64 @j_mapreduce_impl_4477(ptr nonnull %"a::Array", i64 signext 1, i64 signext %"a::Array.size.0.copyload", i64 signext 1024)

The vectorized loop lives inside that j_mapreduce_impl function (note the 1024 blksize for pairwise reduction), so it doesn’t show up in this top-level listing.

An easier way to see this happen on a Vector would be to look at a regular loop instead:

julia> code_llvm(Tuple{Vector{Int}}; debuginfo=:none) do v
           s = 0
           for x in v
               s += x
           end
           s
       end
; Function Signature: var"#7"(Array{Int64, 1})
define i64 @"julia_#7_26406"(ptr noundef nonnull align 8 dereferenceable(24) %"v::Array") #0 {
top:
  %0 = getelementptr inbounds i8, ptr %"v::Array", i64 16
  %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
  %.not.not = icmp eq i64 %.size.sroa.0.0.copyload, 0
  br i1 %.not.not, label %L75, label %L34

L34:                                              ; preds = %top
  %1 = load ptr, ptr %"v::Array", align 8
  %2 = load i64, ptr %1, align 8
  %.not.not54.not59.not = icmp eq i64 %.size.sroa.0.0.copyload, 1
  br i1 %.not.not54.not59.not, label %L75, label %L69.preheader

L69.preheader:                                    ; preds = %L34
  %3 = add i64 %.size.sroa.0.0.copyload, -1
  %min.iters.check = icmp ult i64 %.size.sroa.0.0.copyload, 33
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L69.preheader
  %n.vec = and i64 %3, -32
  %ind.end = or i64 %n.vec, 1
  %ind.end62 = or i64 %n.vec, 2
  %4 = insertelement <8 x i64> <i64 poison, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0>, i64 %2, i64 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %4, %vector.ph ], [ %9, %vector.body ]
  %vec.phi64 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %10, %vector.body ]
  %vec.phi65 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %11, %vector.body ]
  %vec.phi66 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %12, %vector.body ]
  %offset.idx = or i64 %index, 1
  %5 = getelementptr inbounds i64, ptr %1, i64 %offset.idx
  %wide.load = load <8 x i64>, ptr %5, align 8
  %6 = getelementptr inbounds i64, ptr %5, i64 8
  %wide.load67 = load <8 x i64>, ptr %6, align 8
  %7 = getelementptr inbounds i64, ptr %5, i64 16
  %wide.load68 = load <8 x i64>, ptr %7, align 8
  %8 = getelementptr inbounds i64, ptr %5, i64 24
  %wide.load69 = load <8 x i64>, ptr %8, align 8
  %9 = add <8 x i64> %vec.phi, %wide.load
  %10 = add <8 x i64> %vec.phi64, %wide.load67
  %11 = add <8 x i64> %vec.phi65, %wide.load68
  %12 = add <8 x i64> %vec.phi66, %wide.load69
  %index.next = add nuw i64 %index, 32
  %13 = icmp eq i64 %index.next, %n.vec
  br i1 %13, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <8 x i64> %10, %9
  %bin.rdx70 = add <8 x i64> %11, %bin.rdx
  %bin.rdx71 = add <8 x i64> %12, %bin.rdx70
  %14 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %bin.rdx71)
  %cmp.n = icmp eq i64 %3, %n.vec
  br i1 %cmp.n, label %L75, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L69.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L69.preheader ]
  %bc.resume.val63 = phi i64 [ %ind.end62, %middle.block ], [ 2, %L69.preheader ]
  %bc.merge.rdx = phi i64 [ %14, %middle.block ], [ %2, %L69.preheader ]
  br label %L69

L69:                                              ; preds = %L69, %scalar.ph
  %15 = phi i64 [ %value_phi1660, %L69 ], [ %bc.resume.val, %scalar.ph ]
  %16 = phi i64 [ %20, %L69 ], [ %bc.merge.rdx, %scalar.ph ]
  %value_phi1660 = phi i64 [ %19, %L69 ], [ %bc.resume.val63, %scalar.ph ]
  %17 = getelementptr inbounds i64, ptr %1, i64 %15
  %18 = load i64, ptr %17, align 8
  %19 = add i64 %value_phi1660, 1
  %20 = add i64 %16, %18
  %exitcond.not = icmp eq i64 %value_phi1660, %.size.sroa.0.0.copyload
  br i1 %exitcond.not, label %L75, label %L69

L75:                                              ; preds = %L69, %middle.block, %L34, %top
  %value_phi46 = phi i64 [ 0, %top ], [ %2, %L34 ], [ %14, %middle.block ], [ %20, %L69 ]
  ret i64 %value_phi46
}

1 Like

Looks like a scalar value really is a SIMD vector with 1 semantic element. So what happens when it’s a field in a struct? Does it just take up 4x the space?

Why 4x? Why not 2x, 8x, 16x?

Anyway, no, it doesn’t take extra space.
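A quick way to check, using Base’s VecElement as a stand-in for a 1-wide vector field (a sketch, assuming that’s a fair model of the layout in question):

julia> struct Scalar
           x::Int64
       end

julia> struct OneWide
           x::VecElement{Int64}  # a 1-element "vector" field
       end

julia> sizeof(Scalar), sizeof(OneWide)  # no padding out to a register width
(8, 8)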

Auto-vectorization is unreliable.
I made a comparison between ForwardDiff.jl and a C++ forward-mode AD library here; the C++ library’s more efficient memory management plus explicit SIMD were the major contributors to the Julia implementation performing worse:

Mojo takes the approach of no autovectorization: it disables those compiler passes. You instead need to be explicit, e.g. using x_simd = x.vectorize[1, simd_width]() instead of x, or using functional tools that map over ranges and evaluate with SIMD.
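The rough Julia analogue of that explicit style is SIMD.jl. A minimal sketch, assuming the length is a multiple of the vector width (no remainder handling):

using SIMD

function vsum(xs::Vector{Float64})
    @assert length(xs) % 4 == 0  # sketch only: no tail loop
    acc = Vec{4,Float64}(0.0)
    @inbounds for i in 1:4:length(xs)
        acc += vload(Vec{4,Float64}, xs, i)  # explicit 4-wide load
    end
    return sum(acc)  # one horizontal reduction at the end
end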

For peak performance, you need expertise + control. Mojo is currently focused on implementing algorithms like matrix multiply and flash attention with state-of-the-art performance, so this focus makes sense.

For most code that most people write, it isn’t worth the time/effort to manually vectorize, tile, and thread, even if they have the expertise to know how to optimize it.
So relying on auto-vectorization is reasonable for a lot of code, IMO; just not for things in the matmul class, where Mojo is still focused.

All views here are my own only and don’t express anyone else’s opinions/intentions/etc.; they are merely my perspective.

12 Likes

Great discussion! Love the thoughtfulness and engagement of the community here.

In my mind, that’s what Mojo’s “SIMD by default” approach intends to overcome. But I haven’t dissected Mojo internals to really know.

By this I assume you’re implying that loop autovectorization would not actually be easier with SIMD-by-default primitives. I think that’s quite likely. My experience is primarily on the JVM, where it’s very easy to get off the autovectorization happy path. I assume any compiler — JVM, LLVM, or otherwise — is going to have this challenge.
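For instance, if I understand correctly, even a plain Float64 reduction in Julia stays scalar unless you grant the compiler permission to reassociate, which is exactly the kind of happy-path cliff I mean; a sketch:

# LLVM must preserve the sequential association of floating-point adds,
# so this loop stays scalar:
function sum_scalar(xs::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(xs)
        s += xs[i]
    end
    return s
end

# @simd explicitly permits reassociation, letting the loop vectorize:
function sum_simd(xs::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]
    end
    return s
end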

Having skimmed the blog post @Elrod linked and the associated PR (WIP: use SIMD.jl for partials by KristofferC · Pull Request #570 · JuliaDiff/ForwardDiff.jl · GitHub), I’m assuming that the following means that explicit SIMD.jl control made the Julia implementation go faster than whatever autovectorization was happening:

On the JVM, SIMD code is very explicit, very verbose (not as much as hand-rolled assembly of course, but nothing as breezy as the multiple-dispatch-enabled SIMD.jl API), and looks very different from scalar operations. What I’m gathering from these posts is that both the ergonomics and the performance of SIMD code in Julia are already good, and that SIMD-by-default datatypes wouldn’t affect this either way.
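To make the PR’s idea concrete for myself, here’s roughly how I picture it: the dual-number partials live in a SIMD.jl Vec, so their arithmetic is explicitly vectorized. A simplified sketch, not ForwardDiff’s actual definitions:

using SIMD

struct MyDual{N,T}
    value::T
    partials::Vec{N,T}  # all N partial derivatives in one SIMD register
end

# product rule: the partials arithmetic lowers to vector instructions
Base.:*(a::MyDual{N,T}, b::MyDual{N,T}) where {N,T} =
    MyDual(a.value * b.value, a.value * b.partials + b.value * a.partials)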

That said, maybe my assessment is off, given this information:

I guess arithmetically the strategy is: LLVM - Vectorization = LLM + Mojo Vectorization :-).

Being serious, given that there are enough examples of such Mojo code, this strategy can have a lot of synergy with the rise of LLMs as co-developers.
Modular has released many of their kernels as open source lately. One of them is said to be the world’s fastest GEMM. It is a great start for that “material” data.

Thanks @tecosaur! It’s called Onton (onton.com). We’re helping people make purchase decisions they feel great about. We’ve started in home decor and furniture, but are expanding to other verticals this year. We have a multimodal search engine and some image generation tools. Recently we built a neurosymbolic self-learning agent that overcomes the LLM hallucination problem, and that’s going to become the basis of our search engine as time goes on.

Fun side fact (since I suspect Julians are more likely than many to be philosophically inclined) — the name “Onton” came from a term I introduced in an unpublished paper a long time ago. My cofounder liked it so we used it when we rebranded.

We take what exists to be all and only metaphysical or ontological simples beyond which distinction is impossible. We term these “ontons”. Ontons may be the fields composing the fundamental particles of standard quantum mechanics … However, given the unending nature of the refinements made to empirical inquiry over the centuries and even over the past few decades, it is likely that those particles we consider fundamental are not truly fundamental after all. Thus, we define ontons as being whatever the most fundamental units of being actually are. Whether or not we have observed them directly, we cannot know.

3 Likes

That sounds super cool, and I have bookmarked Onton now. How do you avoid hallucinations? I didn’t know that was possible (albeit I am not an AI engineer).

1 Like