Datatypes SIMD by default? (as in Mojo)

Hello all! Super excited to use Julia. It won out in a lengthy evaluation of mine against Rust, “modern” C++ (>= C++23), modern Java (>= Java 24), Mojo, etc. as the basis of my startup’s backend. Thank you to all who have made Julia and its community what they are!

One curiosity on my mind is whether Julia might benefit from its primitive datatypes being SIMD by default, as in Mojo (e.g. Int64 actually being Vec{1,Int64}). Perhaps loop autovectorization + SIMD.jl render irrelevant any purported benefits of SIMD-by-default datatypes. But since I’m by no means a Julia expert yet, I figured I’d ask!
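For concreteness, here’s a rough sketch of the Mojo model in Julia-flavored code. This Vec wrapper is hypothetical, purely to illustrate the idea that a scalar would just be the width-1 case of a general vector type:

# Hypothetical sketch, not real Base Julia: in Mojo's model every scalar
# is the width-1 special case of a general SIMD vector type.
struct Vec{N,T}
    data::NTuple{N,VecElement{T}}
end

const ScalarInt64 = Vec{1,Int64}  # "Int64" would really be a 1-wide vector

# scalars and wider vectors would then share one elementwise definition
Base.:+(a::Vec{N,T}, b::Vec{N,T}) where {N,T} =
    Vec{N,T}(ntuple(i -> VecElement(a.data[i].value + b.data[i].value), Val(N)))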

13 Likes

That’s cool! Best of luck with your startup, what sort of problem area are you planning on operating in?

For now you’ll likely continue to need to use SIMD.jl and friends, but there is some work on the backburner to make SIMD with Base types work better: Add support for intrinsics for `NTuple{VecElement}` by oscardssmith · Pull Request #55118 · JuliaLang/julia · GitHub
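A minimal SIMD.jl sketch of the explicit style that implies, with illustrative values:

julia> using SIMD

julia> v = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0));  # one <4 x double> value

julia> w = v + v;  # a single elementwise vector add

julia> sum(w)  # horizontal reduction
20.0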

3 Likes

I’m not sure I understand what you’re asking. What does it mean for Int to be a 1-element SIMD vector by default rather than just a scalar primitive value? That is, what does that enable or make easier?

1 Like

It’s worth noting that you don’t even need to use SIMD.jl or @simd or even a for loop to get vectorization in many cases. Often just having elements in a tuple or array is enough. For example, adding 4 Ints in a tuple on my machine generates <4 x i64> LLVM IR and a pair of .2d ARM vector instructions.

julia> f(x) = (x[1]+x[2])+(x[3]+x[4])
f (generic function with 1 method)

julia> @code_llvm debuginfo=:none f((1,2,3,4))
; Function Signature: f(NTuple{4, Int64})
define i64 @julia_f_1604(ptr nocapture noundef nonnull readonly align 8 dereferenceable(32) %"x::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x i64>, ptr %"x::Tuple", align 8
  %1 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %0)
  ret i64 %1
}

Or to take the example from julia#55118, even without the explicit VecElement:

julia> f(a, b) = a .* b
f (generic function with 1 method)

julia> a = Float32.(sqrt.(((1:4)...,)))
(1.0f0, 1.4142135f0, 1.7320508f0, 2.0f0)

julia> @code_llvm debuginfo=:none f(a,a)
; Function Signature: f(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_f_1783(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x float>, ptr %"a::Tuple", align 4
  %1 = load <4 x float>, ptr %"b::Tuple", align 4
  %2 = fmul <4 x float> %0, %1
  store <4 x float> %2, ptr %sret_return, align 4
  ret void
}

There’s nothing special about broadcasting; primitive recursion (or explicit indexing) also gets this treatment:

julia> g(a, b) = (a[1]*b[1], g(a[2:end], b[2:end])...)
g (generic function with 1 method)

julia> g(::Tuple{}, ::Tuple{}) = ()
g (generic function with 2 methods)

julia> @code_llvm debuginfo=:none g(a,a)
; Function Signature: g(NTuple{4, Float32}, NTuple{4, Float32})
define void @julia_g_2583(ptr noalias nocapture noundef nonnull sret([4 x float]) align 4 dereferenceable(16) %sret_return, ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"a::Tuple", ptr nocapture noundef nonnull readonly align 4 dereferenceable(16) %"b::Tuple") local_unnamed_addr #0 {
top:
  %0 = load <4 x float>, ptr %"a::Tuple", align 4
  %1 = load <4 x float>, ptr %"b::Tuple", align 4
  %2 = fmul <4 x float> %0, %1
  store <4 x float> %2, ptr %sret_return, align 4
  ret void
}

It can be fiddly in some cases, though; that’s when you’d want to reach for a VecElement of a given width. I’m not a compiler person, but I’d wager the difficulties that prevent a non-VecElement tuple from SIMD’ing would be exactly the same as the difficulties in coalescing multiple 1-element SIMD-able types into an N-way SIMD.
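For completeness, here’s what explicitly reaching for VecElement looks like with Base.llvmcall; this vmul helper is just an illustration of the pattern (SIMD.jl wraps this sort of thing for you):

const Vec4 = NTuple{4,VecElement{Float64}}

# hand-written 4-wide multiply; with two arguments (%0 and %1) the entry
# block implicitly takes %2, so the first instruction is %3
vmul(a::Vec4, b::Vec4) = Base.llvmcall(
    """
    %3 = fmul <4 x double> %0, %1
    ret <4 x double> %3
    """,
    Vec4, Tuple{Vec4,Vec4}, a, b)

a = map(VecElement, (1.0, 2.0, 3.0, 4.0))
vmul(a, a)  # compiles to a single fmul <4 x double>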

8 Likes

I’m a little confused. When I do this

julia> x = (1,2,3,4);

julia> @code_llvm debuginfo=:none sum(x)
; Function Signature: sum(NTuple{4, Int64})
define i64 @julia_sum_4537(ptr nocapture noundef nonnull readonly align 8 dereferenceable(32) %"a::Tuple") #0 {
top:
  %0 = load <4 x i64>, ptr %"a::Tuple", align 8
  %1 = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %0)
  ret i64 %1
}

I get what you got. But if I use a vector instead of a tuple, I do not.
What’s happening?

julia> y = [1,2,3,4];

julia> @code_llvm sum(y)
; Function Signature: sum(Array{Int64, 1})
;  @ reducedim.jl:979 within `sum`
define i64 @julia_sum_4474(ptr noundef nonnull align 8 dereferenceable(24) %"a::Array") #0 {
top:
; ┌ @ reducedim.jl:979 within `#sum#734`
; │┌ @ reducedim.jl:983 within `_sum`
; ││┌ @ reducedim.jl:983 within `#_sum#736`
; │││┌ @ reducedim.jl:984 within `_sum`
; ││││┌ @ reducedim.jl:984 within `#_sum#737`
; │││││┌ @ reducedim.jl:326 within `mapreduce`
; ││││││┌ @ reducedim.jl:326 within `#mapreduce#727`
; │││││││┌ @ reducedim.jl:334 within `_mapreduce_dim`
; ││││││││┌ @ reduce.jl:418 within `_mapreduce`
; │││││││││┌ @ indices.jl:536 within `LinearIndices`
; ││││││││││┌ @ abstractarray.jl:98 within `axes`
; │││││││││││┌ @ array.jl:194 within `size`
              %"a::Array.size_ptr" = getelementptr inbounds i8, ptr %"a::Array", i64 16
              %"a::Array.size.0.copyload" = load i64, ptr %"a::Array.size_ptr", align 8
; │││││││││└└└
; │││││││││ @ reduce.jl:420 within `_mapreduce`
           switch i64 %"a::Array.size.0.copyload", label %L28 [
    i64 0, label %L104
    i64 1, label %L23
  ]

L23:                                              ; preds = %top
; │││││││││ @ reduce.jl:423 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data = load ptr, ptr %"a::Array", align 8
            %0 = load i64, ptr %memoryref_data, align 8
; │││││││││└
; │││││││││ @ reduce.jl:424 within `_mapreduce`
           br label %L104

L28:                                              ; preds = %top
; │││││││││ @ reduce.jl:425 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %1 = icmp sgt i64 %"a::Array.size.0.copyload", 15
; │││││││││└
           br i1 %1, label %L93, label %L44

L44:                                              ; preds = %L28
; │││││││││ @ reduce.jl:427 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data10 = load ptr, ptr %"a::Array", align 8
            %2 = load i64, ptr %memoryref_data10, align 8
; │││││││││└
; │││││││││ @ reduce.jl:428 within `_mapreduce`
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data27 = getelementptr inbounds i64, ptr %memoryref_data10, i64 1
            %3 = load i64, ptr %memoryref_data27, align 8
; │││││││││└
; │││││││││ @ reduce.jl:429 within `_mapreduce`
; │││││││││┌ @ reduce.jl:19 within `add_sum`
; ││││││││││┌ @ int.jl:87 within `+`
             %4 = add i64 %3, %2
; │││││││││└└
; │││││││││ @ reduce.jl:430 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %.not5455 = icmp sgt i64 %"a::Array.size.0.copyload", 2
; │││││││││└
           br i1 %.not5455, label %L71, label %L104

L71:                                              ; preds = %L71, %L44
           %value_phi2957 = phi i64 [ %5, %L71 ], [ 2, %L44 ]
           %value_phi2856 = phi i64 [ %7, %L71 ], [ %4, %L44 ]
; │││││││││ @ reduce.jl:431 within `_mapreduce`
; │││││││││┌ @ int.jl:87 within `+`
            %5 = add nuw nsw i64 %value_phi2957, 1
; │││││││││└
; │││││││││┌ @ essentials.jl:920 within `getindex`
            %memoryref_data41 = getelementptr inbounds i64, ptr %memoryref_data10, i64 %value_phi2957
            %6 = load i64, ptr %memoryref_data41, align 8
; │││││││││└
; │││││││││ @ reduce.jl:432 within `_mapreduce`
; │││││││││┌ @ reduce.jl:19 within `add_sum`
; ││││││││││┌ @ int.jl:87 within `+`
             %7 = add i64 %6, %value_phi2856
; │││││││││└└
; │││││││││ @ reduce.jl:430 within `_mapreduce`
; │││││││││┌ @ int.jl:83 within `<`
            %exitcond.not = icmp eq i64 %5, %"a::Array.size.0.copyload"
; │││││││││└
           br i1 %exitcond.not, label %L104, label %L71

L93:                                              ; preds = %L28
; │││││││││ @ reduce.jl:436 within `_mapreduce`
; │││││││││┌ @ reduce.jl:269 within `mapreduce_impl`
            %8 = call i64 @j_mapreduce_impl_4477(ptr nonnull %"a::Array", i64 signext 1, i64 signext %"a::Array.size.0.copyload", i64 signext 1024)
; │││││││││└
           br label %L104

L104:                                             ; preds = %L93, %L71, %L44, %L23, %top
           %value_phi = phi i64 [ %0, %L23 ], [ %8, %L93 ], [ 0, %top ], [ %4, %L44 ], [ %7, %L71 ]
; └└└└└└└└└
  ret i64 %value_phi
}

It’s hidden down in lower-level calls, like this line from the IR above:

  %8 = call i64 @j_mapreduce_impl_4477(ptr nonnull %"a::Array", i64 signext 1, i64 signext %"a::Array.size.0.copyload", i64 signext 1024)

The vectorized loop lives inside that j_mapreduce_impl function (note the 1024 blksize for pairwise reduction), so it doesn’t show up in this top-level listing.

An easier way to see this happen on a Vector would be to look at a regular loop instead:

julia> code_llvm(Tuple{Vector{Int}}; debuginfo=:none) do v
           s = 0
           for x in v
               s += x
           end
           s
       end
; Function Signature: var"#7"(Array{Int64, 1})
define i64 @"julia_#7_26406"(ptr noundef nonnull align 8 dereferenceable(24) %"v::Array") #0 {
top:
  %0 = getelementptr inbounds i8, ptr %"v::Array", i64 16
  %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
  %.not.not = icmp eq i64 %.size.sroa.0.0.copyload, 0
  br i1 %.not.not, label %L75, label %L34

L34:                                              ; preds = %top
  %1 = load ptr, ptr %"v::Array", align 8
  %2 = load i64, ptr %1, align 8
  %.not.not54.not59.not = icmp eq i64 %.size.sroa.0.0.copyload, 1
  br i1 %.not.not54.not59.not, label %L75, label %L69.preheader

L69.preheader:                                    ; preds = %L34
  %3 = add i64 %.size.sroa.0.0.copyload, -1
  %min.iters.check = icmp ult i64 %.size.sroa.0.0.copyload, 33
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L69.preheader
  %n.vec = and i64 %3, -32
  %ind.end = or i64 %n.vec, 1
  %ind.end62 = or i64 %n.vec, 2
  %4 = insertelement <8 x i64> <i64 poison, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0>, i64 %2, i64 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i64> [ %4, %vector.ph ], [ %9, %vector.body ]
  %vec.phi64 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %10, %vector.body ]
  %vec.phi65 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %11, %vector.body ]
  %vec.phi66 = phi <8 x i64> [ zeroinitializer, %vector.ph ], [ %12, %vector.body ]
  %offset.idx = or i64 %index, 1
  %5 = getelementptr inbounds i64, ptr %1, i64 %offset.idx
  %wide.load = load <8 x i64>, ptr %5, align 8
  %6 = getelementptr inbounds i64, ptr %5, i64 8
  %wide.load67 = load <8 x i64>, ptr %6, align 8
  %7 = getelementptr inbounds i64, ptr %5, i64 16
  %wide.load68 = load <8 x i64>, ptr %7, align 8
  %8 = getelementptr inbounds i64, ptr %5, i64 24
  %wide.load69 = load <8 x i64>, ptr %8, align 8
  %9 = add <8 x i64> %vec.phi, %wide.load
  %10 = add <8 x i64> %vec.phi64, %wide.load67
  %11 = add <8 x i64> %vec.phi65, %wide.load68
  %12 = add <8 x i64> %vec.phi66, %wide.load69
  %index.next = add nuw i64 %index, 32
  %13 = icmp eq i64 %index.next, %n.vec
  br i1 %13, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
  %bin.rdx = add <8 x i64> %10, %9
  %bin.rdx70 = add <8 x i64> %11, %bin.rdx
  %bin.rdx71 = add <8 x i64> %12, %bin.rdx70
  %14 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %bin.rdx71)
  %cmp.n = icmp eq i64 %3, %n.vec
  br i1 %cmp.n, label %L75, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L69.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L69.preheader ]
  %bc.resume.val63 = phi i64 [ %ind.end62, %middle.block ], [ 2, %L69.preheader ]
  %bc.merge.rdx = phi i64 [ %14, %middle.block ], [ %2, %L69.preheader ]
  br label %L69

L69:                                              ; preds = %L69, %scalar.ph
  %15 = phi i64 [ %value_phi1660, %L69 ], [ %bc.resume.val, %scalar.ph ]
  %16 = phi i64 [ %20, %L69 ], [ %bc.merge.rdx, %scalar.ph ]
  %value_phi1660 = phi i64 [ %19, %L69 ], [ %bc.resume.val63, %scalar.ph ]
  %17 = getelementptr inbounds i64, ptr %1, i64 %15
  %18 = load i64, ptr %17, align 8
  %19 = add i64 %value_phi1660, 1
  %20 = add i64 %16, %18
  %exitcond.not = icmp eq i64 %value_phi1660, %.size.sroa.0.0.copyload
  br i1 %exitcond.not, label %L75, label %L69

L75:                                              ; preds = %L69, %middle.block, %L34, %top
  %value_phi46 = phi i64 [ 0, %top ], [ %2, %L34 ], [ %14, %middle.block ], [ %20, %L69 ]
  ret i64 %value_phi46
}

1 Like

Looks like a scalar value really is a SIMD vector with 1 semantic element. So what happens when it’s a field in a struct? Does it just take up 4x the space?

Why 4x? Why not 2x, 8x, 16x?

Anyway, no, it doesn’t take extra space.
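A quick way to check, using Base’s VecElement as a stand-in for a 1-wide vector field (a sketch, assuming that’s a fair model of the layout in question):

julia> struct Scalar
           x::Int64
       end

julia> struct OneWide
           x::VecElement{Int64}  # a 1-element "vector" field
       end

julia> sizeof(Scalar), sizeof(OneWide)  # no padding out to a register width
(8, 8)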

Auto-vectorization is unreliable.
I made a comparison between ForwardDiff.jl and a C++ forward-mode AD library here; the C++ library’s more efficient memory management plus explicit SIMD were the major contributors to the Julia implementation performing worse:

Mojo takes the approach of no autovectorization: it disables those compiler passes. You instead need to be explicit, e.g. using x_simd = x.vectorize[1, simd_width]() instead of x, or using functional tools that map over ranges and evaluate with SIMD.
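The rough Julia analogue of that explicit style is SIMD.jl. A minimal sketch, assuming the length is a multiple of the vector width (no remainder handling):

using SIMD

function vsum(xs::Vector{Float64})
    @assert length(xs) % 4 == 0  # sketch only: no tail loop
    acc = Vec{4,Float64}(0.0)
    @inbounds for i in 1:4:length(xs)
        acc += vload(Vec{4,Float64}, xs, i)  # explicit 4-wide load
    end
    return sum(acc)  # one horizontal reduction at the end
end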

For peak performance, you need expertise + control. Mojo is currently focused on implementing algorithms like matrix multiply and flash attention with state-of-the-art performance, so this focus makes sense.

For most code that most people write, it isn’t worth the time/effort to manually vectorize, tile, and thread, even if they have the expertise to know how to optimize it.
So relying on auto-vectorization is reasonable for a lot of code, IMO; just not for things in the matmul class, where Mojo is still focused.

All views here are my own only and don’t express anyone else’s opinions/intentions/etc.; they are merely my perspective.

12 Likes

Great discussion! Love the thoughtfulness and engagement of the community here.

In my mind, that’s what Mojo’s “SIMD by default” approach intends to overcome. But I haven’t dissected Mojo internals to really know.

By this I assume you’re implying that loop autovectorization would not actually be easier with SIMD-by-default primitives. I think that’s quite likely. My experience is primarily on the JVM, where it’s very easy to get off the autovectorization happy path. I assume any compiler — JVM, LLVM, or otherwise — is going to have this challenge.
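For instance, if I understand correctly, even a plain Float64 reduction in Julia stays scalar unless you grant the compiler permission to reassociate, which is exactly the kind of happy-path cliff I mean; a sketch:

# LLVM must preserve the sequential association of floating-point adds,
# so this loop stays scalar:
function sum_scalar(xs::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(xs)
        s += xs[i]
    end
    return s
end

# @simd explicitly permits reassociation, letting the loop vectorize:
function sum_simd(xs::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(xs)
        s += xs[i]
    end
    return s
end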

Having skimmed the blog post @Elrod linked and the associated PR (WIP: use SIMD.jl for partials by KristofferC · Pull Request #570 · JuliaDiff/ForwardDiff.jl · GitHub), I’m assuming that the following means that explicit SIMD.jl control made the Julia implementation go faster than whatever autovectorization was happening:

On the JVM, SIMD code is very explicit, very verbose (not as much as hand-rolled assembly of course, but nothing as breezy as the multiple-dispatch-enabled SIMD.jl API), and looks very different from scalar operations. What I’m gathering from these posts is that both the ergonomics and the performance of SIMD code in Julia are already good, and that SIMD-by-default datatypes wouldn’t affect this either way.
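To make the PR’s idea concrete for myself, here’s roughly how I picture it: the dual-number partials live in a SIMD.jl Vec, so their arithmetic is explicitly vectorized. A simplified sketch, not ForwardDiff’s actual definitions:

using SIMD

struct MyDual{N,T}
    value::T
    partials::Vec{N,T}  # all N partial derivatives in one SIMD register
end

# product rule: the partials arithmetic lowers to vector instructions
Base.:*(a::MyDual{N,T}, b::MyDual{N,T}) where {N,T} =
    MyDual(a.value * b.value, a.value * b.partials + b.value * a.partials)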

That said, maybe my assessment is off, given this information:

I guess arithmetically the strategy is: LLVM - Vectorization = LLM + Mojo Vectorization :-).

Being serious, given that there are enough examples of such Mojo code, this strategy can have a lot of synergy with the rise of LLMs as co-developers.
Modular has released many of their kernels as open source lately. One of them is said to be the world’s fastest GEMM. It is a great start for that “material” data.

Thanks @tecosaur! It’s called Onton (onton.com). We’re helping people make purchase decisions they feel great about. We’ve started in home decor and furniture, but are expanding to other verticals this year. We have a multimodal search engine and some image generation tools. Recently we built a neurosymbolic self-learning agent that overcomes the LLM hallucination problem, and that’s going to become the basis of our search engine as time goes on.

Fun side fact (since I suspect Julians are more likely than many to be philosophically inclined) — the name “Onton” came from a term I introduced in an unpublished paper a long time ago. My cofounder liked it so we used it when we rebranded.

We take what exists to be all and only metaphysical or ontological simples beyond which distinction is impossible. We term these “ontons”. Ontons may be the fields composing the fundamental particles of standard quantum mechanics … However, given the unending nature of the refinements made to empirical inquiry over the centuries and even over the past few decades, it is likely that those particles we consider fundamental are not truly fundamental after all. Thus, we define ontons as being whatever the most fundamental units of being actually are. Whether or not we have observed them directly, we cannot know.

3 Likes

That sounds super cool, and I have bookmarked Onton now. How do you avoid hallucinations? I didn’t know that was possible (albeit I am not an AI engineer).

1 Like