LLVM Cannot select: v16bf16 = insert_subvector on Float32→BFloat16 vector store (Julia 1.12.6, znver5)

cscherrer · April 18, 2026, 2:15pm

Hello all — is this a known LLVM codegen bug?

Minimal repro

using BFloat16s: BFloat16

function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end

f!(Vector{BFloat16}(undef, 16), rand(Float32, 16))

Output

LLVM ERROR (abridged):

LLVM ERROR: Cannot select: v16bf16 = insert_subvector <prev>, <v8bf16 from VFPROUND>, Constant:i64<8>
    v16bf16 = insert_subvector undef:v16bf16, <v8bf16 from VFPROUND>, Constant:i64<0>
      v8bf16 = X86ISD::VFPROUND <v8f32 from load>
    v8bf16 = X86ISD::VFPROUND <v8f32 from load>
  Constant:i64<8>
In function: julia_f!_...

Full DAG + stack trace:

LLVM ERROR: Cannot select: 0x3ddc45f0: v16bf16 = insert_subvector 0x3de48da0, 0x3ddc4f20, Constant:i64<8>, array.jl:991 @[ array.jl:986 @[ REPL[2]:3 ] ]
  0x3de48da0: v16bf16 = insert_subvector undef:v16bf16, 0x3ddc4430, Constant:i64<0>, array.jl:991 @[ array.jl:986 @[ REPL[2]:3 ] ]
    0x3ddc4b30: v16bf16 = undef
    0x3ddc4430: v8bf16 = X86ISD::VFPROUND 0x3de48a90, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
      0x3de48a90: v8f32 = insert_subvector undef:v8f32, 0x3de486a0, Constant:i64<0>, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
        0x3de481d0: v8f32 = undef
        0x3de486a0: v4f32,ch = load<(load (s128) from %ir.51, align 4, !tbaa !53, !alias.scope !56, !noalias !59)> 0x3cc2f6b0, 0x3de48710, undef:i64, essentials.jl:920 @[ REPL[2]:3 ]
          0x3de48710: i64 = add 0x3de48b00, Constant:i64<32>, essentials.jl:920 @[ REPL[2]:3 ]
            0x3de48b00: i64 = add 0x3ddc4510, 0x3de48c50, essentials.jl:920 @[ REPL[2]:3 ]
              0x3ddc4510: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %2, essentials.jl:920 @[ REPL[2]:3 ]
                0x3ddc4350: i64 = Register %2
              0x3de48c50: i64 = shl 0x3de48fd0, Constant:i8<2>, essentials.jl:920 @[ REPL[2]:3 ]
                0x3de48fd0: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %8, essentials.jl:920 @[ REPL[2]:3 ]
                  0x3ddc42e0: i64 = Register %8
                0x3de48160: i8 = Constant<2>
            0x3ddc4900: i64 = Constant<32>
          0x3de48320: i64 = undef
        0x3ddc51c0: i64 = Constant<0>
    0x3ddc51c0: i64 = Constant<0>
  0x3ddc4f20: v8bf16 = X86ISD::VFPROUND 0x3ddc4ba0, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
    0x3ddc4ba0: v8f32 = insert_subvector undef:v8f32, 0x3ddc49e0, Constant:i64<0>, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
      0x3de481d0: v8f32 = undef
      0x3ddc49e0: v4f32,ch = load<(load (s128) from %ir.55, align 4, !tbaa !53, !alias.scope !56, !noalias !59)> 0x3cc2f6b0, 0x3ddc43c0, undef:i64, essentials.jl:920 @[ REPL[2]:3 ]
        0x3ddc43c0: i64 = add 0x3de48b00, Constant:i64<48>, essentials.jl:920 @[ REPL[2]:3 ]
          0x3de48b00: i64 = add 0x3ddc4510, 0x3de48c50, essentials.jl:920 @[ REPL[2]:3 ]
            0x3ddc4510: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %2, essentials.jl:920 @[ REPL[2]:3 ]
              0x3ddc4350: i64 = Register %2
            0x3de48c50: i64 = shl 0x3de48fd0, Constant:i8<2>, essentials.jl:920 @[ REPL[2]:3 ]
              0x3de48fd0: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %8, essentials.jl:920 @[ REPL[2]:3 ]
                0x3ddc42e0: i64 = Register %8
              0x3de48160: i8 = Constant<2>
          0x3de48400: i64 = Constant<48>
        0x3de48320: i64 = undef
      0x3ddc51c0: i64 = Constant<0>
  0x3ddc50e0: i64 = Constant<8>
In function: julia_f!_595

[116013] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
unknown function (ip: 0x7f787a846a2c) at /usr/lib/libc.so.6
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
_ZN4llvm18report_fatal_errorERKNS_5TwineEb.cold at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel15CannotYetSelectEPNS_6SDNodeE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel6SelectEPN4llvm6SDNodeE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.0 at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
...
Allocations: 2782850 (Pool: 2779994; Big: 2856); GC: 5
fish: Job 1, 'julia' terminated by signal SIGABRT (Abort)

What I’ve checked

| Variant | Result |
|---|---|
| `--cpu-target=generic` | OK |
| `--cpu-target=znver3` | OK |
| `--cpu-target=znver4` | crash |
| `--cpu-target=znver5` | crash |
| `--cpu-target=sapphirerapids` | crash |
| Scalar `BFloat16(rand(Float32))` | OK |
| Same loop but `Float16` instead of `BFloat16` | OK |
| `dst[j] = BFloat16(src[j])` (explicit) | crash (same error) |

So: any AVX-512-BF16 native target + a vectorizable Float32 → BFloat16 conversion loop. Scalar is fine; emulated BF16 (znver3, generic) is fine; Float16 is fine.
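The CPU-target rows above can be re-checked mechanically. Below is a minimal sketch that runs the repro in a fresh child process per target, so an LLVM abort in the child does not kill the driver. The names `REPRO`, `TARGETS`, `run_target` and `sweep` are made up for this illustration, and it assumes BFloat16s is installed in the active project. It only distinguishes exit status 0 from everything else, so a missing package also shows up as "failed".

```julia
# Repro from the original post, passed to each child process via `-e`.
const REPRO = """
using BFloat16s: BFloat16
function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end
f!(Vector{BFloat16}(undef, 16), rand(Float32, 16))
"""

# CPU targets taken from the table in this post.
const TARGETS = ["generic", "znver3", "znver4", "znver5", "sapphirerapids"]

# Run `script` in a child Julia with the given --cpu-target.
# Returns true only if the child exits with status 0 (no abort, no error).
function run_target(target::AbstractString, script::AbstractString)
    julia = joinpath(Sys.BINDIR, Base.julia_exename())
    proj = Base.active_project()
    # Reuse the parent's project so the child can find BFloat16s (assumption).
    projflag = proj === nothing ? String[] : ["--project=$(dirname(proj))"]
    cmd = `$julia $projflag --startup-file=no --cpu-target=$target -e $script`
    return success(pipeline(cmd; stdout=devnull, stderr=devnull))
end

# Print one line per target: "OK" or "failed".
function sweep(script::AbstractString = REPRO; targets = TARGETS)
    for t in targets
        println(rpad(t, 16), run_target(t, script) ? "OK" : "failed")
    end
end

# Only run the sweep when this file is executed as a script.
if abspath(PROGRAM_FILE) == @__FILE__
    sweep()
end
```

Saved as a script and run with the same Julia binary, this prints one line per target. Using a child process per target is the point of the design: the `Cannot select` failure aborts the whole process, so it cannot be caught with `try`/`catch` in the same session.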

Diagnosis (verified against the LLVM 18.x sources)

The failing DAG is two X86ISD::VFPROUND halves concatenated into a v16bf16 via two insert_subvectors — one at offset 0, one at offset 8. Only the offset-0 side has a lowering pattern:

- Offset-0 inserts are handled by subvector_subreg_lowering in X86InstrVecCompiler.td. The multiclass is explicitly commented “Patterns for insert_subvector/extract_subvector to/from index=0” and its pattern hard-codes (iPTR 0). bf16 entries exist (lines 86, 99, 112), added by PR #83720 for #83358.
- Non-zero-offset inserts (i.e. the vinsertf128-style concat of two 128-bit halves into a 256/512-bit register) are handled by vinsert_for_size_lowering in X86InstrAVX512.td. f16 has three entries there (v8f16→v16f16, v8f16→v32f16, v16f16→v32f16); bf16 has zero.

So PR #83720 covered bf16 at offset 0, but the matching vinsert_for_size_lowering calls for bf16 were never added. The fix would be three lines mirroring the f16 entries at 495/502/509. (vextract_for_size_lowering at 796/811 for f16 also has no bf16 counterparts, so the extract-from-upper-half path is presumably broken too — I haven’t tripped it.)
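To look at what the vectorizer hands to the backend without tripping the abort, one option is to print the optimized LLVM IR instead of calling the function. This is a minimal sketch: it assumes `@code_llvm` stops at the IR level and never reaches X86 instruction selection, which has not been verified on an affected machine here. If the diagnosis holds, the output should contain a vector `fptrunc` from `float` to `bfloat` followed by a wide `bfloat` vector store, but the exact shape depends on the vectorizer.

```julia
using BFloat16s: BFloat16
using InteractiveUtils  # provides @code_llvm outside the REPL

function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end

dst = Vector{BFloat16}(undef, 16)
src = rand(Float32, 16)

# Prints the optimized IR for this method signature only; f! itself is
# never executed, so no native code for it should be emitted (assumption).
@code_llvm debuginfo=:none f!(dst, src)
```

Running this under `--cpu-target=znver3` and again under `--cpu-target=znver5` would make it possible to compare the emulated and native conversion paths side by side.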

Environment

Julia 1.12.6, LLVM 18.1.7
AMD Ryzen 9 9950X (znver5, AVX-512 + BF16)
BFloat16s v0.6.1

julia> versioninfo()
Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver5)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/cuda/lib64
  JULIA_NUM_THREADS = 16
  JULIA_PKG_USE_CLI_GIT = true

adienes · April 18, 2026, 2:22pm

Looks like JuliaLang/julia issue #61599, "LLVM ERROR: Cannot select" in LinearAlgebra/triangular2 tests on Apple Silicon M4?

cscherrer · April 18, 2026, 2:33pm

Thanks @adienes. Same error message, but it’s not clear whether it’s the same bug: that issue is on a different architecture, and I’m only seeing this with BFloat16.

Also, works fine in 1.13:

julia> using BFloat16s: BFloat16

julia> function f!(dst, src)
           for j in eachindex(dst)
               dst[j] = src[j]
           end
       end
f! (generic function with 1 method)

julia> f!(Vector{BFloat16}(undef, 16), rand(Float32, 16))

julia> versioninfo()
Julia Version 1.13.0-beta3
Commit 393f698a6cc (2026-03-14 13:54 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.8 (ORCJIT, znver5)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/cuda/lib64
  JULIA_NUM_THREADS = 16
  JULIA_PKG_USE_CLI_GIT = true

giordano · April 18, 2026, 5:20pm

That’s usually a good indication it’s an LLVM issue.
