Hello all — is this a known LLVM codegen bug?
Minimal repro
using BFloat16s: BFloat16
function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end
f!(Vector{BFloat16}(undef, 16), rand(Float32, 16))
Output
LLVM ERROR: Cannot select: v16bf16 = insert_subvector <prev>, <v8bf16 from VFPROUND>, Constant:i64<8>
v16bf16 = insert_subvector undef:v16bf16, <v8bf16 from VFPROUND>, Constant:i64<0>
v8bf16 = X86ISD::VFPROUND <v8f32 from load>
v8bf16 = X86ISD::VFPROUND <v8f32 from load>
Constant:i64<8>
In function: julia_f!_...
Full DAG + stack trace:
LLVM ERROR: Cannot select: 0x3ddc45f0: v16bf16 = insert_subvector 0x3de48da0, 0x3ddc4f20, Constant:i64<8>, array.jl:991 @[ array.jl:986 @[ REPL[2]:3 ] ]
0x3de48da0: v16bf16 = insert_subvector undef:v16bf16, 0x3ddc4430, Constant:i64<0>, array.jl:991 @[ array.jl:986 @[ REPL[2]:3 ] ]
0x3ddc4b30: v16bf16 = undef
0x3ddc4430: v8bf16 = X86ISD::VFPROUND 0x3de48a90, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
0x3de48a90: v8f32 = insert_subvector undef:v8f32, 0x3de486a0, Constant:i64<0>, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
0x3de481d0: v8f32 = undef
0x3de486a0: v4f32,ch = load<(load (s128) from %ir.51, align 4, !tbaa !53, !alias.scope !56, !noalias !59)> 0x3cc2f6b0, 0x3de48710, undef:i64, essentials.jl:920 @[ REPL[2]:3 ]
0x3de48710: i64 = add 0x3de48b00, Constant:i64<32>, essentials.jl:920 @[ REPL[2]:3 ]
0x3de48b00: i64 = add 0x3ddc4510, 0x3de48c50, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc4510: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %2, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc4350: i64 = Register %2
0x3de48c50: i64 = shl 0x3de48fd0, Constant:i8<2>, essentials.jl:920 @[ REPL[2]:3 ]
0x3de48fd0: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %8, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc42e0: i64 = Register %8
0x3de48160: i8 = Constant<2>
0x3ddc4900: i64 = Constant<32>
0x3de48320: i64 = undef
0x3ddc51c0: i64 = Constant<0>
0x3ddc51c0: i64 = Constant<0>
0x3ddc4f20: v8bf16 = X86ISD::VFPROUND 0x3ddc4ba0, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
0x3ddc4ba0: v8f32 = insert_subvector undef:v8f32, 0x3ddc49e0, Constant:i64<0>, /home/chad/.julia/packages/BFloat16s/lYUbX/src/bfloat16.jl:166 @[ REPL[2]:3 ]
0x3de481d0: v8f32 = undef
0x3ddc49e0: v4f32,ch = load<(load (s128) from %ir.55, align 4, !tbaa !53, !alias.scope !56, !noalias !59)> 0x3cc2f6b0, 0x3ddc43c0, undef:i64, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc43c0: i64 = add 0x3de48b00, Constant:i64<48>, essentials.jl:920 @[ REPL[2]:3 ]
0x3de48b00: i64 = add 0x3ddc4510, 0x3de48c50, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc4510: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %2, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc4350: i64 = Register %2
0x3de48c50: i64 = shl 0x3de48fd0, Constant:i8<2>, essentials.jl:920 @[ REPL[2]:3 ]
0x3de48fd0: i64,ch = CopyFromReg 0x3cc2f6b0, Register:i64 %8, essentials.jl:920 @[ REPL[2]:3 ]
0x3ddc42e0: i64 = Register %8
0x3de48160: i8 = Constant<2>
0x3de48400: i64 = Constant<48>
0x3de48320: i64 = undef
0x3ddc51c0: i64 = Constant<0>
0x3ddc50e0: i64 = Constant<8>
In function: julia_f!_595
[116013] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
unknown function (ip: 0x7f787a846a2c) at /usr/lib/libc.so.6
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
_ZN4llvm18report_fatal_errorERKNS_5TwineEb.cold at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel15CannotYetSelectEPNS_6SDNodeE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel16SelectCodeCommonEPNS_6SDNodeEPKhj at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel6SelectEPN4llvm6SDNodeE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel22DoInstructionSelectionEv at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.0 at /home/chad/.julia/juliaup/julia-1.12.6+0.x64.linux.gnu/bin/../lib/julia/libLLVM.so.18.1jl (unknown line)
...
Allocations: 2782850 (Pool: 2779994; Big: 2856); GC: 5
fish: Job 1, 'julia' terminated by signal SIGABRT (Abort)
What I’ve checked
| Variant | Result |
|---|---|
| --cpu-target=generic | OK |
| --cpu-target=znver3 | OK |
| --cpu-target=znver4 | crash |
| --cpu-target=znver5 | crash |
| --cpu-target=sapphirerapids | crash |
| Scalar BFloat16(rand(Float32)) | OK |
| Same loop but Float16 instead of BFloat16 | OK |
| dst[j] = BFloat16(src[j]) (explicit) | crash (same error) |
So the trigger is any AVX-512-BF16 native target combined with a vectorizable Float32 → BFloat16 conversion loop. Scalar conversion is fine; emulated BF16 (znver3, generic) is fine; Float16 is fine.
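For anyone who wants to re-run the CPU-target rows of the table on their own machine, here is a rough sketch of a harness. It is not what produced the results above. It assumes BFloat16s is installed in the currently active project, and it treats any non-zero exit (including the SIGABRT from LLVM) as a crash, so a missing package would also show up as "crash".

```julia
# Hypothetical harness: runs the repro in a fresh Julia process per CPU target.
# Assumes BFloat16s is available in the active project.
const REPRO = """
using BFloat16s: BFloat16
function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end
f!(Vector{BFloat16}(undef, 16), rand(Float32, 16))
"""

# Same targets as in the table above.
targets = ["generic", "znver3", "znver4", "znver5", "sapphirerapids"]

julia_exe = joinpath(Sys.BINDIR, Base.julia_exename())
project = dirname(Base.active_project())

for target in targets
    cmd = `$julia_exe --startup-file=no --project=$project --cpu-target=$target -e $REPRO`
    # success() is false for any non-zero exit or a fatal signal such as SIGABRT.
    ok = success(pipeline(cmd; stdout=devnull, stderr=devnull))
    println(rpad(target, 16), ok ? "OK" : "crash")
end
```

Each target runs in its own process because the abort takes the whole session down, so the variants cannot be tried one after another in a single REPL.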
Diagnosis (verified against the LLVM 18.x sources)
The failing DAG is two X86ISD::VFPROUND halves concatenated into a v16bf16 via two insert_subvectors — one at offset 0, one at offset 8. Only the offset-0 side has a lowering pattern:
- Offset-0 inserts are handled by subvector_subreg_lowering in X86InstrVecCompiler.td. The multiclass is explicitly commented “Patterns for insert_subvector/extract_subvector to/from index=0” and its pattern hard-codes (iPTR 0). bf16 entries exist (lines 86, 99, 112), added by PR #83720 for #83358.
- Non-zero-offset inserts (i.e. the vinsertf128-style concat of two 128-bit halves into a 256/512-bit register) are handled by vinsert_for_size_lowering in X86InstrAVX512.td. f16 has three entries there (v8f16→v16f16, v8f16→v32f16, v16f16→v32f16); bf16 has zero.
So PR #83720 covered bf16 at offset 0, but the matching vinsert_for_size_lowering calls for bf16 were never added. The fix would be three lines mirroring the f16 entries at 495/502/509. (vextract_for_size_lowering at 796/811 for f16 also has no bf16 counterparts, so the extract-from-upper-half path is presumably broken too — I haven’t tripped it.)
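To see what the vectorizer hands to the backend before instruction selection fails, the optimized LLVM IR for the repro can be printed instead of running it. This is only a sketch: it assumes code_llvm stops at the IR level and so should not reach the abort, and it needs to be run in a session started with one of the crashing targets (for example --cpu-target=znver4) for the IR to reflect the BF16-capable path.

```julia
using InteractiveUtils: code_llvm
using BFloat16s: BFloat16

# Same function as in the minimal repro.
function f!(dst, src)
    for j in eachindex(dst)
        dst[j] = src[j]
    end
end

# Print the optimized IR without debug info. This only generates IR;
# it does not call f!, so no machine code for it is requested here.
code_llvm(stdout, f!, (Vector{BFloat16}, Vector{Float32}); debuginfo=:none)
```

If the diagnosis above is right, the vector float to bfloat truncations in the loop body are what later become the two VFPROUND halves; calling code_native on the same signature would be expected to hit the same abort as running the function.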
Environment
Julia 1.12.6, LLVM 18.1.7
AMD Ryzen 9 9950X (znver5, AVX-512 + BF16)
BFloat16s v0.6.1
julia> versioninfo()
Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × AMD Ryzen 9 9950X 16-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, znver5)
GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 32 virtual cores)
Environment:
LD_LIBRARY_PATH = /opt/cuda/lib64
JULIA_NUM_THREADS = 16
JULIA_PKG_USE_CLI_GIT = true