Calling AVX-512 intrinsics from Julia

I am having trouble calling some AVX-512 intrinsics from Julia, coming directly from this post.

The example on the blog post compiles and runs fine on the CPU, since it has the avx512_bitalg CPU flag.
The problematic instruction in question is _mm512_bitshuffle_epi64_mask. Using godbolt, I extract the corresponding LLVM name from the line

  %7 = tail call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %6, <64 x i8> %4), !dbg !377

but the following fails:

__m512i = NTuple{64, VecElement{Int8}}

x = __m512i(ntuple(_ -> rand(Int8), 64))
p = __m512i(ntuple(_ -> rand(Int8), 64))
ccall("llvm.x86.avx512.vpshufbitqmb.512", llvmcall, Int64, (__m512i, __m512i), x, p)

with ERROR: llvmcall only supports intrinsic calls.
Notice that the intrinsic returns a <64 x i1>, and I am hoping there is some casting to Int64 happening implicitly.

Trying to write it down explicitly as

using SIMD

function _test(x, p)
    __m512i = SIMD.LVec{64, Int8}

    return Base.llvmcall("""
        %3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
        %4 = bitcast <64 x i1> %3 to i64
         ret i64 %4
    """, Int64, Tuple{__m512i,__m512i}, x, p)
end

also fails with a different error

ERROR: Failed to parse LLVM assembly:
<string>:3:21: error: use of undefined value '@llvm.x86.avx512.vpshufbitqmb.512'
%3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
                    ^

It seems that the intrinsic is not recognized by LLVM. Is there a way to check if the intrinsic is available?

I am running Julia 1.9.0 with LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server).

Here is a C++ function calling the intrinsic via Godbolt with emit llvm.

Here us the generated IR.

; Function Attrs: argmemonly mustprogress nofree nosync nounwind readonly willreturn uwtable
define dso_local noundef i64 @bit_shuffle(unsigned long, unsigned char*)(i64 noundef %0, ptr nocapture noundef readonly %1) local_unnamed_addr #0 !dbg !358 {
  call void @llvm.dbg.value(metadata i64 %0, metadata !364, metadata !DIExpression()), !dbg !368
  call void @llvm.dbg.value(metadata ptr %1, metadata !365, metadata !DIExpression()), !dbg !368
  %3 = insertelement <8 x i64> undef, i64 %0, i64 0, !dbg !369
  call void @llvm.dbg.value(metadata <8 x i64> undef, metadata !366, metadata !DIExpression()), !dbg !368
  %4 = load <64 x i8>, ptr %1, align 1, !dbg !370
  %5 = bitcast <8 x i64> %3 to <64 x i8>, !dbg !374
  %6 = shufflevector <64 x i8> %5, <64 x i8> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !374
  %7 = tail call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %6, <64 x i8> %4), !dbg !374
  %8 = bitcast <64 x i1> %7 to i64, !dbg !374
  call void @llvm.dbg.value(metadata i64 %8, metadata !367, metadata !DIExpression()), !dbg !368
  ret i64 %8, !dbg !375
}

; Function Attrs: nofree nosync nounwind readnone
declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #1

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare void @llvm.dbg.value(metadata, metadata, metadata) #2

attributes #0 = { argmemonly mustprogress nofree nosync nounwind readonly willreturn uwtable "frame-pointer"="none" "min-legal-vector-width"="512" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="icelake-server" "target-features"="+adx,+aes,+avx,+avx2,+avx512bitalg,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512ifma,+avx512vbmi,+avx512vbmi2,+avx512vl,+avx512vnni,+avx512vpopcntdq,+bmi,+bmi2,+clflushopt,+clwb,+crc32,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+gfni,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pconfig,+pku,+popcnt,+prfchw,+rdpid,+rdrnd,+rdseed,+sahf,+sgx,+sha,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+vaes,+vpclmulqdq,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" }
attributes #1 = { nofree nosync nounwind readnone }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

Perhaps you are missing the external declaration

declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #1

Thanks for your reply :slight_smile: .

I know nothing about LLVM IR, but it seems to me that declares should be outside of the function definition.
I tried

using SIMD

function _test(x, p)
    __m512i = SIMD.LVec{64, Int8}

    return Base.llvmcall("""
        %3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
        %4 = bitcast <64 x i1> %3 to i64
         ret i64 %4
         declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #
    """, Int64, Tuple{__m512i,__m512i}, x, p)
end

__m512i = NTuple{64, VecElement{Int8}}
x = __m512i(ntuple(_ -> rand(Int8), 64))
p = __m512i(ntuple(_ -> rand(Int8), 64))

_test(x, p)

and indeed I get a

<string>:6:27: error: expected instruction opcode
                          declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #

It should be possible to call just the intrinsic, as described here.

For now I am compiling a C code and calling it from julia, but this approach has the disadvantage that the function call does not get inlined.

The declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) creates a global identifier.

LLVM identifiers come in two basic types: global and local. Global identifiers (functions, global variables) begin with the '@' character.

I’m guessing the LLVM requires the declaration but then references an external implementation somewhere.

Perhaps @kristoffer.carlsson , the author of that blog, would be of more help. @Elrod has more experience with SIMD with packages such as VectorizationBase.jl and LoopVectorization.jl.

2 Likes

VectorizationBase goes through a function to still support the old llvmcall API, so it is messier than necessary.
But you can see plenty of examples of it using avx512 specific intrinsics, e.g.:
VectorizationBase.jl/src/llvm_intrin/intrin_funcs.jl at 9174dcca731144935e438d44ba07f4e4ec3a66c6 · JuliaSIMD/VectorizationBase.jl · GitHub

1 Like

I found an icelake-server machine, and I sorted out a few things from your original code.

  1. __m512i needs to be const when used globally
  2. The return type is <64 x i1> and not i64 so we need an explicit bitcast.
julia> import Core.Intrinsics.llvmcall

julia> const __m512i = NTuple{64, VecElement{Int8}}
NTuple{64, VecElement{Int8}}

julia> vpshufbitqmb_512(a,b) = Core.Intrinsics.llvmcall(("""
       declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>)
       define i64 @i64_vpshufbitqmb_512(<64 x i8> %a, <64 x i8> %b) {
         %tmp = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %a, <64 x i8> %b)
         %tmp2 = bitcast <64 x i1> %tmp to i64
         ret i64 %tmp2
       }
       ""","i64_vpshufbitqmb_512"), Int64, Tuple{__m512i, __m512i}, a, b)
vpshufbitqmb_512 (generic function with 1 method)

julia> x = __m512i(ntuple(_ -> rand(Int8), 64));

julia> p = __m512i(ntuple(_ -> rand(Int8), 64));

julia> vpshufbitqmb_512(x,p)
-68453262247164531

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 56 × Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 112 virtual cores

julia> Base.BinaryPlatforms.CPUID.test_cpu_feature(Base.BinaryPlatforms.CPUID.JL_X86_avx512bitalg)
true

I figured this out by looking at the following examples.

  1. llvm-project/llvm/test/CodeGen/X86/vpshufbitqbm-intrinsics.ll at 147a61618989b6cca1f5f77ed96f930620ff193f · JuliaLang/llvm-project · GitHub
  2. VectorizationBase.jl/src/llvm_intrin/intrin_funcs.jl at 9174dcca731144935e438d44ba07f4e4ec3a66c6 · JuliaSIMD/VectorizationBase.jl · GitHub
4 Likes

Wow that’s exactly what I was looking for.
Thank you so much!