Calling AVX-512 intrinsics from Julia

I found an icelake-server machine and got this working after fixing two things in your original code:

  1. __m512i needs to be declared const when it is defined as a global.
  2. The intrinsic returns <64 x i1>, not i64, so an explicit bitcast to i64 is needed before returning.

julia> import Core.Intrinsics.llvmcall

julia> const __m512i = NTuple{64, VecElement{Int8}}
NTuple{64, VecElement{Int8}}

julia> vpshufbitqmb_512(a,b) = Core.Intrinsics.llvmcall(("""
       declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>)
       define i64 @i64_vpshufbitqmb_512(<64 x i8> %a, <64 x i8> %b) {
         %tmp = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %a, <64 x i8> %b)
         %tmp2 = bitcast <64 x i1> %tmp to i64
         ret i64 %tmp2
       }
       ""","i64_vpshufbitqmb_512"), Int64, Tuple{__m512i, __m512i}, a, b)
vpshufbitqmb_512 (generic function with 1 method)

julia> x = __m512i(ntuple(_ -> rand(Int8), 64));

julia> p = __m512i(ntuple(_ -> rand(Int8), 64));

julia> vpshufbitqmb_512(x,p)
-68453262247164531

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 56 × Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 112 virtual cores

julia> Base.BinaryPlatforms.CPUID.test_cpu_feature(Base.BinaryPlatforms.CPUID.JL_X86_avx512bitalg)
true
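
If you want to sanity-check the result without trusting the intrinsic, here is a rough pure-Julia reference, based on my reading of Intel's description of VPSHUFBITQMB: for each byte of the second operand, take its low 6 bits as a bit index into the corresponding 64-bit lane of the first operand. The name vpshufbitqmb_ref is made up for this sketch, and I'm assuming element 0 of the <64 x i1> ends up in the least significant bit of the i64 after the bitcast (little-endian).

```julia
# Hypothetical reference helper, not part of Base or any package.
# a: data bytes, b: control bytes (only the low 6 bits of each are used).
function vpshufbitqmb_ref(a::NTuple{64,Int8}, b::NTuple{64,Int8})
    mask = UInt64(0)
    for q in 0:7                       # eight 64-bit lanes
        # Assemble lane q of `a` as a little-endian UInt64.
        qword = UInt64(0)
        for j in 0:7
            qword |= UInt64(reinterpret(UInt8, a[8q + j + 1])) << (8j)
        end
        for j in 0:7                   # eight control bytes per lane
            idx = reinterpret(UInt8, b[8q + j + 1]) & 0x3f
            bit = (qword >> idx) & UInt64(1)
            mask |= bit << (8q + j)    # one mask bit per control byte
        end
    end
    return reinterpret(Int64, mask)    # same return type as the llvmcall wrapper
end

# Strip the VecElement wrappers so an __m512i value can be passed in.
unwrap(v) = map(e -> e.value, v)

# On a machine with avx512bitalg, I would expect this to hold:
# vpshufbitqmb_ref(unwrap(x), unwrap(p)) == vpshufbitqmb_512(x, p)
```

To read the mask itself, bitstring(vpshufbitqmb_512(x, p)) prints all 64 bits, which is easier to compare than the signed integer.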

I figured this out by looking at the following examples.

  1. llvm-project/llvm/test/CodeGen/X86/vpshufbitqbm-intrinsics.ll at 147a61618989b6cca1f5f77ed96f930620ff193f · JuliaLang/llvm-project · GitHub
  2. VectorizationBase.jl/src/llvm_intrin/intrin_funcs.jl at 9174dcca731144935e438d44ba07f4e4ec3a66c6 · JuliaSIMD/VectorizationBase.jl · GitHub