I found an icelake-server machine, and I sorted out a few things from your original code.
- `__m512i` needs to be `const` when used globally.
- The return type is `<64 x i1>` and not `i64`, so we need an explicit `bitcast`.
julia> import Core.Intrinsics.llvmcall
julia> const __m512i = NTuple{64, VecElement{Int8}}
NTuple{64, VecElement{Int8}}
julia> vpshufbitqmb_512(a,b) = Core.Intrinsics.llvmcall(("""
declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>)
define i64 @i64_vpshufbitqmb_512(<64 x i8> %a, <64 x i8> %b) {
%tmp = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %a, <64 x i8> %b)
%tmp2 = bitcast <64 x i1> %tmp to i64
ret i64 %tmp2
}
""","i64_vpshufbitqmb_512"), Int64, Tuple{__m512i, __m512i}, a, b)
vpshufbitqmb_512 (generic function with 1 method)
julia> x = __m512i(ntuple(_ -> rand(Int8), 64));
julia> p = __m512i(ntuple(_ -> rand(Int8), 64));
julia> vpshufbitqmb_512(x,p)
-68453262247164531
julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 56 × Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
Threads: 1 on 112 virtual cores
julia> Base.BinaryPlatforms.CPUID.test_cpu_feature(Base.BinaryPlatforms.CPUID.JL_X86_avx512bitalg)
true
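If you want to sanity-check the result on a machine without AVX512_BITALG, here is a rough pure-Julia reference. It is only a sketch of my understanding of the instruction: for each of the 8 quadwords of the first argument, the low 6 bits of each of the 8 matching bytes of the second argument select one bit of that quadword. I also assume the `bitcast` puts vector element 0 in the least significant bit of the `i64`, as on little-endian x86. The name `vpshufbitqmb_512_ref` is made up for this example.

```julia
const __m512i = NTuple{64, VecElement{Int8}}

# Hypothetical reference implementation (not part of any library).
# a: data, treated as 8 little-endian quadwords.
# b: 64 byte-sized bit indices; only the low 6 bits of each are used.
function vpshufbitqmb_512_ref(a::__m512i, b::__m512i)
    mask = UInt64(0)
    for i in 0:7
        # Assemble quadword i of `a` from its 8 bytes (little-endian).
        q = UInt64(0)
        for j in 0:7
            q |= UInt64(reinterpret(UInt8, a[8i + j + 1].value)) << (8j)
        end
        # Each index byte picks one bit out of that quadword.
        for j in 0:7
            m = reinterpret(UInt8, b[8i + j + 1].value) & 0x3f
            bit = (q >> m) & 0x01
            mask |= bit << (8i + j)
        end
    end
    # Return Int64 to match the llvmcall wrapper above.
    return reinterpret(Int64, mask)
end
```

On a machine that does support the instruction, `vpshufbitqmb_512_ref(x, p) == vpshufbitqmb_512(x, p)` should hold for random inputs if the assumptions above are right. As a quick check that needs no special hardware, an all-ones `a` gives `-1` (every mask bit set) regardless of `b`.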
I figured this out by looking at the following examples.