Calling AVX-512 intrinsics from Julia

giacomogiudice · July 2, 2023, 12:47pm

I am having trouble calling some AVX-512 intrinsics from Julia, coming directly from this post.

The example on the blog post compiles and runs fine on the CPU, since it has the avx512_bitalg CPU flag.
The problematic instruction in question is _mm512_bitshuffle_epi64_mask. Using godbolt, I extract the corresponding LLVM name from the line

  %7 = tail call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %6, <64 x i8> %4), !dbg !377

but the following fails:

__m512i = NTuple{64, VecElement{Int8}}

x = __m512i(ntuple(_ -> rand(Int8), 64))
p = __m512i(ntuple(_ -> rand(Int8), 64))
ccall("llvm.x86.avx512.vpshufbitqmb.512", llvmcall, Int64, (__m512i, __m512i), x, p)

with ERROR: llvmcall only supports intrinsic calls.
Notice that the intrinsic returns a <64 x i1>, and I am hoping there is some casting to Int64 happening implicitly.

Trying to write it down explicitly as

using SIMD

function _test(x, p)
    __m512i = SIMD.LVec{64, Int8}

    return Base.llvmcall("""
        %3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
        %4 = bitcast <64 x i1> %3 to i64
         ret i64 %4
    """, Int64, Tuple{__m512i,__m512i}, x, p)
end

also fails with a different error

ERROR: Failed to parse LLVM assembly:
<string>:3:21: error: use of undefined value '@llvm.x86.avx512.vpshufbitqmb.512'
%3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
                    ^

It seems that the intrinsic is not recognized by LLVM. Is there a way to check if the intrinsic is available?

I am running Julia 1.9.0 with LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server).

mkitti · July 2, 2023, 1:40pm

Here is a C++ function calling the intrinsic via Godbolt with emit llvm.

Here us the generated IR.

; Function Attrs: argmemonly mustprogress nofree nosync nounwind readonly willreturn uwtable
define dso_local noundef i64 @bit_shuffle(unsigned long, unsigned char*)(i64 noundef %0, ptr nocapture noundef readonly %1) local_unnamed_addr #0 !dbg !358 {
  call void @llvm.dbg.value(metadata i64 %0, metadata !364, metadata !DIExpression()), !dbg !368
  call void @llvm.dbg.value(metadata ptr %1, metadata !365, metadata !DIExpression()), !dbg !368
  %3 = insertelement <8 x i64> undef, i64 %0, i64 0, !dbg !369
  call void @llvm.dbg.value(metadata <8 x i64> undef, metadata !366, metadata !DIExpression()), !dbg !368
  %4 = load <64 x i8>, ptr %1, align 1, !dbg !370
  %5 = bitcast <8 x i64> %3 to <64 x i8>, !dbg !374
  %6 = shufflevector <64 x i8> %5, <64 x i8> poison, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !374
  %7 = tail call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %6, <64 x i8> %4), !dbg !374
  %8 = bitcast <64 x i1> %7 to i64, !dbg !374
  call void @llvm.dbg.value(metadata i64 %8, metadata !367, metadata !DIExpression()), !dbg !368
  ret i64 %8, !dbg !375
}

; Function Attrs: nofree nosync nounwind readnone
declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #1

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare void @llvm.dbg.value(metadata, metadata, metadata) #2

attributes #0 = { argmemonly mustprogress nofree nosync nounwind readonly willreturn uwtable "frame-pointer"="none" "min-legal-vector-width"="512" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="icelake-server" "target-features"="+adx,+aes,+avx,+avx2,+avx512bitalg,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512ifma,+avx512vbmi,+avx512vbmi2,+avx512vl,+avx512vnni,+avx512vpopcntdq,+bmi,+bmi2,+clflushopt,+clwb,+crc32,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+gfni,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pconfig,+pku,+popcnt,+prfchw,+rdpid,+rdrnd,+rdseed,+sahf,+sgx,+sha,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+vaes,+vpclmulqdq,+wbnoinvd,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" }
attributes #1 = { nofree nosync nounwind readnone }
attributes #2 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

Perhaps you are missing the external declaration

declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #1

giacomogiudice · July 3, 2023, 9:34am

Thanks for your reply .

I know nothing about LLVM IR, but it seems to me that declares should be outside of the function definition.
I tried

using SIMD

function _test(x, p)
    __m512i = SIMD.LVec{64, Int8}

    return Base.llvmcall("""
        %3 = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %0, <64 x i8> %1)
        %4 = bitcast <64 x i1> %3 to i64
         ret i64 %4
         declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #
    """, Int64, Tuple{__m512i,__m512i}, x, p)
end

__m512i = NTuple{64, VecElement{Int8}}
x = __m512i(ntuple(_ -> rand(Int8), 64))
p = __m512i(ntuple(_ -> rand(Int8), 64))

_test(x, p)

and indeed I get a

<string>:6:27: error: expected instruction opcode
                          declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) #

It should be possible to call just the intrinsic, as described here.

For now I am compiling a C code and calling it from julia, but this approach has the disadvantage that the function call does not get inlined.

mkitti · July 3, 2023, 3:33pm

The declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>) creates a global identifier.

LLVM identifiers come in two basic types: global and local. Global identifiers (functions, global variables) begin with the '@' character.

I’m guessing the LLVM requires the declaration but then references an external implementation somewhere.

Perhaps @kristoffer.carlsson , the author of that blog, would be of more help. @Elrod has more experience with SIMD with packages such as VectorizationBase.jl and LoopVectorization.jl.

Elrod · July 3, 2023, 4:40pm

VectorizationBase goes through a function to still support the old llvmcall API, so it is messier than necessary.
But you can see plenty of examples of it using avx512 specific intrinsics, e.g.:
VectorizationBase.jl/src/llvm_intrin/intrin_funcs.jl at 9174dcca731144935e438d44ba07f4e4ec3a66c6 · JuliaSIMD/VectorizationBase.jl · GitHub

mkitti · July 3, 2023, 8:17pm

I found an icelake-server machine, and I sorted out a few things from your original code.

__m512i needs to be const when used globally
The return type is <64 x i1> and not i64 so we need an explicit bitcast.

julia> import Core.Intrinsics.llvmcall

julia> const __m512i = NTuple{64, VecElement{Int8}}
NTuple{64, VecElement{Int8}}

julia> vpshufbitqmb_512(a,b) = Core.Intrinsics.llvmcall(("""
       declare <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8>, <64 x i8>)
       define i64 @i64_vpshufbitqmb_512(<64 x i8> %a, <64 x i8> %b) {
         %tmp = call <64 x i1> @llvm.x86.avx512.vpshufbitqmb.512(<64 x i8> %a, <64 x i8> %b)
         %tmp2 = bitcast <64 x i1> %tmp to i64
         ret i64 %tmp2
       }
       ""","i64_vpshufbitqmb_512"), Int64, Tuple{__m512i, __m512i}, a, b)
vpshufbitqmb_512 (generic function with 1 method)

julia> x = __m512i(ntuple(_ -> rand(Int8), 64));

julia> p = __m512i(ntuple(_ -> rand(Int8), 64));

julia> vpshufbitqmb_512(x,p)
-68453262247164531

julia> versioninfo()
Julia Version 1.9.1
Commit 147bdf428c (2023-06-07 08:27 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 56 × Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 112 virtual cores

julia> Base.BinaryPlatforms.CPUID.test_cpu_feature(Base.BinaryPlatforms.CPUID.JL_X86_avx512bitalg)
true

I figured this out by looking at the following examples.

giacomogiudice · July 4, 2023, 8:45am

Wow that’s exactly what I was looking for.
Thank you so much!

Topic		Replies	Views
C routine uses AVX intrinsics General Usage interoperability , c	15	1471	September 26, 2022
Julia equivalent of C compiler intrinsics? General Usage	23	3105	November 8, 2018
`llvmcall` error "llvmcall only supports intrinsic calls" for scalar integer intrinsics on Apple M2 General Usage question , llvm	2	64	January 27, 2025
Can you call Julia methods with LLVM call? Performance	15	2127	October 1, 2022
Bit manipulations with llvmcall give strange results General Usage llvm	11	318	March 9, 2024

Calling AVX-512 intrinsics from Julia

Related topics