C routine uses AVX intrinsics

I would like to make the example in the docs work. The presentation is incomplete; while I have the requisite tools, it is unclear how best (most simply and effectively) to apply them here: what is the easiest, most foolproof way to create a file that compiles the C and allows it to ship as part of a package's distribution? I understand this is a machine-specific facility. It would be useful to do the same with similar C++.

#include <immintrin.h>

__m256 dist( __m256 a, __m256 b ) {
    return _mm256_sqrt_ps(
              _mm256_add_ps(_mm256_mul_ps(a, a),
                            _mm256_mul_ps(b, b)));
}
const m256 = NTuple{8, VecElement{Float32}}

a = m256(ntuple(i -> VecElement(sin(Float32(i))), 8))
b = m256(ntuple(i -> VecElement(cos(Float32(i))), 8))

function call_dist(a::m256, b::m256)
    @ccall "libdist".dist(a::m256, b::m256)::m256
end

println(call_dist(a,b))

To be sure, I would rather do this in Julia directly. While we have plenty of accessible and useful AVX2 acceleration, there are areas of the available SIMD instructions that have not yet been well developed in Julia. This is not a gripe – the priorities have been sound. I need finer control of some of the other SIMD instructions to improve the throughput of error-free transformations. That would benefit Double64s.
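
For context, an error-free transformation computes a floating-point result together with its exact rounding error; Double64 arithmetic is built from chains of these. A minimal sketch of the classic one, Knuth's two-sum (a standard textbook algorithm, not code from this thread):

function two_sum(a::Float64, b::Float64)
    s = a + b                        # rounded sum
    v = s - a
    e = (a - (s - v)) + (b - v)      # exact error: s + e == a + b
    return s, e
end

Kernels like this, run eight lanes at a time, are exactly where instruction-level control pays off.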

We may want to rearrange the order of the documents.

Compile your code:

gcc -mavx -fPIC -shared -o libdist.so avx.c
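
If you use Clang instead (for example on macOS), the analogous invocation should be the following – an assumption on my part; only the gcc line above is from the original post:

clang -mavx -fPIC -shared -o libdist.dylib avx.c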

Open Julia in the same directory:

julia> pwd()
"/mnt/d/proj/Julia/JuliaEn/87777-ccall"

shell> ls
avx.c  libdist.so

julia> const m256 = NTuple{8, VecElement{Float32}}
NTuple{8, VecElement{Float32}}

julia> a = m256(ntuple(i -> VecElement(sin(Float32(i))), 8))
(VecElement{Float32}(0.84147096f0), VecElement{Float32}(0.9092974f0), VecElement{Float32}(0.14112f0), VecElement{Float32}(-0.7568025f0), VecElement{Float32}(-0.9589243f0), VecElement{Float32}(-0.2794155f0), VecElement{Float32}(0.6569866f0), VecElement{Float32}(0.98935825f0))

julia> b = m256(ntuple(i -> VecElement(cos(Float32(i))), 8))
(VecElement{Float32}(0.5403023f0), VecElement{Float32}(-0.41614684f0), VecElement{Float32}(-0.9899925f0), VecElement{Float32}(-0.6536436f0), VecElement{Float32}(0.2836622f0), VecElement{Float32}(0.96017027f0), VecElement{Float32}(0.75390226f0), VecElement{Float32}(-0.14550003f0))

julia> function call_dist(a::m256, b::m256)
           @ccall "./libdist".dist(a::m256, b::m256)::m256
       end
call_dist (generic function with 1 method)

julia> println(call_dist(a,b))
(VecElement{Float32}(0.99999994f0), VecElement{Float32}(0.99999994f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0))

Note: I’m using "./libdist" to call the lib.
Calling the library by its bare name gives an error: “ERROR: could not load library “libdist””.
Maybe I need to set some environment variables.


To load shared libraries from the current path, you need to modify DL_LOAD_PATH (from the Libdl standard library): push!(DL_LOAD_PATH, ".")

julia> ccall((:dist, "libdist"), (m256), (m256,m256), a,b)
ERROR: could not load library "libdist"
libdist.so: cannot open shared object file: No such file or directory
Stacktrace:
 [1] top-level scope
   @ ./REPL[36]:1

julia> push!(DL_LOAD_PATH, ".")
1-element Vector{String}:
 "."

julia> ccall((:dist, "libdist"), (m256), (m256,m256), a,b)
(VecElement{Float32}(0.99999994f0), VecElement{Float32}(0.99999994f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0))

Next, you may want to use BinaryBuilder.jl (https://github.com/JuliaPackaging/BinaryBuilder.jl) to automatically compile the C library for each platform and package it as a JLL package.

Then you can simply do using LibDist_jll in Julia.
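
For concreteness, here is a minimal sketch of the build_tarballs.jl such a build might use. Everything in it – the LibDist name, the bundled/ source directory, the platform list – is an illustrative assumption, not a tested recipe:

using BinaryBuilder

name = "LibDist"
version = v"0.1.0"

# assumes avx.c sits in ./bundled next to this script
sources = [DirectorySource("./bundled")]

# runs inside BinaryBuilder's cross-compilation sandbox;
# ${libdir} and ${dlext} are provided by the build environment
script = raw"""
cd ${WORKSPACE}/srcdir
cc -mavx -fPIC -shared -o "${libdir}/libdist.${dlext}" avx.c
"""

# AVX only exists on x86_64
platforms = [Platform("x86_64", "linux"), Platform("x86_64", "windows"), Platform("x86_64", "macos")]

products = [LibraryProduct("libdist", :libdist)]
dependencies = Dependency[]

build_tarballs(ARGS, name, version, sources, script, platforms, products, dependencies)

The generated JLL exports a libdist constant holding the library's path, so the call site reduces to:

using LibDist_jll
@ccall libdist.dist(a::m256, b::m256)::m256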

The ccall will not be inlined, so performance will be bad; use llvmcall if you want performance.

That way you have access to every intrinsic Clang is capable of.
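
As a sketch of what that looks like for the dist kernel above (hand-written IR reusing the m256 alias from earlier; illustrative, not from the thread):

# as defined earlier in the thread
const m256 = NTuple{8, VecElement{Float32}}

# llvmcall accepts a (declarations, body) tuple of IR strings;
# the arguments arrive as %0 and %1
@inline function dist_llvm(a::m256, b::m256)
    Base.llvmcall((
        "declare <8 x float> @llvm.sqrt.v8f32(<8 x float>)",
        """
        %a2 = fmul <8 x float> %0, %0
        %b2 = fmul <8 x float> %1, %1
        %s = fadd <8 x float> %a2, %b2
        %r = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %s)
        ret <8 x float> %r
        """),
        m256, Tuple{m256,m256}, a, b)
end

Because the IR is spliced into the caller, dist_llvm(a, b) should inline and compile down to the same vmulps/vaddps/vsqrtps sequence with no library-call overhead.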

It is too early for me to say how well it works. :)

Is there something equally graspable that would allow me to make Google's Highway [portable SIMD/vector intrinsics in C++] available? Or any similar facility.

Here is an example of calling an instruction from a particular instruction set: conflict.jl in JuliaSIMD/VectorizationBase.jl on GitHub.

If you want portability, don't call instruction-set-specific intrinsics; use generic LLVM vector instructions. LLVM does a great job of lowering those. Better than Highway.
Julia thankfully exposes that to users. Clang unfortunately does not.

good information – thanks

Using Godbolt to emit LLVM IR:


define dso_local noundef <8 x float> @_Z4distDv8_fS_(<8 x float> noundef %0, <8 x float> noundef %1) #0 !dbg !261 {
  %3 = alloca <8 x float>, align 32
  %4 = alloca <8 x float>, align 32
  %5 = alloca <8 x float>, align 32
  %6 = alloca <8 x float>, align 32
  %7 = alloca <8 x float>, align 32
  %8 = alloca <8 x float>, align 32
  %9 = alloca <8 x float>, align 32
  %10 = alloca <8 x float>, align 32
  %11 = alloca <8 x float>, align 32
  store <8 x float> %0, <8 x float>* %10, align 32
  call void @llvm.dbg.declare(metadata <8 x float>* %10, metadata !266, metadata !DIExpression()), !dbg !267
  store <8 x float> %1, <8 x float>* %11, align 32
  call void @llvm.dbg.declare(metadata <8 x float>* %11, metadata !268, metadata !DIExpression()), !dbg !269
  %12 = load <8 x float>, <8 x float>* %10, align 32, !dbg !270
  %13 = load <8 x float>, <8 x float>* %10, align 32, !dbg !271
  store <8 x float> %12, <8 x float>* %3, align 32, !dbg !272
  store <8 x float> %13, <8 x float>* %4, align 32, !dbg !272
  %14 = load <8 x float>, <8 x float>* %3, align 32, !dbg !272
  %15 = load <8 x float>, <8 x float>* %4, align 32, !dbg !272
  %16 = fmul <8 x float> %14, %15, !dbg !272
  %17 = load <8 x float>, <8 x float>* %11, align 32, !dbg !273
  %18 = load <8 x float>, <8 x float>* %11, align 32, !dbg !274
  store <8 x float> %17, <8 x float>* %5, align 32, !dbg !275
  store <8 x float> %18, <8 x float>* %6, align 32, !dbg !275
  %19 = load <8 x float>, <8 x float>* %5, align 32, !dbg !275
  %20 = load <8 x float>, <8 x float>* %6, align 32, !dbg !275
  %21 = fmul <8 x float> %19, %20, !dbg !275
  store <8 x float> %16, <8 x float>* %7, align 32, !dbg !276
  store <8 x float> %21, <8 x float>* %8, align 32, !dbg !276
  %22 = load <8 x float>, <8 x float>* %7, align 32, !dbg !276
  %23 = load <8 x float>, <8 x float>* %8, align 32, !dbg !276
  %24 = fadd <8 x float> %22, %23, !dbg !276
  store <8 x float> %24, <8 x float>* %9, align 32, !dbg !277
  %25 = load <8 x float>, <8 x float>* %9, align 32, !dbg !277
  %26 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %25) #2, !dbg !277
  ret <8 x float> %26, !dbg !278
}

declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

declare <8 x float> @llvm.sqrt.v8f32(<8 x float>) #1

attributes #0 = { mustprogress noinline optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="256" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+avx,+avx2,+crc32,+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave" "tune-cpu"="generic" }
attributes #1 = { nofree nosync nounwind readnone speculatable willreturn }
attributes #2 = { nounwind }

How do I use the emitted instructions and massage them into Julia?

You could llvmcall them.
But you should already be able to see that it isn’t doing anything SIMD.jl or VectorizationBase.jl aren’t already doing. They’ve both already wrapped the llvmcall you’d need (you don’t need the allocas, and while they have wrapped the loads and stores, which are useful for many things, they’re not needed here either).

@kristoffer.carlsson wrote a blog post on how this could be done.

Once you know the LLVM intrinsics of interest, you could look in the SIMD.jl and VectorizationBase.jl sources and see if they have been wrapped for you already.

Let’s see if I can help map this out.

Intel Intrinsics

  1. _mm256_mul_ps
  2. _mm256_add_ps
  3. _mm256_sqrt_ps

The gist is that we are trying to work on 8 Float32s at a time.

LLVM IR to SIMD.jl and VectorizationBase.jl

  1. load
  2. store
  3. fmul
  4. fadd
  5. llvm.sqrt

Essentially, we have all the pieces. Now to figure out how to assemble them; a sketch follows.
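
Concretely, the correspondence might look like this with SIMD.jl spellings (my own mapping; VectorizationBase.jl has equivalents on its own Vec type):

using SIMD

va = Vec{8,Float32}((1f0, 2f0, 3f0, 4f0, 5f0, 6f0, 7f0, 8f0))
vb = Vec{8,Float32}((8f0, 7f0, 6f0, 5f0, 4f0, 3f0, 2f0, 1f0))

va * vb    # _mm256_mul_ps  ->  fmul <8 x float>
va + vb    # _mm256_add_ps  ->  fadd <8 x float>
sqrt(va)   # _mm256_sqrt_ps ->  llvm.sqrt.v8f32

The load/store instructions in the IR just move the argument tuples into registers and the result back out, so they need no explicit counterpart here.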

Thank you for the gracious walkthrough.

Here is the SIMD.jl version:

julia> using SIMD

julia> function dist(a::NTuple{8,Float32}, b::NTuple{8,Float32})::NTuple{8,Float32}
           va = Vec{8,Float32}(a)
           vb = Vec{8,Float32}(b)
           vc = sqrt(va^2 + vb^2)
           return Tuple(vc)
       end
dist (generic function with 1 method)

julia> a = b = (1.0f0, 2.0f0, 3.0f0, 4.0f0, 5.0f0, 6.0f0, 7.0f0, 8.0f0);

julia> @code_llvm debuginfo=:none dist(a,b)
define void @julia_dist_508([8 x float]* noalias nocapture sret([8 x float]) %0, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %1, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [8 x float]* %1 to <8 x float>*
  %4 = load <8 x float>, <8 x float>* %3, align 4
  %5 = bitcast [8 x float]* %2 to <8 x float>*
  %6 = load <8 x float>, <8 x float>* %5, align 4
  %7 = fmul <8 x float> %4, %4
  %8 = fmul <8 x float> %6, %6
  %9 = fadd <8 x float> %7, %8
  %10 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %9)
  %11 = bitcast [8 x float]* %0 to <8 x float>*
  store <8 x float> %10, <8 x float>* %11, align 4
  ret void
}

julia> @code_native debuginfo=:none dist(a,b)
	.text
	.file	"dist"
	.globl	julia_dist_513                  # -- Begin function julia_dist_513
	.p2align	4, 0x90
	.type	julia_dist_513,@function
julia_dist_513:                         # @julia_dist_513
	.cfi_startproc
# %bb.0:                                # %top
	vmovups	(%rsi), %ymm0
	vmovups	(%rdx), %ymm1
	movq	%rdi, %rax
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vaddps	%ymm1, %ymm0, %ymm0
	vsqrtps	%ymm0, %ymm0
	vmovups	%ymm0, (%rdi)
	vzeroupper
	retq
.Lfunc_end0:
	.size	julia_dist_513, .Lfunc_end0-julia_dist_513
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

At its core (vmulps, vaddps, vsqrtps) that looks similar to what the C version compiles to on x86_64; the C below was compiled without optimization (note the optnone attribute in the IR above), which accounts for all the extra stack traffic:

dist(float __vector(8), float __vector(8)):                         # @dist(float __vector(8), float __vector(8))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 320
        vmovaps ymmword ptr [rsp + 32], ymm0
        vmovaps ymmword ptr [rsp], ymm1
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm0, ymmword ptr [rsp + 32]
        vmovaps ymmword ptr [rsp + 256], ymm1
        vmovaps ymmword ptr [rsp + 224], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 256]
        vmulps  ymm1, ymm0, ymmword ptr [rsp + 224]
        vmovaps ymm2, ymmword ptr [rsp]
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymmword ptr [rsp + 192], ymm2
        vmovaps ymmword ptr [rsp + 160], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmulps  ymm0, ymm0, ymmword ptr [rsp + 160]
        vmovaps ymmword ptr [rsp + 128], ymm1
        vmovaps ymmword ptr [rsp + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vaddps  ymm0, ymm0, ymmword ptr [rsp + 96]
        vmovaps ymmword ptr [rsp + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 64]
        vsqrtps ymm0, ymm0
        mov     rsp, rbp
        pop     rbp
        ret

Godbolt link to assembly

I also did this with VectorizationBase.jl on another computer with a Xeon as an exercise for myself. Admittedly, we could use some improved documentation there, but perhaps it is still not meant to be used directly?

julia> using VectorizationBase

julia> dist(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} = return ntuple(sqrt(Vec(a...)^2 + Vec(b...)^2), Val(N))
dist (generic function with 2 methods)

julia> a = b = (1.f0,2.f0,3.f0,4.f0,5.f0,6.f0,7.f0,8.f0)
(1.0f0, 2.0f0, 3.0f0, 4.0f0, 5.0f0, 6.0f0, 7.0f0, 8.0f0)

julia> @code_llvm debuginfo=:none dist(a,b)
; Function Attrs: uwtable
define void @julia_dist_1430([8 x float]* noalias nocapture sret([8 x float]) %0, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %1, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [8 x float]* %1 to <8 x float>*
  %4 = load <8 x float>, <8 x float>* %3, align 4
  %res.i26 = fmul reassoc nsz arcp contract afn <8 x float> %4, %4
  %5 = bitcast [8 x float]* %2 to <8 x float>*
  %6 = load <8 x float>, <8 x float>* %5, align 4
  %res.i17 = fmul reassoc nsz arcp contract afn <8 x float> %6, %6
  %res.i16 = fadd nsz contract <8 x float> %res.i26, %res.i17
  %res.i15 = call fast <8 x float> @llvm.sqrt.v8f32(<8 x float> %res.i16)
  %7 = bitcast [8 x float]* %0 to <8 x float>*
  store <8 x float> %res.i15, <8 x float>* %7, align 4
  ret void
}

julia> @code_native debuginfo=:none dist(a,b)
        .text
        .file   "dist"
        .globl  julia_dist_1446                 # -- Begin function julia_dist_1446
        .p2align        4, 0x90
        .type   julia_dist_1446,@function
julia_dist_1446:                        # @julia_dist_1446
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vmovups (%rdx), %ymm0
        vmovups (%r8), %ymm1
        movq    %rcx, %rax
        vmulps  %ymm1, %ymm1, %ymm1
        vfmadd231ps     %ymm0, %ymm0, %ymm1     # ymm1 = (ymm0 * ymm0) + ymm1
        vsqrtps %ymm1, %ymm0
        vmovups %ymm0, (%rcx)
        popq    %rbp
        vzeroupper
        retq
.Lfunc_end0:
        .size   julia_dist_1446, .Lfunc_end0-julia_dist_1446
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

What is impressive here is that the compiler chose an advanced instruction, vfmadd231ps, which is a fused multiply-add from the FMA extension; the contract fast-math flag visible in the IR above is what licenses fusing the fmul and fadd. The great thing about Julia is that it can easily adapt to the architecture it is being used on:

https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
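
In plain Julia the scalar hook for this is muladd, which permits (but does not require) fusion; a one-line illustration, not from the thread:

# lowers to LLVM's fmuladd, which becomes a single vfmadd* on FMA hardware
f(a, b, c) = muladd(a, b, c)

SIMD.jl and VectorizationBase.jl offer muladd on their vector types as well, fusing whole registers at a time.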

This version also adapts easily to 512-bit registers (see the zmm registers below; the integer tuples are first converted to doubles by vcvtqq2pd):

julia> a = b = (1, 2, 3, 4, 5, 6, 7, 8)
(1, 2, 3, 4, 5, 6, 7, 8)

julia> @code_native debuginfo=:none dist(a,b)
        .text
        .file   "dist"
        .globl  julia_dist_1521                 # -- Begin function julia_dist_1521
        .p2align        4, 0x90
        .type   julia_dist_1521,@function
julia_dist_1521:                        # @julia_dist_1521
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vcvtqq2pd       (%rdx), %zmm0
        vcvtqq2pd       (%r8), %zmm1
        vmulpd  %zmm1, %zmm1, %zmm1
        vfmadd231pd     %zmm0, %zmm0, %zmm1     # zmm1 = (zmm0 * zmm0) + zmm1
        vsqrtpd %zmm1, %zmm0
        movq    %rcx, %rax
        vmovupd %ymm0, (%rcx)
        vextractf64x4   $1, %zmm0, 32(%rcx)
        popq    %rbp
        vzeroupper
        retq
.Lfunc_end0:
        .size   julia_dist_1521, .Lfunc_end0-julia_dist_1521
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits