C routine uses AVX intrinsics

I would like to make the example in the docs work. The presentation is incomplete; while I have the requisite tools, it is unclear how best (most simply and effectively) to apply them here: what is the easiest, most foolproof way to create a file that compiles the C and allows it to ship as part of a package's distribution? I understand this is a machine-specific facility. It would be useful to do the same with similar C++.

#include <immintrin.h>

__m256 dist( __m256 a, __m256 b ) {
    return _mm256_sqrt_ps(
              _mm256_add_ps(_mm256_mul_ps(a, a),
                            _mm256_mul_ps(b, b)));
}
const m256 = NTuple{8, VecElement{Float32}}

a = m256(ntuple(i -> VecElement(sin(Float32(i))), 8))
b = m256(ntuple(i -> VecElement(cos(Float32(i))), 8))

function call_dist(a::m256, b::m256)
    @ccall "libdist".dist(a::m256, b::m256)::m256
end

println(call_dist(a,b))

To be sure, I would rather do this in Julia directly. While we have plenty of accessible and useful AVX2 acceleration, there are areas of the available SIMD instructions that have not yet been well developed in Julia. This is not a gripe – the priorities have been sound. I need finer control of some of the other SIMD instructions to improve the throughput of error-free transformations. That would benefit Double64s.
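
For context, an error-free transformation computes a floating-point result together with its exact rounding error; Double64 arithmetic is built from chains of these. A minimal sketch of the classic one, Knuth's two-sum (a standard textbook algorithm, not code from this thread):

function two_sum(a::Float64, b::Float64)
    s = a + b                        # rounded sum
    v = s - a
    e = (a - (s - v)) + (b - v)      # exact error: s + e == a + b
    return s, e
end

Kernels like this, run eight lanes at a time, are exactly where instruction-level control pays off.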

We may want to rearrange the order of the documents.

Compile your code:

gcc -mavx -fPIC -shared -o libdist.so avx.c
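
If you use Clang instead (for example on macOS), the analogous invocation should be the following – an assumption on my part; only the gcc line above is from the original post:

clang -mavx -fPIC -shared -o libdist.dylib avx.c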

Open Julia in the same directory:

julia> pwd()
"/mnt/d/proj/Julia/JuliaEn/87777-ccall"

shell> ls
avx.c  libdist.so

julia> const m256 = NTuple{8, VecElement{Float32}}
NTuple{8, VecElement{Float32}}

julia> a = m256(ntuple(i -> VecElement(sin(Float32(i))), 8))
(VecElement{Float32}(0.84147096f0), VecElement{Float32}(0.9092974f0), VecElement{Float32}(0.14112f0), VecElement{Float32}(-0.7568025f0), VecElement{Float32}(-0.9589243f0), VecElement{Float32}(-0.2794155f0), VecElement{Float32}(0.6569866f0), VecElement{Float32}(0.98935825f0))

julia> b = m256(ntuple(i -> VecElement(cos(Float32(i))), 8))
(VecElement{Float32}(0.5403023f0), VecElement{Float32}(-0.41614684f0), VecElement{Float32}(-0.9899925f0), VecElement{Float32}(-0.6536436f0), VecElement{Float32}(0.2836622f0), VecElement{Float32}(0.96017027f0), VecElement{Float32}(0.75390226f0), VecElement{Float32}(-0.14550003f0))

julia> function call_dist(a::m256, b::m256)
           @ccall "./libdist".dist(a::m256, b::m256)::m256
       end
call_dist (generic function with 1 method)

julia> println(call_dist(a,b))
(VecElement{Float32}(0.99999994f0), VecElement{Float32}(0.99999994f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0))

Note: I’m using "./libdist" to call the lib.
Calling the library by its bare name gives an error: “ERROR: could not load library “libdist””.
Maybe I need to set some environment variables.


To load shared libraries from the current path, you need to modify DL_LOAD_PATH (from the Libdl standard library): push!(DL_LOAD_PATH, ".")

julia> ccall((:dist, "libdist"), (m256), (m256,m256), a,b)
ERROR: could not load library "libdist"
libdist.so: cannot open shared object file: No such file or directory
Stacktrace:
 [1] top-level scope
   @ ./REPL[36]:1

julia> push!(DL_LOAD_PATH, ".")
1-element Vector{String}:
 "."

julia> ccall((:dist, "libdist"), (m256), (m256,m256), a,b)
(VecElement{Float32}(0.99999994f0), VecElement{Float32}(0.99999994f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0), VecElement{Float32}(1.0f0))

Next, you may want to use BinaryBuilder.jl (https://github.com/JuliaPackaging/BinaryBuilder.jl) to automatically compile the C library for each platform and package it as a JLL package.

Then you can simply do using LibDist_jll in Julia.
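
For concreteness, here is a minimal sketch of the build_tarballs.jl such a build might use. Everything in it – the LibDist name, the bundled/ source directory, the platform list – is an illustrative assumption, not a tested recipe:

using BinaryBuilder

name = "LibDist"
version = v"0.1.0"

# assumes avx.c sits in ./bundled next to this script
sources = [DirectorySource("./bundled")]

# runs inside BinaryBuilder's cross-compilation sandbox;
# ${libdir} and ${dlext} are provided by the build environment
script = raw"""
cd ${WORKSPACE}/srcdir
cc -mavx -fPIC -shared -o "${libdir}/libdist.${dlext}" avx.c
"""

# AVX only exists on x86_64
platforms = [Platform("x86_64", "linux"), Platform("x86_64", "windows"), Platform("x86_64", "macos")]

products = [LibraryProduct("libdist", :libdist)]
dependencies = Dependency[]

build_tarballs(ARGS, name, version, sources, script, platforms, products, dependencies)

The generated JLL exports a libdist constant holding the library's path, so the call site reduces to:

using LibDist_jll
@ccall libdist.dist(a::m256, b::m256)::m256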

The ccall will not be inlined, so performance will be bad; use llvmcall if you want performance.

That way you have access to every intrinsic Clang is capable of.
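
As a sketch of what that looks like for the dist kernel above (hand-written IR reusing the m256 alias from earlier; illustrative, not from the thread):

# as defined earlier in the thread
const m256 = NTuple{8, VecElement{Float32}}

# llvmcall accepts a (declarations, body) tuple of IR strings;
# the arguments arrive as %0 and %1
@inline function dist_llvm(a::m256, b::m256)
    Base.llvmcall((
        "declare <8 x float> @llvm.sqrt.v8f32(<8 x float>)",
        """
        %a2 = fmul <8 x float> %0, %0
        %b2 = fmul <8 x float> %1, %1
        %s = fadd <8 x float> %a2, %b2
        %r = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %s)
        ret <8 x float> %r
        """),
        m256, Tuple{m256,m256}, a, b)
end

Because the IR is spliced into the caller, dist_llvm(a, b) should inline and compile down to the same vmulps/vaddps/vsqrtps sequence with no library-call overhead.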

It is too early for me to say how well it works. :)

Is there something equally graspable that would allow me to make Google's Highway [portable SIMD/vector intrinsics in C++] available? Or any similar facility.

Here is an example of calling an instruction from a particular instruction set: conflict.jl in JuliaSIMD/VectorizationBase.jl on GitHub.

If you want portability, don't call instruction-set-specific intrinsics; use generic LLVM vector instructions. LLVM does a great job of lowering those. Better than Highway.
Julia thankfully exposes that to users. Clang unfortunately does not.

good information – thanks

Using Godbolt to emit LLVM IR:


define dso_local noundef <8 x float> @_Z4distDv8_fS_(<8 x float> noundef %0, <8 x float> noundef %1) #0 !dbg !261 {
  %3 = alloca <8 x float>, align 32
  %4 = alloca <8 x float>, align 32
  %5 = alloca <8 x float>, align 32
  %6 = alloca <8 x float>, align 32
  %7 = alloca <8 x float>, align 32
  %8 = alloca <8 x float>, align 32
  %9 = alloca <8 x float>, align 32
  %10 = alloca <8 x float>, align 32
  %11 = alloca <8 x float>, align 32
  store <8 x float> %0, <8 x float>* %10, align 32
  call void @llvm.dbg.declare(metadata <8 x float>* %10, metadata !266, metadata !DIExpression()), !dbg !267
  store <8 x float> %1, <8 x float>* %11, align 32
  call void @llvm.dbg.declare(metadata <8 x float>* %11, metadata !268, metadata !DIExpression()), !dbg !269
  %12 = load <8 x float>, <8 x float>* %10, align 32, !dbg !270
  %13 = load <8 x float>, <8 x float>* %10, align 32, !dbg !271
  store <8 x float> %12, <8 x float>* %3, align 32, !dbg !272
  store <8 x float> %13, <8 x float>* %4, align 32, !dbg !272
  %14 = load <8 x float>, <8 x float>* %3, align 32, !dbg !272
  %15 = load <8 x float>, <8 x float>* %4, align 32, !dbg !272
  %16 = fmul <8 x float> %14, %15, !dbg !272
  %17 = load <8 x float>, <8 x float>* %11, align 32, !dbg !273
  %18 = load <8 x float>, <8 x float>* %11, align 32, !dbg !274
  store <8 x float> %17, <8 x float>* %5, align 32, !dbg !275
  store <8 x float> %18, <8 x float>* %6, align 32, !dbg !275
  %19 = load <8 x float>, <8 x float>* %5, align 32, !dbg !275
  %20 = load <8 x float>, <8 x float>* %6, align 32, !dbg !275
  %21 = fmul <8 x float> %19, %20, !dbg !275
  store <8 x float> %16, <8 x float>* %7, align 32, !dbg !276
  store <8 x float> %21, <8 x float>* %8, align 32, !dbg !276
  %22 = load <8 x float>, <8 x float>* %7, align 32, !dbg !276
  %23 = load <8 x float>, <8 x float>* %8, align 32, !dbg !276
  %24 = fadd <8 x float> %22, %23, !dbg !276
  store <8 x float> %24, <8 x float>* %9, align 32, !dbg !277
  %25 = load <8 x float>, <8 x float>* %9, align 32, !dbg !277
  %26 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %25) #2, !dbg !277
  ret <8 x float> %26, !dbg !278
}

declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

declare <8 x float> @llvm.sqrt.v8f32(<8 x float>) #1

attributes #0 = { mustprogress noinline optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="256" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+avx,+avx2,+crc32,+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave" "tune-cpu"="generic" }
attributes #1 = { nofree nosync nounwind readnone speculatable willreturn }
attributes #2 = { nounwind }

How do I use the emitted instructions and massage them into Julia?

You could llvmcall them.
But you should already be able to see that it isn’t doing anything SIMD.jl or VectorizationBase.jl aren’t already doing. They’ve both already wrapped the llvmcall you’d need (you don’t need the allocas, and while they have wrapped the loads and stores, which are useful for many things, they’re not needed here either).

@kristoffer.carlsson wrote a blog post on how this could be done.

Once you know the LLVM intrinsics of interest, you could look in the SIMD.jl and VectorizationBase.jl sources and see if they have been wrapped for you already.

Let’s see if I can help map this out.

Intel Intrinsics

  1. _mm256_mul_ps
  2. _mm256_add_ps
  3. _mm256_sqrt_ps

The gist is that we are trying to work on 8 Float32s at a time.

LLVM IR to SIMD.jl and VectorizationBase.jl

  1. load
  2. store
  3. fmul
  4. fadd
  5. llvm.sqrt

Essentially, we have all the pieces. Now to figure out how to assemble them; a sketch follows.
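
Concretely, the correspondence might look like this with SIMD.jl spellings (my own mapping; VectorizationBase.jl has equivalents on its own Vec type):

using SIMD

va = Vec{8,Float32}((1f0, 2f0, 3f0, 4f0, 5f0, 6f0, 7f0, 8f0))
vb = Vec{8,Float32}((8f0, 7f0, 6f0, 5f0, 4f0, 3f0, 2f0, 1f0))

va * vb    # _mm256_mul_ps  ->  fmul <8 x float>
va + vb    # _mm256_add_ps  ->  fadd <8 x float>
sqrt(va)   # _mm256_sqrt_ps ->  llvm.sqrt.v8f32

The load/store instructions in the IR just move the argument tuples into registers and the result back out, so they need no explicit counterpart here.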

Thank you for the gracious walkthrough.

Here is the SIMD.jl version:

julia> using SIMD

julia> function dist(a::NTuple{8,Float32}, b::NTuple{8,Float32})::NTuple{8,Float32}
           va = Vec{8,Float32}(a)
           vb = Vec{8,Float32}(b)
           vc = sqrt(va^2 + vb^2)
           return Tuple(vc)
       end
dist (generic function with 1 method)

julia> a = b = (1.0f0, 2.0f0, 3.0f0, 4.0f0, 5.0f0, 6.0f0, 7.0f0, 8.0f0);

julia> @code_llvm debuginfo=:none dist(a,b)
define void @julia_dist_508([8 x float]* noalias nocapture sret([8 x float]) %0, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %1, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [8 x float]* %1 to <8 x float>*
  %4 = load <8 x float>, <8 x float>* %3, align 4
  %5 = bitcast [8 x float]* %2 to <8 x float>*
  %6 = load <8 x float>, <8 x float>* %5, align 4
  %7 = fmul <8 x float> %4, %4
  %8 = fmul <8 x float> %6, %6
  %9 = fadd <8 x float> %7, %8
  %10 = call <8 x float> @llvm.sqrt.v8f32(<8 x float> %9)
  %11 = bitcast [8 x float]* %0 to <8 x float>*
  store <8 x float> %10, <8 x float>* %11, align 4
  ret void
}

julia> @code_native debuginfo=:none dist(a,b)
	.text
	.file	"dist"
	.globl	julia_dist_513                  # -- Begin function julia_dist_513
	.p2align	4, 0x90
	.type	julia_dist_513,@function
julia_dist_513:                         # @julia_dist_513
	.cfi_startproc
# %bb.0:                                # %top
	vmovups	(%rsi), %ymm0
	vmovups	(%rdx), %ymm1
	movq	%rdi, %rax
	vmulps	%ymm0, %ymm0, %ymm0
	vmulps	%ymm1, %ymm1, %ymm1
	vaddps	%ymm1, %ymm0, %ymm0
	vsqrtps	%ymm0, %ymm0
	vmovups	%ymm0, (%rdi)
	vzeroupper
	retq
.Lfunc_end0:
	.size	julia_dist_513, .Lfunc_end0-julia_dist_513
	.cfi_endproc
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

At its core (vmulps, vaddps, vsqrtps) that looks similar to what the C version compiles to on x86_64; the C below was compiled without optimization (note the optnone attribute in the IR above), which accounts for all the extra stack traffic:

dist(float __vector(8), float __vector(8)):                         # @dist(float __vector(8), float __vector(8))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 320
        vmovaps ymmword ptr [rsp + 32], ymm0
        vmovaps ymmword ptr [rsp], ymm1
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm0, ymmword ptr [rsp + 32]
        vmovaps ymmword ptr [rsp + 256], ymm1
        vmovaps ymmword ptr [rsp + 224], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 256]
        vmulps  ymm1, ymm0, ymmword ptr [rsp + 224]
        vmovaps ymm2, ymmword ptr [rsp]
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymmword ptr [rsp + 192], ymm2
        vmovaps ymmword ptr [rsp + 160], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmulps  ymm0, ymm0, ymmword ptr [rsp + 160]
        vmovaps ymmword ptr [rsp + 128], ymm1
        vmovaps ymmword ptr [rsp + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vaddps  ymm0, ymm0, ymmword ptr [rsp + 96]
        vmovaps ymmword ptr [rsp + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 64]
        vsqrtps ymm0, ymm0
        mov     rsp, rbp
        pop     rbp
        ret

Godbolt link to assembly

I also did this with VectorizationBase.jl on another computer with a Xeon as an exercise for myself. Admittedly, we could use some improved documentation there, but perhaps it is still not meant to be used directly?

julia> using VectorizationBase

julia> dist(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} = return ntuple(sqrt(Vec(a...)^2 + Vec(b...)^2), Val(N))
dist (generic function with 2 methods)

julia> a = b = (1.f0,2.f0,3.f0,4.f0,5.f0,6.f0,7.f0,8.f0)
(1.0f0, 2.0f0, 3.0f0, 4.0f0, 5.0f0, 6.0f0, 7.0f0, 8.0f0)

julia> @code_llvm debuginfo=:none dist(a,b)
; Function Attrs: uwtable
define void @julia_dist_1430([8 x float]* noalias nocapture sret([8 x float]) %0, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %1, [8 x float]* nocapture nonnull readonly align 4 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [8 x float]* %1 to <8 x float>*
  %4 = load <8 x float>, <8 x float>* %3, align 4
  %res.i26 = fmul reassoc nsz arcp contract afn <8 x float> %4, %4
  %5 = bitcast [8 x float]* %2 to <8 x float>*
  %6 = load <8 x float>, <8 x float>* %5, align 4
  %res.i17 = fmul reassoc nsz arcp contract afn <8 x float> %6, %6
  %res.i16 = fadd nsz contract <8 x float> %res.i26, %res.i17
  %res.i15 = call fast <8 x float> @llvm.sqrt.v8f32(<8 x float> %res.i16)
  %7 = bitcast [8 x float]* %0 to <8 x float>*
  store <8 x float> %res.i15, <8 x float>* %7, align 4
  ret void
}

julia> @code_native debuginfo=:none dist(a,b)
        .text
        .file   "dist"
        .globl  julia_dist_1446                 # -- Begin function julia_dist_1446
        .p2align        4, 0x90
        .type   julia_dist_1446,@function
julia_dist_1446:                        # @julia_dist_1446
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vmovups (%rdx), %ymm0
        vmovups (%r8), %ymm1
        movq    %rcx, %rax
        vmulps  %ymm1, %ymm1, %ymm1
        vfmadd231ps     %ymm0, %ymm0, %ymm1     # ymm1 = (ymm0 * ymm0) + ymm1
        vsqrtps %ymm1, %ymm0
        vmovups %ymm0, (%rcx)
        popq    %rbp
        vzeroupper
        retq
.Lfunc_end0:
        .size   julia_dist_1446, .Lfunc_end0-julia_dist_1446
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

What is impressive here is that the compiler chose an advanced instruction, vfmadd231ps, which is a fused multiply-add from the FMA extension; the contract fast-math flag visible in the IR above is what licenses fusing the fmul and fadd. The great thing about Julia is that it can easily adapt to the architecture it is being used on:

https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
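
In plain Julia the scalar hook for this is muladd, which permits (but does not require) fusion; a one-line illustration, not from the thread:

# lowers to LLVM's fmuladd, which becomes a single vfmadd* on FMA hardware
f(a, b, c) = muladd(a, b, c)

SIMD.jl and VectorizationBase.jl offer muladd on their vector types as well, fusing whole registers at a time.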

This version also adapts easily to 512-bit registers (see the zmm registers below; the integer tuples are first converted to doubles by vcvtqq2pd):

julia> a = b = (1, 2, 3, 4, 5, 6, 7, 8)
(1, 2, 3, 4, 5, 6, 7, 8)

julia> @code_native debuginfo=:none dist(a,b)
        .text
        .file   "dist"
        .globl  julia_dist_1521                 # -- Begin function julia_dist_1521
        .p2align        4, 0x90
        .type   julia_dist_1521,@function
julia_dist_1521:                        # @julia_dist_1521
        .cfi_startproc
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vcvtqq2pd       (%rdx), %zmm0
        vcvtqq2pd       (%r8), %zmm1
        vmulpd  %zmm1, %zmm1, %zmm1
        vfmadd231pd     %zmm0, %zmm0, %zmm1     # zmm1 = (zmm0 * zmm0) + zmm1
        vsqrtpd %zmm1, %zmm0
        movq    %rcx, %rax
        vmovupd %ymm0, (%rcx)
        vextractf64x4   $1, %zmm0, 32(%rcx)
        popq    %rbp
        vzeroupper
        retq
.Lfunc_end0:
        .size   julia_dist_1521, .Lfunc_end0-julia_dist_1521
        .cfi_endproc
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits