Min/max swap using SIMD

gitboy16 · March 11, 2025, 9:46pm

I would like to request some help to SIMD the following code:

import Test: @test

n = 10_000_000

A = repeat([3, 1, 4, 2, 7, 8, 5, 6], outer=n);
B = repeat([5, 6, 0, 7, 2, 3, 1, 4], outer=n);

C = copy(A);
D = copy(B);

min__(x,y) = x < y ? x : y
max__(x,y) = x > y ? x : y

function test1(x,y)
  for i in eachindex(x)
    tmp = x[i]
    x[i] = min__(x[i],y[i])
    y[i] = max__(tmp,y[i])
  end
end

for i in 1:3
  C = copy(A)
  D = copy(B)
  @time test1(C,D)
end

@test iszero(C .> D)

This code compares 2 vectors of same size (C and D) and for each index moves the smallest value to C and the largest value to D. It is possible to speed up the computation using multithreading but what I am interested in is doing it using SIMD, ideally AVX-512. I tried @simd but it did not yield any gain.

Thank you in advance.

yolhan_mannes · March 11, 2025, 9:49pm

You could try using ifelse() instead of control flow ensuring inlining of the __min function and (if not enough )try LoopVectorization.jl

gitboy16 · March 11, 2025, 9:50pm

I tried LoopVectorization, it returned me a warning and did not yield any gain.

yolhan_mannes · March 11, 2025, 9:52pm

Loop vectorization does not support control flow which is why I told you to use ifelse() function instead and it doesn’t support eachindex too so you would need to set L=length(X) and make the for loop for i in 1:L. Also don’t time the first call simd happens at compile time you won’t see it. Check the code_llvm to see if it simd

yolhan_mannes · March 11, 2025, 10:00pm

I think it does simd already

julia> @code_llvm test1(A,B)
; Function Signature: test1(Array{Int64, 1}, Array{Int64, 1})
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:15 within `test1`
; Function Attrs: uwtable
define void @julia_test1_42981(ptr noundef nonnull align 8 dereferenceable(24) %"x::Array", ptr noundef nonnull align 8 dereferenceable(24) %"y::Array") #0 {
top:
  %"new::Tuple95" = alloca [1 x i64], align 8
  %"new::Tuple103" = alloca [1 x i64], align 8
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:16 within `test1`
; ┌ @ abstractarray.jl:321 within `eachindex`
; │┌ @ abstractarray.jl:137 within `axes1`
; ││┌ @ abstractarray.jl:98 within `axes`
; │││┌ @ array.jl:194 within `size`
      %0 = getelementptr inbounds i8, ptr %"x::Array", i64 16
      %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
; └└└└
; ┌ @ range.jl:904 within `iterate`
; │┌ @ range.jl:681 within `isempty`
; ││┌ @ operators.jl:379 within `>`
; │││┌ @ int.jl:83 within `<`
      %1 = icmp slt i64 %.size.sroa.0.0.copyload, 1
; └└└└
  br i1 %1, label %L144, label %L13.preheader

L13.preheader:                                    ; preds = %top
  %2 = load ptr, ptr %"x::Array", align 8
  %3 = getelementptr inbounds i8, ptr %"y::Array", i64 16
  %.size30.sroa.0.0.copyload = load i64, ptr %3, align 8
  %4 = load ptr, ptr %"y::Array", align 8
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   %smin = call i64 @llvm.smin.i64(i64 %.size30.sroa.0.0.copyload, i64 0)
   %5 = sub i64 %.size30.sroa.0.0.copyload, %smin
   %smax = call i64 @llvm.smax.i64(i64 %smin, i64 -1)
   %6 = add nsw i64 %smax, 1
   %7 = mul nuw nsw i64 %5, %6
   %umin = call i64 @llvm.umin.i64(i64 %.size.sroa.0.0.copyload, i64 %7)
   %.not = icmp eq i64 %umin, 0
   br i1 %.not, label %main.pseudo.exit, label %L129.preheader

L129.preheader:                                   ; preds = %L13.preheader
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:20 within `test1`
  %min.iters.check = icmp ult i64 %umin, 8
  br i1 %min.iters.check, label %scalar.ph, label %vector.memcheck

vector.memcheck:                                  ; preds = %L129.preheader
  %8 = shl i64 %umin, 3
  %uglygep = getelementptr i8, ptr %2, i64 %8
  %uglygep164 = getelementptr i8, ptr %4, i64 %8
  %bound0 = icmp ult ptr %2, %uglygep164
  %bound1 = icmp ult ptr %4, %uglygep
  %found.conflict = and i1 %bound0, %bound1
  br i1 %found.conflict, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.memcheck
  %n.vec = and i64 %umin, -8
  %ind.end = or i64 %n.vec, 1
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %9 = getelementptr inbounds i64, ptr %2, i64 %index
   %wide.load = load <4 x i64>, ptr %9, align 8
   %10 = getelementptr inbounds i64, ptr %9, i64 4
   %wide.load165 = load <4 x i64>, ptr %10, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %11 = getelementptr inbounds i64, ptr %4, i64 %index
   %wide.load166 = load <4 x i64>, ptr %11, align 8
   %12 = getelementptr inbounds i64, ptr %11, i64 4
   %wide.load167 = load <4 x i64>, ptr %12, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:12 within `min__`
   %13 = call <4 x i64> @llvm.smin.v4i64(<4 x i64> %wide.load, <4 x i64> %wide.load166)
   %14 = call <4 x i64> @llvm.smin.v4i64(<4 x i64> %wide.load165, <4 x i64> %wide.load167)
; └
; ┌ @ array.jl:987 within `setindex!`
   store <4 x i64> %13, ptr %9, align 8
   store <4 x i64> %14, ptr %10, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:19 within `test1`
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:13 within `max__`
   %15 = call <4 x i64> @llvm.smax.v4i64(<4 x i64> %wide.load166, <4 x i64> %wide.load)
   %16 = call <4 x i64> @llvm.smax.v4i64(<4 x i64> %wide.load167, <4 x i64> %wide.load165)
; └
; ┌ @ array.jl:987 within `setindex!`
   store <4 x i64> %15, ptr %11, align 8
   store <4 x i64> %16, ptr %12, align 8
   %index.next = add nuw i64 %index, 8
   %17 = icmp eq i64 %index.next, %n.vec
   br i1 %17, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:20 within `test1`
  %cmp.n = icmp eq i64 %umin, %n.vec
  br i1 %cmp.n, label %main.exit.selector, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %vector.memcheck, %L129.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L129.preheader ], [ 1, %vector.memcheck ]
  br label %L129

L26:                                              ; preds = %L13.postloop
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   store i64 %value_phi4.postloop, ptr %"new::Tuple103", align 8
   call void @j_throw_boundserror_42997(ptr nonnull %"x::Array", ptr nocapture nonnull readonly %"new::Tuple103") #8
   unreachable

L62:                                              ; preds = %L47.postloop
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   store i64 %value_phi4.postloop, ptr %"new::Tuple95", align 8
   call void @j_throw_boundserror_42997(ptr nonnull %"y::Array", ptr nocapture nonnull readonly %"new::Tuple95") #8
   unreachable

L129:                                             ; preds = %L129, %scalar.ph
   %value_phi4 = phi i64 [ %24, %L129 ], [ %bc.resume.val, %scalar.ph ]
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   %18 = add nsw i64 %value_phi4, -1
; │ @ essentials.jl:917 within `getindex`
   %19 = getelementptr inbounds i64, ptr %2, i64 %18
   %20 = load i64, ptr %19, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %21 = getelementptr inbounds i64, ptr %4, i64 %18
   %22 = load i64, ptr %21, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:12 within `min__`
   %value_phi41 = call i64 @llvm.smin.i64(i64 %20, i64 %22)
; └
; ┌ @ array.jl:987 within `setindex!`
   store i64 %value_phi41, ptr %19, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:19 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %23 = load i64, ptr %21, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:13 within `max__`
   %value_phi68 = call i64 @llvm.smax.i64(i64 %23, i64 %20)
; └
; ┌ @ array.jl:987 within `setindex!`
   store i64 %value_phi68, ptr %21, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:20 within `test1`
; ┌ @ range.jl:908 within `iterate`
   %24 = add nuw i64 %value_phi4, 1
; └
  %.not161 = icmp ult i64 %value_phi4, %umin
  br i1 %.not161, label %L129, label %main.exit.selector

main.exit.selector:                               ; preds = %L129, %middle.block
  %value_phi4.lcssa = phi i64 [ %umin, %middle.block ], [ %value_phi4, %L129 ]
; ┌ @ range.jl:908 within `iterate`
   %.lcssa = phi i64 [ %ind.end, %middle.block ], [ %24, %L129 ]
; └
  %25 = icmp ult i64 %value_phi4.lcssa, %.size.sroa.0.0.copyload
  br i1 %25, label %main.pseudo.exit, label %L144

main.pseudo.exit:                                 ; preds = %main.exit.selector, %L13.preheader
  %value_phi4.copy = phi i64 [ 1, %L13.preheader ], [ %.lcssa, %main.exit.selector ]
  br label %L13.postloop

L144:                                             ; preds = %L129.postloop, %main.exit.selector, %top
  ret void

L13.postloop:                                     ; preds = %L129.postloop, %main.pseudo.exit
  %value_phi4.postloop = phi i64 [ %32, %L129.postloop ], [ %value_phi4.copy, %main.pseudo.exit ]
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   %26 = add i64 %value_phi4.postloop, -1
   %.not.postloop = icmp ult i64 %26, %.size.sroa.0.0.copyload
   br i1 %.not.postloop, label %L47.postloop, label %L26

L47.postloop:                                     ; preds = %L13.postloop
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:916 within `getindex`
   %.not126.postloop = icmp ult i64 %26, %.size30.sroa.0.0.copyload
   br i1 %.not126.postloop, label %L129.postloop, label %L62

L129.postloop:                                    ; preds = %L47.postloop
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:17 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %27 = getelementptr inbounds i64, ptr %2, i64 %26
   %28 = load i64, ptr %27, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %29 = getelementptr inbounds i64, ptr %4, i64 %26
   %30 = load i64, ptr %29, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:12 within `min__`
   %value_phi41.postloop = call i64 @llvm.smin.i64(i64 %28, i64 %30)
; └
; ┌ @ array.jl:987 within `setindex!`
   store i64 %value_phi41.postloop, ptr %27, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:19 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %31 = load i64, ptr %29, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:13 within `max__`
   %value_phi68.postloop = call i64 @llvm.smax.i64(i64 %31, i64 %28)
; └
; ┌ @ array.jl:987 within `setindex!`
   store i64 %value_phi68.postloop, ptr %29, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:20 within `test1`
; ┌ @ range.jl:908 within `iterate`
; │┌ @ promotion.jl:639 within `==`
    %.not132.not.postloop = icmp eq i64 %value_phi4.postloop, %.size.sroa.0.0.copyload
; │└
   %32 = add i64 %value_phi4.postloop, 1
; └
  br i1 %.not132.not.postloop, label %L144, label %L13.postloop
}

you see it in those line

   %9 = getelementptr inbounds i64, ptr %2, i64 %index
   %wide.load = load <4 x i64>, ptr %9, align 8
   %10 = getelementptr inbounds i64, ptr %9, i64 4
   %wide.load165 = load <4 x i64>, ptr %10, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:18 within `test1`
; ┌ @ essentials.jl:917 within `getindex`
   %11 = getelementptr inbounds i64, ptr %4, i64 %index
   %wide.load166 = load <4 x i64>, ptr %11, align 8
   %12 = getelementptr inbounds i64, ptr %11, i64 4
   %wide.load167 = load <4 x i64>, ptr %12, align 8
; └
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:12 within `min__`
   %13 = call <4 x i64> @llvm.smin.v4i64(<4 x i64> %wide.load, <4 x i64> %wide.load166)
   %14 = call <4 x i64> @llvm.smin.v4i64(<4 x i64> %wide.load165, <4 x i64> %wide.load167)
; └
; ┌ @ array.jl:987 within `setindex!`
   store <4 x i64> %13, ptr %9, align 8
   store <4 x i64> %14, ptr %10, align 8
; └
;  @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:19 within `test1`
; ┌ @ c:\Users\yolha\Desktop\juju_tests\TEST\main.jl:13 within `max__`
   %15 = call <4 x i64> @llvm.smax.v4i64(<4 x i64> %wide.load166, <4 x i64> %wide.load)
   %16 = call <4 x i64> @llvm.smax.v4i64(<4 x i64> %wide.load167, <4 x i64> %wide.load165)
; └
; ┌ @ array.jl:987 within `setindex!`
   store <4 x i64> %15, ptr %11, align 8
   store <4 x i64> %16, ptr %12, align 8

which explains why LV won’t help. if you find a language where this is faster than 0.13s, don’t hesitate to show it

yolhan_mannes · March 11, 2025, 11:00pm

From what I tried, I get the same in frotran with -O3 (simd)

program Testminman
    implicit none
    integer(kind=8), parameter :: n = 10000000
    integer(kind=8) :: A(n*8), B(n*8), res, i ,tmp,j
    double precision :: t1, t2
    i= 1
    do while (i < n*8)
        A(i) = 3
        A(i+1) = 1
        A(i+2) = 4
        A(i+3) = 2
        A(i+4) = 7
        A(i+5) = 8
        A(i+6) = 5
        A(i+7) = 6
        B(i) = 5
        B(i+1) = 6
        B(i+2) = 0
        B(i+3) = 7
        B(i+4) = 2
        B(i+5) = 3
        B(i+6) = 1
        B(i+7) = 4
        i = i + 8
    end do
    do j = 1, 3
        call cpu_time(t1)
        do i = 1, 8*n
            tmp = A(i)
            call min__(res, A(i), B(i))
            A(i) = res
            call max__(res, tmp, B(i))
            B(i) = res
        end do
        call cpu_time(t2)
        print*, t2-t1,sum(A),sum(B)
    end do
end program Testminman

subroutine min__(res, x,  y)
    integer(kind=8), intent(in) :: x,y
    integer(kind=8), intent(out) ::  res
    if ( x<y ) then
        res = x
    else
        res = y
    end if
end subroutine min__

subroutine max__(res, x,  y)
    integer(kind=8), intent(in) :: x,y
    integer(kind=8), intent(out) ::  res
    if ( x>y ) then
        res = x
    else
        res = y
    end if
end subroutine max__

gives

PS C:\Users\yolha\Desktop\juju_tests\TEST> gfortran -O3 main.f90 -o main 
PS C:\Users\yolha\Desktop\juju_tests\TEST> ./main
  0.14062500000000000                 160000000            480000000
  0.15625000000000000                 160000000            480000000
  0.15625000000000000                 160000000            480000000

and without simd

PS C:\Users\yolha\Desktop\juju_tests\TEST> gfortran  main.f90 -o main   
PS C:\Users\yolha\Desktop\juju_tests\TEST> ./main
  0.31250000000000000                 160000000            480000000
  0.34375000000000000                 160000000            480000000
  0.32812500000000000                 160000000            480000000

and with your code

julia .\main.jl

  0.155328 seconds (73.00 k allocations: 3.656 MiB, 11.25% compilation time)
160000000 480000000
  0.132632 seconds
160000000 480000000
  0.158859 seconds
160000000 480000000

JeffreySarnoff · March 11, 2025, 11:03pm

with the understanding that there may be more improvements available …
I found this implementation ran faster than some alternatives.

using SIMD: Vec, vload, vstore, vifelse

NumType = Float32

# supports_AVX512 does not exist, presumably you know
const VecWidth = supports_AVX512 ? 16 : 8 
const VecT = Vec{VecWidth, NumType}

function compare_swap!(C::Vector{NumType}, D::Vector{NumType})
    length(C) === length(D) || throw(ArgumentError("lengths must match"))

    n = length(C)
    i = 1
    @inbounds while i <= n - VecWidth + 1
        # Load VecWidth elements at a time from C and D.
        c_vec = vload(VecWidth, pointer(C, i))
        d_vec = vload(VecWidth, pointer(D, i))

        # Use the SIMD operator (without dot) for elementwise comparison.
        mask = c_vec > d_vec
        # vifelse selects from arg 2 [arg 3] where mask is true [false]
        newC = vifelse(mask, d_vec, c_vec)
        newD = vifelse(mask, c_vec, d_vec)

        # Store the results: min into C, max into D.
        vstore(newC, pointer(C, i))
        vstore(newD, pointer(D, i))
        i += 8
    end

    # Process any remaining elements (fewer than a full SIMD vector).
    @inbounds for j in i:n
        if C[j] > D[j]
            C[j], D[j] = D[j], C[j]
        end
    end
    return nothing
end

# test
using Chairmarks

n = 200
C = rand(Float32, n)
D = rand(Float32, n)

@be compare_swap!($C, $D)

n = 2048
C = rand(Float32, n)
D = rand(Float32, n)

@be compare_swap!($C, $D)

giordano · March 12, 2025, 2:11am

If you don’t use -march=native you’re targeting the generic ISA, which for example for x86_64 isn’t particularly good at SIMD (and I presume you use x86_64 since you appear to be on Windows)

photor · March 12, 2025, 3:26am

You mean in Fortran or in Julia?

yolhan_mannes · March 12, 2025, 6:58am

Oh ok thank you I didn’t know about those, I just found funny to get the exact same performance with gfortran -O3 and with Julia in this case. I still get around the same time with the -march=native here, most of the time is taken in reading the memory anyway.
In julia, the main thing it does looks like this on my machine

        vmovdqu ymm0, ymmword ptr [rax + 8*r9]
        vmovdqu ymm1, ymmword ptr [rcx + 8*r9]
        vpcmpgtq        ymm2, ymm1, ymm0
        vblendvpd       ymm3, ymm1, ymm0, ymm2
        vmovdqu ymm4, ymmword ptr [rax + 8*r9 + 32]
        vmovdqu ymm5, ymmword ptr [rcx + 8*r9 + 32]
        vpcmpgtq        ymm6, ymm5, ymm4
        vblendvpd       ymm7, ymm5, ymm4, ymm6
        vblendvpd       ymm0, ymm0, ymm1, ymm2
        vblendvpd       ymm1, ymm4, ymm5, ymm6
        vmovupd ymmword ptr [rax + 8*r9], ymm3
        vmovupd ymmword ptr [rax + 8*r9 + 32], ymm7
        vmovupd ymmword ptr [rcx + 8*r9], ymm0
        vmovupd ymmword ptr [rcx + 8*r9 + 32], ymm1

I wonder if someone with AVX512 will get better results

DNF · March 12, 2025, 8:37am

JeffreySarnoff:

c_vec = vload(VecWidth, pointer(C, i))
d_vec = vload(VecWidth, pointer(D, i))

# Use the SIMD operator (without dot) for elementwise comparison.
mask = c_vec > d_vec
# vifelse selects from arg 2 [arg 3] where mask is true [false]
newC = vifelse(mask, d_vec, c_vec)
newD = vifelse(mask, c_vec, d_vec)

# Store the results: min into C, max into D.
vstore(newC, pointer(C, i))
vstore(newD, pointer(D, i))

I find the indexing syntax in SIMD.jl very nice and convenient. Then we can do things like

lane = VecRange{8}(1)
@inbounds while i <= n - 8 + 1
    inds = lane + i
    c_vec, d_vec = C[inds], D[inds]
    mask = c_vec > d_vec
    C[inds] = vifelse(mask, d_vec, c_vec)
    D[inds] = vifelse(mask, c_vec, d_vec)
    i += 8
end

If we define

function _vminmax(a, b)
    mask = b > a
    return vifelse(mask, a, b), vifelse(mask, b, a)
end

we can even write

inds = lane + i
(C[inds], D[inds]) = _vminmax(C[inds], D[inds])

gitboy16 · March 12, 2025, 8:49am

@JeffreySarnoff , thank you for the reply and interesting code, however it does not run for me. I get the following error:

julia> @be compare_swap!($C, $D)
ERROR: MethodError: no method matching vload(::Int64, ::Ptr{Float32})
The function `vload` exists, but no method is defined for this combination of argument types.

DNF · March 12, 2025, 9:03am

vload(VecWidth, pointer(C, i))  # VecWidth is 8

should actually be

vload(Vec{8, Float32}, pointer(C, i))

But I suggest trying the indexing syntax; it is, IMO, nicer and more intuitive.

gitboy16 · March 12, 2025, 9:34am

Thank you for the fix. When I run it vs my basic function test1 is slightly slower.

DNF · March 12, 2025, 10:07am

This is much neater (minmax is a function in Base):

function test2!(x, y)
    for i in eachindex(x, y)  # this will automatically check for compatible sizes.
        x[i], y[i] = minmax(x[i], y[i])
    end
end

JeffreySarnoff · March 12, 2025, 10:14am

Thanks, @DNF. <step by step, inch by inch>

foobar_lv2 · March 12, 2025, 10:20am

General recommendation for benchmarking:

In order to approximate real-world performance, your benchmark problems should approximate the true use-case. The thing you’re benchmarking has an implicit “repeat this a couple if times” around it.

This is difficult, because CPUs are complex beasts, and it’s not trivial to know what aspects of your problem affect runtime performance through what part of micro-architectural state.

So first, some humility – when performance numbers surprise you, this may well be a benchmarking artifact due to micro-architectural state that you did not consider. If you have problem parameters that “should not matter”, maybe plot normalized runtime against that. Just to make sure.

Second: Repeating the same vector many times will prime your branch-predictor. Is this a good model for your real-world problem?

If yes, then use the repeat-length in your solution! I.e. solve your problem by computing the first 8 values, and then copy that over to the rest!
If no, then adjust your benchmark!

Third, consider the length of your test-vector and the memory hierarchy. Your test vector has a length of 640MB. So it won’t fit into L3. Is this indicative of your real workload?

If no, then adjust your benchmark! For example, don’t benchmark a single number, instead always plot runtime-per-element against vector length!
If yes, then you should change your surrounding code! Loading and storing the data will take more effort for the CPU than any operations you’re doing. So you need to fuse that.

gitboy16 · March 12, 2025, 10:48am

Yes you are right. Benchmark should be adjusted to maybe have random numbers like Jeffrey’s one. Regarding the size of the data it can be anything from 1 to 10^9 or more. FYI, at the beginning I used a repeat because I wanted to check that my output was right.

giordano · March 12, 2025, 11:35am

If you want to see the generated code for your Fortran program for generic x86-64 vs AVX2 vs AVX512 targets it’s easy to do that on godbolt: Compiler Explorer

Topic		Replies	Views
Replicate @tturbo performance Performance	23	2430	August 23, 2022
Faster `findmin` without LoopVectorization.jl Performance question	35	905	May 14, 2025
Fast short Float32 vector General Usage	9	561	July 7, 2020
Branching in loop decrease perf by ~x5 Performance	6	779	December 13, 2020
A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd` Performance simd	8	1890	June 27, 2021

Min/max swap using SIMD

Related topics