Building Julia with `march=native`

I decided to build Julia from source and was thinking of setting march=native in the (perhaps futile) hope of getting a speedup. After reading through some of the docs, I decided to use the following Make.user:

ARCH=x86_64
MARCH=native
JULIA_CPU_TARGET=native

OPENBLAS_DYNAMIC_ARCH=0
OPENBLAS_TARGET_ARCH=SKYLAKEX

USE_BINARYBUILDER=0
USE_BINARYBUILDER_OPENBLAS=0
USE_BINARYBUILDER_LIBSUITESPARSE=0
USE_BINARYBUILDER_OPENLIBM=0

Sparse solve seemed to be about 10% faster, which is nice, but overall the performance gains seemed inconclusive at best. Is this a sane Make.user? Are there any configs worth testing? Can I RICE even harder?

PS

The build is for an i9-10980XE.


Personally, I would say it's not worth spending so much time building Julia just to set yourself up for possible future debugging, with only a negligible performance gain.

The build time is pretty OK IMHO.

Note that by default, Julia has multiversioning, so building with march=native isn't expected to give noticeable results.
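For reference, if I remember the sysimg devdocs correctly, the official x86_64 binaries are built with a multi-target spec along these lines, so the shipped sysimage already contains clones specialized for newer microarchitectures:

# roughly the multi-target string from the devdocs -- the exact spec may differ between releases
JULIA_CPU_TARGET=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)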

Multiversioning is for everything outside the stdlib, right?
Would I get the same config Julia is built with if I leave Make.user blank?

OPENBLAS_TARGET_ARCH=SKYLAKEX

I suggest starting Julia with -C"native,-prefer-256-bit" to allow it to use 512-bit vectors.
By default, LLVM sets the option prefer-256-bit, which you can disable via the leading -.
LoopVectorization.jl tends to be much better at using AVX512 than the default options.
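Concretely, that just means launching Julia like this (the script name is a placeholder):

# -C is short for --cpu-target; the string is passed on to LLVM
julia -C"native,-prefer-256-bit" script.jl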

Also, disabling dynamic arch and setting the target to SKYLAKEX can be a nice trick for CPUs with AVX512 that aren’t yet supported by OpenBLAS, as otherwise OpenBLAS will use a nehalem kernel.
It’s not ideal that OpenBLAS detects specific CPUs rather than their features for dispatch.

But otherwise, the overhead of the runtime dispatch is pretty negligible. That is, if the dispatch is actually working – as it should for your Cascade Lake CPU – I don't think you'll see a performance benefit from hard-coding the target.
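If you want to double-check which kernel OpenBLAS actually picked at runtime, I believe setting OPENBLAS_VERBOSE makes it print the detected core, something like:

# assumption: OPENBLAS_VERBOSE >= 2 prints the selected core, e.g. "Core: SkylakeX" vs. a "Core: Nehalem" fallback
OPENBLAS_VERBOSE=2 julia -e 'using LinearAlgebra; LinearAlgebra.peakflops(2000)'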

But using MKL will probably give you a pretty substantial benefit for most operations.
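E.g. with MKL.jl (assuming Julia >= 1.7, where the BLAS backend is swappable through libblastrampoline):

julia> using Pkg; Pkg.add("MKL")   # one-time install

julia> using MKL                   # forwards the LBT backend to MKL for this session

julia> using LinearAlgebra

julia> BLAS.get_config()           # should now list libmkl_rt rather than libopenblas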


Maybe we can adapt some ideas from Clear Linux:
"Clear Linux OS uses aggressive compiler flags to optimize software builds"

(Phoronix test from 2022-05-19)

openblas

export CFLAGS="$CFLAGS -march=skylake-avx512  -mprefer-vector-width=256 -mtune=skylake-avx512"
export FFLAGS="$FFLAGS -march=skylake-avx512  -mprefer-vector-width=256 -mtune=skylake-avx512"

export CFLAGS="$CFLAGS -flto -ffunction-sections -fno-semantic-interposition -O3 "
export CXXFLAGS="$CXXFLAGS -flto -ffunction-sections -fno-semantic-interposition -O3 "

with special OpenBLAS patches …

Patch1:  0001-Update-lto-related-for-v0.3.7.patch
Patch10: 0001-ported-blas-ht-patch.patch 
Patch11: 0001-ported-blas-ht-patch-2.patch 
#Patch11: 0001-Add-sgemm-direct-code-for-avx2.patch
Patch12: 0001-Remove-AVX2-macro-detection-as-not-supported.patch
Patch13: 0001-Set-OMP-thread-count-to-best-utilize-HT-CPU.patch
Patch14: cmpxchg.patch

LLVM14

patches:

Patch1: llvm-0001-Improve-physical-core-count-detection.patch
Patch2: llvm-0002-Produce-a-normally-versioned-libLLVM.patch
Patch3: llvm-0003-Allow-one-more-FMA-fusion.patch
Patch4: clang-0001-Detect-Clear-Linux-and-apply-Clear-s-default-linker-.patch
Patch5: clang-0002-Make-Clang-default-to-Westmere-on-Clear-Linux.patch
Patch6: clang-0003-Add-the-LLVM-major-version-number-to-the-Gold-LTO-pl.patch
Patch7: clang-0004-Add-a-couple-more-f-instructions-that-GCC-has-that-C.patch
Patch8: clang-0005-Don-t-error-on-ftrivial-auto-var-init-zero.patch
Patch9: clang-soname.patch

LLVM13

patches:

Patch1: llvm-0001-Improve-physical-core-count-detection.patch
Patch2: llvm-0002-Produce-a-normally-versioned-libLLVM.patch
Patch3: llvm-0003-Allow-one-more-FMA-fusion.patch
Patch4: clang-0001-Detect-Clear-Linux-and-apply-Clear-s-default-linker-.patch
Patch5: clang-0002-Make-Clang-default-to-Westmere-on-Clear-Linux.patch
Patch6: clang-0003-Add-the-LLVM-major-version-number-to-the-Gold-LTO-pl.patch
Patch7: clang-0004-Add-a-couple-more-f-instructions-that-GCC-has-that-C.patch

zlib


export CFLAGS="$CFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FCFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export CXXFLAGS="$CXXFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "

libuv

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=auto "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=auto "

gmp

export CFLAGS="-O3  -g -fno-semantic-interposition -march=haswell -ffat-lto-objects  -flto=4 -mno-vzeroupper -march=x86-64-v3 "

mpfr

export CFLAGS="$CFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FCFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export CXXFLAGS="$CXXFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "

...

pushd ../buildavx2/
export CFLAGS="$CFLAGS -m64 -march=x86-64-v3"
export CXXFLAGS="$CXXFLAGS -m64 -march=x86-64-v3"
export FFLAGS="$FFLAGS -m64 -march=x86-64-v3"
export FCFLAGS="$FCFLAGS -m64 -march=x86-64-v3"
export LDFLAGS="$LDFLAGS -m64 -march=x86-64-v3"

....

pushd ../buildavx512/
export CFLAGS="$CFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export CXXFLAGS="$CXXFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export FFLAGS="$FFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export FCFLAGS="$FCFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export LDFLAGS="$LDFLAGS -m64 -march=x86-64-v4"

libssh2

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "

curl

export CFLAGS="$CFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "

p7zip

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export CFLAGS_GENERATE="$CFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export FCFLAGS_GENERATE="$FCFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export FFLAGS_GENERATE="$FFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export CXXFLAGS_GENERATE="$CXXFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export LDFLAGS_GENERATE="$LDFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export CFLAGS_USE="$CFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export FCFLAGS_USE="$FCFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export FFLAGS_USE="$FFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export CXXFLAGS_USE="$CXXFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export LDFLAGS_USE="$LDFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "

+ gcc patches for the build.

Patch0   : gcc-stable-branch.patch
Patch1   : 0001-Fix-stack-protection-issues.patch
Patch2   : openmp-vectorize-v2.patch
Patch3   : fortran-vector-v2.patch
Patch5   : optimize.patch
Patch6   : vectorize.patch
Patch9   : gomp-relax.patch
Patch11  : memcpy-avx2.patch
Patch12	 : avx512-when-we-ask-for-it.patch
Patch14  : arch-native-override.patch
Patch15  : 0001-Ignore-Werror-if-GCC_IGNORE_WERROR-environment-varia.patch
Patch16  : 0001-Always-use-z-now-when-linking-with-pie.patch
Patch19  : tune-inline.patch
Patch20  : vectorcost.patch

This looks very extensive, but it seems like a pain to integrate into a Julia build.

It's also interesting that @Elrod recommends "-prefer-256-bit" while Clear Linux seems to do the opposite. I guess it's CPU-specific?

Also, those CFLAGS for curl seem… abused… for my case. I can't read.

I'm probably overly optimistic about the compiler and/or the length of the vectors you may be working with.
But someone on Slack commented recently that LoopVectorization often gave about a 2x speedup on many of the simple loops they were working with, and it seemed this was purely because it used larger vectors.

The problem with LLVM's vectorization is that it unrolls aggressively, and then doesn't vectorize the unroll*vectorization remainder. So using 512-bit vectors with Float64 means it will only vectorize blocks of 32. If your loop is 63 iterations, then with 512-bit vectors it will likely run 1 unrolled and vectorized iteration, followed by 31 scalar iterations.
With 256-bit vectors, it'll run 3 unrolled and vectorized iterations, followed by 15 scalar iterations – much faster.
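In other words, with 8 Float64 per 512-bit vector (4 per 256-bit vector) and an unroll factor of 4:

julia> divrem(63, 8*4)  # 512 bit: (vectorized blocks, scalar remainder)
(1, 31)

julia> divrem(63, 4*4)  # 256 bit
(3, 15)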

LoopVectorization.jl does this better. I was hoping the @vp macro would get LLVM to do something more similar, but not quite.
The code below runs dot products at all lengths 1:1024 in a random order, and benchmarks how long it takes.

julia> @time using LoopVectorization, Random, BenchmarkTools
  0.000189 seconds (470 allocations: 46.227 KiB)

julia> function dot_fast(x,y)
           s = zero(eltype(x))
           for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_fast (generic function with 1 method)

julia> macro vp(expr)
            nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
            if expr.head != :for
                error("Syntax error: loopinfo needs a for loop")
            end
            push!(expr.args[2].args, Expr(:loopinfo, nodes))
            return esc(expr)
       end
@vp (macro with 1 method)

julia> function dot_vp(x,y)
           s = zero(eltype(x))
           @vp for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_vp (generic function with 1 method)

julia> function dot_turbo(x,y)
           s = zero(eltype(x))
           @turbo for i = eachindex(x)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(1024); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  61.602 μs … 101.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.731 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.795 μs ± 547.266 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  61.6 μs       Histogram: log(frequency) by time      63.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.530 μs …  71.679 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.659 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.735 μs ± 549.633 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  41.5 μs       Histogram: log(frequency) by time      43.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.564 μs …  73.508 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.845 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.905 μs ± 547.370 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  31.6 μs         Histogram: frequency by time         33.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Reducing the length to 1:256:

julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.635 μs …  9.801 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.664 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.671 μs ± 65.712 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.63 μs        Histogram: frequency by time        4.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.650 μs … 10.561 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.678 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.688 μs ± 83.408 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  5.65 μs        Histogram: frequency by time        5.94 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.181 μs …  8.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.194 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.199 μs ± 60.401 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.18 μs      Histogram: log(frequency) by time     3.39 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

and now the @vp version is faster than the default, but both are of course still slower than @turbo.
This was with native,-prefer-256-bit.
With the default (native is the default):

julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.562 μs …  9.660 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.587 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.601 μs ± 78.964 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  8.56 μs      Histogram: log(frequency) by time     9.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.940 μs …  7.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.961 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.969 μs ± 49.064 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.94 μs        Histogram: frequency by time        5.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.740 μs …  9.553 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.751 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.759 μs ± 88.349 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.74 μs      Histogram: log(frequency) by time     3.95 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@turbo doesn’t care, and will use full sized vectors anyway. I’m not quite sure why it got worse performance than before; the assembly is actually exactly the same.

But you can see that the "normal" version of the dot product (dot_fast) was actually faster with 256-bit vectors than with 512-bit vectors!

Using predicates, i.e. @vp or @turbo, made the 512-bit code faster. And in the case of @turbo, it is significantly faster at most sizes over the range from 16 or so up until a few hundred.

But note that @vp with 512 bit vectors was faster than not having @vp with 256 bit.
I do think @vp+512 bit vectors is a better default than not-@vp and 256 bit, unless the expected vector length is very long.

EDIT:
Also, our results should be fairly comparable.

julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores

I tested -C"native,-prefer-256-bit"; it made things slower.

Keep in mind that Julia is a compiler itself: the performance of the code it generates shouldn't in principle depend much (or at all?) on how Julia itself was compiled, since by default it compiles natively for the target system in any case. That said, a native build could improve the performance of the runtime (compilation latency, garbage collector, etc.), but depending on the workload you benchmark this can have varying impact, sometimes negligible, other times more significant.
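An informal way to see this for yourself: even with a generic downloaded binary, @code_native on a freshly compiled function should show instructions selected for your host CPU (e.g. ymm/zmm vector registers on an AVX-capable machine):

julia> f(x) = 2 .* x;

julia> @code_native debuginfo=:none f(rand(8))  # look for ymm/zmm registers in the output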


@ImreSamu: regarding Clear Linux and the Phoronix benchmark, you have to take it with a grain of salt. For example, they have amazing numbers for Zstd, but the reason is that they set the default number of threads to 4, where other distros only use 1; the benchmark is the builtin zstd -b ..., so it's apples and oranges (zstd benchmark misleading · Issue #633 · phoronix-test-suite/phoronix-test-suite · GitHub). Another issue is that Phoronix compares Zstd 1.4 with 1.5, and there are huge performance improvements between those releases, so it's really not about Clear Linux being faster, but about them shipping a more recent version.

But sure, in some cases they get pretty good speedups, though usually around 5-10%, thanks to PGO or LTO.

Getting PGO right is tricky, but I think this is the best type of optimization for large and branchy code bases like LLVM. I've spent some time getting PGO to work in the Spack package manager for software stacks in general, and for Julia + LLVM the results are really good.

Using JULIA_LLVM_ARGS=-time-passes julia -O3 ./script.jl, where script.jl is pretty much using LoopVectorization with a @turbo'd inner product, compilation time drops by 25% for the most expensive LLVM passes:

[official binaries, generic, GCC, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3508 ( 18.3%)   0.3359 ( 19.3%)   1.6867 ( 18.5%)   1.6809 ( 18.5%)  X86 DAG->DAG Instruction Selection
   0.6616 (  9.0%)   0.3140 ( 18.0%)   0.9756 ( 10.7%)   0.9732 ( 10.7%)  X86 Assembly Printer
   0.3632 (  4.9%)   0.0501 (  2.9%)   0.4132 (  4.5%)   0.4126 (  4.5%)  Greedy Register Allocator
   0.3423 (  4.6%)   0.0511 (  2.9%)   0.3934 (  4.3%)   0.3904 (  4.3%)  Combine redundant instructions

[spack, generic, clang 14, no PGO]
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2367 ( 18.8%)   0.1931 ( 19.6%)   1.4299 ( 18.9%)   1.4298 ( 18.9%)  X86 DAG->DAG Instruction Selection
   0.5216 (  7.9%)   0.1528 ( 15.5%)   0.6744 (  8.9%)   0.6744 (  8.9%)  X86 Assembly Printer
   0.3438 (  5.2%)   0.0281 (  2.9%)   0.3719 (  4.9%)   0.3718 (  4.9%)  Greedy Register Allocator
   0.3196 (  4.8%)   0.0333 (  3.4%)   0.3529 (  4.7%)   0.3525 (  4.7%)  Combine redundant instructions
   
[spack, generic, clang 14, PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9713 ( 17.9%)   0.1708 ( 19.8%)   1.1421 ( 18.2%)   1.1421 ( 18.2%)  X86 DAG->DAG Instruction Selection
   0.5047 (  9.3%)   0.1283 ( 14.9%)   0.6329 ( 10.1%)   0.6329 ( 10.1%)  X86 Assembly Printer
   0.3070 (  5.7%)   0.0325 (  3.8%)   0.3395 (  5.4%)   0.3395 (  5.4%)  Greedy Register Allocator
   0.2403 (  4.4%)   0.0340 (  3.9%)   0.2743 (  4.4%)   0.2741 (  4.4%)  Post RA top-down list latency scheduler

Yes, but the stdlib is precompiled without march=native, right? Most of my workload is in the stdlib.

I seem to remember that the sysimage of the official binaries is compiled targeting different levels for exactly this reason, but I'm not entirely sure and I don't even know where to look for this detail.


I think the multiple targets are set here (taking the x86-64 example):
https://github.com/JuliaCI/julia-buildbot/blob/1fcd70dc92f3ca34fae8dbfa2d1c21a5efe499ee/master/inventory.py#L130
