Building Julia with `march=native`

I decided to build Julia from source and was thinking of setting march=native in the (perhaps futile) hope of getting a speedup. After reading through some of the docs, I decided to use the following Make.user:

ARCH=x86_64
MARCH=native
JULIA_CPU_TARGET=native

OPENBLAS_DYNAMIC_ARCH=0
OPENBLAS_TARGET_ARCH=SKYLAKEX

USE_BINARYBUILDER=0
USE_BINARYBUILDER_OPENBLAS=0
USE_BINARYBUILDER_LIBSUITESPARSE=0
USE_BINARYBUILDER_OPENLIBM=0

Sparse solve seemed to be about 10% faster, which is nice, but overall the performance gains seemed inconclusive at best. Is this a sane Make.user? Are there any configs worth testing? Can I RICE even harder?

PS

The build is for an i9-10980XE.


Personally, I would say it's not worth spending so much time building Julia just to set yourself up for possible future debugging, with only a negligible performance gain.

The build time is pretty OK IMHO.

Note that by default, Julia has multiversioning, so building with march=native isn't expected to give noticeable results.
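For reference, if I remember the sysimg devdocs correctly, the official x86_64 binaries are built with a multi-target spec along these lines, so the shipped sysimage already contains clones specialized for newer microarchitectures:

# roughly the multi-target string from the devdocs -- the exact spec may differ between releases
JULIA_CPU_TARGET=generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)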

Multiversioning is for everything outside the stdlib, right?
Would I get the same config Julia is built with if I leave Make.user blank?

OPENBLAS_TARGET_ARCH=SKYLAKEX

I suggest starting Julia with -C"native,-prefer-256-bit" to allow it to use 512-bit vectors.
By default, LLVM sets the option prefer-256-bit, which you can disable via the leading -.
LoopVectorization.jl tends to be much better at using AVX512 than the default options.
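Concretely, that just means launching Julia like this (the script name is a placeholder):

# -C is short for --cpu-target; the string is passed on to LLVM
julia -C"native,-prefer-256-bit" script.jl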

Also, disabling dynamic arch and setting the target to SKYLAKEX can be a nice trick for CPUs with AVX512 that aren’t yet supported by OpenBLAS, as otherwise OpenBLAS will use a nehalem kernel.
It’s not ideal that OpenBLAS detects specific CPUs rather than their features for dispatch.

But otherwise, the overhead of the runtime dispatch is pretty negligible. That is, if the dispatch is actually working – as it should for your Cascade Lake CPU – I don't think you'll see a performance benefit from hard-coding the target.
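If you want to double-check which kernel OpenBLAS actually picked at runtime, I believe setting OPENBLAS_VERBOSE makes it print the detected core, something like:

# assumption: OPENBLAS_VERBOSE >= 2 prints the selected core, e.g. "Core: SkylakeX" vs. a "Core: Nehalem" fallback
OPENBLAS_VERBOSE=2 julia -e 'using LinearAlgebra; LinearAlgebra.peakflops(2000)'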

But using MKL will probably give you a pretty substantial benefit for most operations.
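E.g. with MKL.jl (assuming Julia >= 1.7, where the BLAS backend is swappable through libblastrampoline):

julia> using Pkg; Pkg.add("MKL")   # one-time install

julia> using MKL                   # forwards the LBT backend to MKL for this session

julia> using LinearAlgebra

julia> BLAS.get_config()           # should now list libmkl_rt rather than libopenblas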


Maybe we can adapt some ideas from Clear Linux:
"Clear Linux OS uses aggressive compiler flags to optimize software builds"

(Phoronix test from 2022-05-19)

openblas

export CFLAGS="$CFLAGS -march=skylake-avx512  -mprefer-vector-width=256 -mtune=skylake-avx512"
export FFLAGS="$FFLAGS -march=skylake-avx512  -mprefer-vector-width=256 -mtune=skylake-avx512"

export CFLAGS="$CFLAGS -flto -ffunction-sections -fno-semantic-interposition -O3 "
export CXXFLAGS="$CXXFLAGS -flto -ffunction-sections -fno-semantic-interposition -O3 "

with special OpenBLAS patches …

Patch1:  0001-Update-lto-related-for-v0.3.7.patch
Patch10: 0001-ported-blas-ht-patch.patch 
Patch11: 0001-ported-blas-ht-patch-2.patch 
#Patch11: 0001-Add-sgemm-direct-code-for-avx2.patch
Patch12: 0001-Remove-AVX2-macro-detection-as-not-supported.patch
Patch13: 0001-Set-OMP-thread-count-to-best-utilize-HT-CPU.patch
Patch14: cmpxchg.patch

LLVM14

patches:

Patch1: llvm-0001-Improve-physical-core-count-detection.patch
Patch2: llvm-0002-Produce-a-normally-versioned-libLLVM.patch
Patch3: llvm-0003-Allow-one-more-FMA-fusion.patch
Patch4: clang-0001-Detect-Clear-Linux-and-apply-Clear-s-default-linker-.patch
Patch5: clang-0002-Make-Clang-default-to-Westmere-on-Clear-Linux.patch
Patch6: clang-0003-Add-the-LLVM-major-version-number-to-the-Gold-LTO-pl.patch
Patch7: clang-0004-Add-a-couple-more-f-instructions-that-GCC-has-that-C.patch
Patch8: clang-0005-Don-t-error-on-ftrivial-auto-var-init-zero.patch
Patch9: clang-soname.patch

LLVM13

patches:

Patch1: llvm-0001-Improve-physical-core-count-detection.patch
Patch2: llvm-0002-Produce-a-normally-versioned-libLLVM.patch
Patch3: llvm-0003-Allow-one-more-FMA-fusion.patch
Patch4: clang-0001-Detect-Clear-Linux-and-apply-Clear-s-default-linker-.patch
Patch5: clang-0002-Make-Clang-default-to-Westmere-on-Clear-Linux.patch
Patch6: clang-0003-Add-the-LLVM-major-version-number-to-the-Gold-LTO-pl.patch
Patch7: clang-0004-Add-a-couple-more-f-instructions-that-GCC-has-that-C.patch

zlib


export CFLAGS="$CFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FCFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export CXXFLAGS="$CXXFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "

libuv

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=auto "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=auto "

gmp

export CFLAGS="-O3  -g -fno-semantic-interposition -march=haswell -ffat-lto-objects  -flto=4 -mno-vzeroupper -march=x86-64-v3 "

mpfr

export CFLAGS="$CFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FCFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export FFLAGS="$FFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "
export CXXFLAGS="$CXXFLAGS -O3 -Ofast -falign-functions=32 -ffat-lto-objects -flto=auto -fno-semantic-interposition -mno-vzeroupper -mprefer-vector-width=256 "

...

pushd ../buildavx2/
export CFLAGS="$CFLAGS -m64 -march=x86-64-v3"
export CXXFLAGS="$CXXFLAGS -m64 -march=x86-64-v3"
export FFLAGS="$FFLAGS -m64 -march=x86-64-v3"
export FCFLAGS="$FCFLAGS -m64 -march=x86-64-v3"
export LDFLAGS="$LDFLAGS -m64 -march=x86-64-v3"

....

pushd ../buildavx512/
export CFLAGS="$CFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export CXXFLAGS="$CXXFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export FFLAGS="$FFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export FCFLAGS="$FCFLAGS -m64 -march=x86-64-v4 -mprefer-vector-width=256"
export LDFLAGS="$LDFLAGS -m64 -march=x86-64-v4"

libssh2

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=auto -fstack-protector-strong -fzero-call-used-regs=used "

curl

export CFLAGS="$CFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -Os -fdata-sections -ffunction-sections -fno-lto -fno-semantic-interposition -fstack-protector-strong -fzero-call-used-regs=used "

p7zip

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export FCFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export FFLAGS="$FFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export CXXFLAGS="$CXXFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong -fzero-call-used-regs=used "
export CFLAGS_GENERATE="$CFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export FCFLAGS_GENERATE="$FCFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export FFLAGS_GENERATE="$FFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export CXXFLAGS_GENERATE="$CXXFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export LDFLAGS_GENERATE="$LDFLAGS -fprofile-generate -fprofile-dir=/var/tmp/pgo -fprofile-update=atomic "
export CFLAGS_USE="$CFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export FCFLAGS_USE="$FCFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export FFLAGS_USE="$FFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export CXXFLAGS_USE="$CXXFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "
export LDFLAGS_USE="$LDFLAGS -fprofile-use -fprofile-dir=/var/tmp/pgo -fprofile-correction "

+ gcc patches for the build.

Patch0   : gcc-stable-branch.patch
Patch1   : 0001-Fix-stack-protection-issues.patch
Patch2   : openmp-vectorize-v2.patch
Patch3   : fortran-vector-v2.patch
Patch5   : optimize.patch
Patch6   : vectorize.patch
Patch9   : gomp-relax.patch
Patch11  : memcpy-avx2.patch
Patch12	 : avx512-when-we-ask-for-it.patch
Patch14  : arch-native-override.patch
Patch15  : 0001-Ignore-Werror-if-GCC_IGNORE_WERROR-environment-varia.patch
Patch16  : 0001-Always-use-z-now-when-linking-with-pie.patch
Patch19  : tune-inline.patch
Patch20  : vectorcost.patch

This looks very extensive, but it seems like a pain to integrate into a Julia build.

It's also interesting that @Elrod recommends "-prefer-256-bit" while Clear Linux seems to do the opposite. I guess it's CPU-specific?

Also, those CFLAGS for curl seem… abused… for my case. I can't read.

I'm probably overly optimistic about the compiler and/or the length of the vectors you may be working with.
But someone on Slack commented recently that LoopVectorization often gave about a 2x speedup on many of the simple loops they were working with, and it seemed this was purely because it used larger vectors.

The problem with LLVM's vectorization is that it unrolls aggressively, and then doesn't vectorize the unroll*vectorization remainder. So using 512-bit vectors with Float64 means it will only vectorize blocks of 32. If your loop is 63 iterations, then with 512-bit vectors it will likely run 1 unrolled and vectorized iteration, followed by 31 scalar iterations.
With 256-bit vectors, it'll run 3 unrolled and vectorized iterations, followed by 15 scalar iterations – much faster.
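In other words, with 8 Float64 per 512-bit vector (4 per 256-bit vector) and an unroll factor of 4:

julia> divrem(63, 8*4)  # 512 bit: (vectorized blocks, scalar remainder)
(1, 31)

julia> divrem(63, 4*4)  # 256 bit
(3, 15)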

LoopVectorization.jl does this better. I was hoping the @vp macro would get LLVM to do something more similar, but not quite.
The code below runs dot products at all lengths 1:1024 in a random order, and benchmarks how long it takes.

julia> @time using LoopVectorization, Random, BenchmarkTools
  0.000189 seconds (470 allocations: 46.227 KiB)

julia> function dot_fast(x,y)
           s = zero(eltype(x))
           for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_fast (generic function with 1 method)

julia> macro vp(expr)
            nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
            if expr.head != :for
                error("Syntax error: loopinfo needs a for loop")
            end
            push!(expr.args[2].args, Expr(:loopinfo, nodes))
            return esc(expr)
       end
@vp (macro with 1 method)

julia> function dot_vp(x,y)
           s = zero(eltype(x))
           @vp for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_vp (generic function with 1 method)

julia> function dot_turbo(x,y)
           s = zero(eltype(x))
           @turbo for i = eachindex(x)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(1024); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  61.602 μs … 101.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.731 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.795 μs ± 547.266 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  61.6 μs       Histogram: log(frequency) by time      63.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.530 μs …  71.679 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.659 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.735 μs ± 549.633 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  41.5 μs       Histogram: log(frequency) by time      43.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.564 μs …  73.508 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.845 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.905 μs ± 547.370 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  31.6 μs         Histogram: frequency by time         33.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Reducing the length to 1:256:

julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.635 μs …  9.801 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.664 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.671 μs ± 65.712 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.63 μs        Histogram: frequency by time        4.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.650 μs … 10.561 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.678 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.688 μs ± 83.408 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  5.65 μs        Histogram: frequency by time        5.94 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.181 μs …  8.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.194 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.199 μs ± 60.401 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.18 μs      Histogram: log(frequency) by time     3.39 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

and now the @vp version is faster than the default, but both are of course still slower than @turbo.
This was with native,-prefer-256-bit.
With the default (native is the default):

julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);

julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.562 μs …  9.660 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.587 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.601 μs ± 78.964 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  8.56 μs      Histogram: log(frequency) by time     9.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.940 μs …  7.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.961 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.969 μs ± 49.064 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.94 μs        Histogram: frequency by time        5.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.740 μs …  9.553 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.751 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.759 μs ± 88.349 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.74 μs      Histogram: log(frequency) by time     3.95 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@turbo doesn’t care, and will use full sized vectors anyway. I’m not quite sure why it got worse performance than before; the assembly is actually exactly the same.

But you can see that the "normal" version of the dot product (dot_fast) was actually faster with 256-bit vectors than with 512-bit vectors!

Using predicates, i.e. @vp or @turbo, made the 512-bit code faster. And in the case of @turbo, it is significantly faster at most sizes over the range from 16 or so up until a few hundred.

But note that @vp with 512 bit vectors was faster than not having @vp with 256 bit.
I do think @vp+512 bit vectors is a better default than not-@vp and 256 bit, unless the expected vector length is very long.

EDIT:
Also, our results should be fairly comparable.

julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores

I tested -C"native,-prefer-256-bit"; it made things slower.

Keep in mind that Julia is a compiler itself: the performance of the code it generates shouldn't in principle depend much (or at all?) on how Julia itself was compiled, since by default it compiles natively for the target system in any case. That said, a native build could improve the performance of the runtime (compilation latency, garbage collector, etc.), but depending on the workload you benchmark this can have varying impact, sometimes negligible, other times more significant.
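An informal way to see this for yourself: even with a generic downloaded binary, @code_native on a freshly compiled function should show instructions selected for your host CPU (e.g. ymm/zmm vector registers on an AVX-capable machine):

julia> f(x) = 2 .* x;

julia> @code_native debuginfo=:none f(rand(8))  # look for ymm/zmm registers in the output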


@ImreSamu: regarding Clear Linux and the Phoronix benchmark, you have to take it with a grain of salt. For example, they have amazing numbers for Zstd, but the reason is that they set the default number of threads to 4, where other distros only use 1; the benchmark is the builtin zstd -b ..., so it's apples and oranges (zstd benchmark misleading · Issue #633 · phoronix-test-suite/phoronix-test-suite · GitHub). Another issue is that Phoronix compares Zstd 1.4 with 1.5, and there are huge performance improvements between those releases, so it's really not about Clear Linux being faster, but about them shipping a more recent version.

But sure, in some cases they get pretty good speedups, though usually around 5-10%, thanks to PGO or LTO.

Getting PGO right is tricky, but I think this is the best type of optimization for large and branchy code bases like LLVM. I've spent some time getting PGO to work in the Spack package manager for software stacks in general, and for Julia + LLVM the results are really good.

Using JULIA_LLVM_ARGS=-time-passes julia -O3 ./script.jl, where script.jl is pretty much using LoopVectorization with a @turbo'd inner product, compilation time drops by 25% for the most expensive LLVM passes:

[official binaries, generic, GCC, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3508 ( 18.3%)   0.3359 ( 19.3%)   1.6867 ( 18.5%)   1.6809 ( 18.5%)  X86 DAG->DAG Instruction Selection
   0.6616 (  9.0%)   0.3140 ( 18.0%)   0.9756 ( 10.7%)   0.9732 ( 10.7%)  X86 Assembly Printer
   0.3632 (  4.9%)   0.0501 (  2.9%)   0.4132 (  4.5%)   0.4126 (  4.5%)  Greedy Register Allocator
   0.3423 (  4.6%)   0.0511 (  2.9%)   0.3934 (  4.3%)   0.3904 (  4.3%)  Combine redundant instructions

[spack, generic, clang 14, no PGO]
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2367 ( 18.8%)   0.1931 ( 19.6%)   1.4299 ( 18.9%)   1.4298 ( 18.9%)  X86 DAG->DAG Instruction Selection
   0.5216 (  7.9%)   0.1528 ( 15.5%)   0.6744 (  8.9%)   0.6744 (  8.9%)  X86 Assembly Printer
   0.3438 (  5.2%)   0.0281 (  2.9%)   0.3719 (  4.9%)   0.3718 (  4.9%)  Greedy Register Allocator
   0.3196 (  4.8%)   0.0333 (  3.4%)   0.3529 (  4.7%)   0.3525 (  4.7%)  Combine redundant instructions
   
[spack, generic, clang 14, PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9713 ( 17.9%)   0.1708 ( 19.8%)   1.1421 ( 18.2%)   1.1421 ( 18.2%)  X86 DAG->DAG Instruction Selection
   0.5047 (  9.3%)   0.1283 ( 14.9%)   0.6329 ( 10.1%)   0.6329 ( 10.1%)  X86 Assembly Printer
   0.3070 (  5.7%)   0.0325 (  3.8%)   0.3395 (  5.4%)   0.3395 (  5.4%)  Greedy Register Allocator
   0.2403 (  4.4%)   0.0340 (  3.9%)   0.2743 (  4.4%)   0.2741 (  4.4%)  Post RA top-down list latency scheduler

Yes, but the stdlib is precompiled without march=native, right? Most of my workload is in the stdlib.

I seem to remember that the sysimage of the official binaries is compiled targeting different levels for exactly this reason, but I'm not entirely sure and I don't even know where to look for this detail.


I think the multiple targets are set here (taking the x86-64 example):
https://github.com/JuliaCI/julia-buildbot/blob/1fcd70dc92f3ca34fae8dbfa2d1c21a5efe499ee/master/inventory.py#L130
