Show off Julia performance on your PC!

There is something wrong with your machine :wink:

With N = 10^8 and @jonathanBieler's functions (Ryzen 5 3600):

julia> Threads.nthreads()
12

julia> @btime sequential_add!($y, $x)
  8.551 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  1.540 s (88 allocations: 10.30 KiB)

Or maybe: don’t type during benchmarks?

No, just kidding, but I think we are still not really comparing only the CPUs. I don’t know if that is really possible.


There is something wrong with your machine :wink:

@oheil As long as nobody tells my wife that I could have spent half the money to achieve the same performance, I’m totally okay with this :laughing:

In all seriousness, this is really interesting. Can you post the result of versioninfo() here? Mine is below:

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_EDITOR = "C:\Users\mthel\AppData\Local\Programs\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS = 16

Are you using VS Code/Juno? Would this even matter? I just closed everything else I had running except for my browser (Chrome) and VS Code and then re-ran the functions with N = 10^8 and got this:

julia> function parallel_add!(y, x)
           Threads.@threads for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
parallel_add! (generic function with 1 method)

julia> function sequential_add!(y, x)
           for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
sequential_add! (generic function with 1 method)

julia> @btime sequential_add!($y, $x)
  7.778 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  764.285 ms (118 allocations: 13.73 KiB)

It seems that the results can vary pretty significantly depending on what other tasks are running, and even from trial to trial.
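One way to see that trial-to-trial spread directly, sketched here with plain Base Julia rather than the exact setup above (the kernel and N are simplified so the demo runs quickly): time the same warmed-up function several times and compare the minimum and maximum, since a single @btime minimum hides the variance.

```julia
# Minimal sketch for quantifying run-to-run variance with Base only.
# The kernel below is a simplified stand-in for the add! functions above.
function work!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += log(abs(x[i]))
    end
    return nothing
end

N = 10^6                          # smaller than 10^8 so the demo is quick
x = fill(1.5f0, N); y = fill(2.0f0, N)
work!(y, x)                       # warm up so compilation isn't timed

times = [(@elapsed work!(y, x)) for _ in 1:10]
println("min = ", minimum(times), " s, max = ", maximum(times), " s")
println("spread = ", round(100 * (maximum(times) / minimum(times) - 1); digits=1), " %")
```

On a loaded machine the spread can easily reach tens of percent, which is why BenchmarkTools reports the minimum by default.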

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 5 3600 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 12

Plain Julia REPL, but full load of other applications, browsers, a VSCode with another running REPL and of course Steam! :wink:


With 16 cores, maybe try 32 threads?


Another point, despite the “equal” performance: your rig is much more beautiful!


I noticed you have JULIA_NUM_THREADS set to 12, so I went ahead and bumped mine up to 32 and got this:

julia> N = 10^8
100000000

julia> @btime sequential_add!($y, $x)
  7.860 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  557.116 ms (228 allocations: 27.33 KiB)

I’m feeling good again about my excessive expense on this new PC :sunglasses:


The thread count does scale well: roughly 3x the threads gives 3x the performance.
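As a rough cross-check, a small sketch computing the speedups implied by the 3950X timings quoted earlier in the thread (the 7.778 s and 7.860 s sequential runs against the 16- and 32-thread parallel runs):

```julia
# Back-of-the-envelope speedups from the 3950X timings quoted in this thread.
seq16, par16 = 7.778, 0.764285   # sequential and 16-thread parallel (seconds)
seq32, par32 = 7.860, 0.557116   # sequential and 32-thread parallel (seconds)

speedup16 = seq16 / par16
speedup32 = seq32 / par32
println("16 threads: ", round(speedup16; digits=1), "x")   # ~10.2x
println("32 threads: ", round(speedup32; digits=1), "x")   # ~14.1x
```

So on 16 physical cores the 32-thread run gains a bit from SMT, but it is well short of a 32x speedup; the log-heavy kernel is compute-bound enough to scale nicely up to the core count, less so beyond it.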

I just checked the specs and I think I understand it now:
The Ryzen 9’s base/boost clocks are 3.50/4.70 GHz, while the Ryzen 5’s are 3.60/4.20 GHz, which shouldn’t result in large performance differences. The Ryzen 9’s 0.1 GHz lower base clock is probably evened out by its higher turbo clock, and that depends on how long the turbo is allowed to stay active.

For real-world data I would expect the 64 MB of L3 cache (compared to 32 MB) to make a significant difference.

julia> Threads.nthreads()
12

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 5 3600 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 12

julia> N = 10^8; x = fill(1.0f0, N); y = fill(2.0f0, N);

julia> function parallel_add!(y, x)
           Threads.@threads for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
parallel_add! (generic function with 1 method)

julia> function sequential_add!(y, x)
           for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
sequential_add! (generic function with 1 method)

julia> @btime sequential_add!($y, $x)
  8.483 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  1.506 s (92 allocations: 10.36 KiB)

I closed everything before running again and it had no influence.

I’m feeling good again about my excessive expense on this new PC

You really should, but because of the beauty! :wink: Your wife should have checked this before, not the expense!


I believe the 3950X does not include a Wraith Prism.

This is correct. The CPU did not come with a cooler.

You really should, but because of the beauty! :wink:

The LEDs on the mobo/fans/case are fully addressable so I might completely nerd out and make them cycle between the Julia logo colors…I’ll post more pics if I do


I’m a Windows user myself, but I’ve read that AMD has recommended the Clear Linux distribution as the OS for their Threadripper CPUs (which is an Intel initiative :rofl:). Supposedly benchmarks have shown (significant) performance improvements over Windows 10 and (all) other Linux distros, also for lower-tier processors. I haven’t tested this myself, but if you have some free time and feel adventurous enough, it would be cool to run some Julia benchmarks.

BTW: Nice Machine :sunglasses:


FWIW, I use Clear Linux on my machine with a 10980XE.
I’m not sure how much difference this makes with Julia, given that you aren’t running software from their repositories (and they don’t provide Julia, meaning you’ll either be running the official binaries or building from source*), nor are you running software compiled with their aggressive default CFLAGS environment variables.

But there may still be some random settings that make a difference, like setting transparent huge pages to madvise by default. Foobarlv2 mentioned at least one distro (I don’t recall which) with a different setting.

*If you do try it, and do build from source, it requires adding F_COMPILER=GFORTRAN to OpenBLAS’s flags in deps/blas.mk, because OpenBLAS’s build system mistakes Clear Linux’s gfortran for ifort and thus passes the wrong compiler flags.
I should file an issue with OpenBLAS.

The only other problem I’ve had getting set up with the distro is that their fontconfig is in a different place than some software expects, so you need to set a path to it for VegaLite.jl to find it and let you save plots, for example.
Otherwise, I like it. Simple, up to date, reliable.

Not an aesthetically pleasing setup:

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> @time peakflops(16_000) # I'd already precompiled this function, but forgot to set num threads
  5.294615 seconds (12 allocations: 3.815 GiB, 0.29% gc time)
1.6221124310445762e12

After running add https://github.com/JuliaComputing/MKL.jl in the package manager:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> BLAS.vendor()
:mkl

julia> @time peakflops(16_000)
  4.685193 seconds (3.08 M allocations: 3.955 GiB, 2.17% gc time)
2.1108278728990073e12

julia> @time peakflops(16_000)
  4.116164 seconds (12 allocations: 3.815 GiB, 1.55% gc time)
2.1415206310497896e12

That’s over 2.1 teraflops. I’ve overclocked it to 4.1 GHz all-core AVX512 (all-core SSE and AVX(2) speeds are 4.6 and 4.3 GHz). That means the theoretical peak is

julia> 4.1 * 18 * 16 * 2
2361.6

julia> 4.1 * 18 * 16 * 2 / 1000
2.3615999999999997

2.36 teraflops. The numbers are 4.1 GHz (4.1 billion clock cycles/second) * 18 physical cores * 16 flops per double-precision AVX-512 FMA * 2 FMAs per clock cycle = 2.36 trillion double-precision floating-point operations per second.

Interestingly, MKL may use Strassen on Haswell at large sizes, because on my employer’s HPC, I got more flops from MKL than the above calculation suggested possible. Or maybe I looked up the wrong CPU model for determining specs.

How well these CPUs do in various benchmarks, however, will depend on what the bottlenecks are and how they perform with respect to those bottlenecks. The 3950X has a larger L3 cache than the 10980XE, so the 3950X will perform better in a memory-bandwidth-dominated benchmark that fits in its cache but not in the 10980XE’s; the 10980XE will do better if the problem fits in neither CPU’s cache, because it has more memory channels.
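To illustrate that size dependence, a hedged sketch: run peakflops at a few matrix sizes and watch the reported rate change as the working set grows relative to cache. The sizes here are just illustrative, not tuned to either CPU's cache boundaries.

```julia
using LinearAlgebra

# Illustrative sketch: measured GEMM flop rates at a few sizes. Small
# problems live mostly in cache; larger ones lean more on memory
# bandwidth, so the reported rate typically shifts with size.
rates = Dict{Int,Float64}()
for n in (500, 1_000, 2_000)
    rates[n] = LinearAlgebra.peakflops(n)
    println(n, "x", n, ": ", round(rates[n] / 1e9; digits=1), " GFLOPS")
end
```

On most machines the rate climbs with size at first (amortizing overhead) before flattening or dipping once the matrices outgrow the last-level cache.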


I was wanting to run Linux on this machine, but it wasn’t until after I purchased the motherboard that I realized that Asus doesn’t support Linux for this particular mobo yet (based on their Linux status report for desktop mobos).

Beauty is in the eye of the beholder. I appreciate this kind of build as it reminds me of my cryptocurrency mining rig that I built and ran for a couple of years.


I have never encountered a motherboard having problems with Linux, so I wouldn’t worry too much about whether Asus officially supports it. You can just try it out on a live USB stick, if you are worried about your hardware not working properly.


Interesting! How can software from the repositories enhance the performance? Is it specially curated for the OS? Could a tailored repository provide better performance than the official Julia binaries?

What does really differentiate the performance between operating systems? At least performance from a Julia point of view. Maybe too big of a question but I am very interested.

I would like to give Clear Linux a try, but I have heard it is a pain to get NVIDIA drivers to work, and that is something I unfortunately need.
Do you use the NVIDIA drivers and if yes what was your experience of getting them to work?

BTW, your machine looks like a proper Workstation :slight_smile:


Does MKL make such a difference? I would have expected less! I tried to check it out, but the build failed; I have an Ivy Bridge i7 mobile processor.

julia> InteractiveUtils.versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, ivybridge)

My MATLAB seems to be using MKL.

>> version -blas

ans =

    'Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX
     '

My “performance PC” is smaller than a box of Pop-Tarts! I put together an ASRock DeskMini 310 with an i5-8400 and 32 GB RAM.

Super quiet, low power, but it definitely doesn’t lag behind on anything I need to do. I just piggyback on my university’s HPC if I absolutely need.


So I did:

# Benchmarking Julia
using LinearAlgebra
BLAS.vendor()
BLAS.set_num_threads(Sys.CPU_THREADS >> 1)
@time peakflops(16_000)

On my outdated, 4-year-old i5 Asus media PC with 8 GB RAM, I got (second run, Julia v1.3):

172.585682 seconds (18 allocations: 3.815 GiB, 0.78% gc time)
4.8201604406054054e10

On my new i9 ThinkCentre m920q with 16 GB RAM, I got (Julia v1.4):

 42.414877 seconds (12 allocations: 3.815 GiB, 0.98% gc time)
1.9729402288461346e11

Should I be a little disappointed? Or is this what I could expect?

Update with results from Surface Book 2, with 16 GB RAM:

julia> @time peakflops(16_000)
 70.155445 seconds (12 allocations: 3.815 GiB, 0.47% gc time)
1.1857879366848157e11
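One way to decide whether to be disappointed is to compare a measured number against the theoretical peak, reusing the clock * cores * flops-per-FMA * FMAs-per-cycle formula from earlier in the thread. The spec numbers below are illustrative assumptions (a 6-core AVX2 chip at an assumed all-core clock), not values looked up for these particular machines.

```julia
# Hedged sanity check against the i9 result above. All hardware numbers
# here are assumptions for illustration, not looked-up specs.
ghz, cores = 3.6, 6                  # assumed sustained all-core clock, core count
flops_per_fma, fma_per_cycle = 8, 2  # AVX2 Float64: 4 lanes * 2 flops, 2 FMA ports

theoretical = ghz * 1e9 * cores * flops_per_fma * fma_per_cycle
measured = 1.9729402288461346e11     # the i9 peakflops result above
efficiency = measured / theoretical
println(round(100efficiency; digits=1), " % of the assumed theoretical peak")
```

Reaching somewhere around half to three-quarters of the theoretical peak is typical for a laptop/SFF part that can't sustain its boost clock, so a result in that range is roughly what one could expect.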

Benchmark using:

Make PRs to this repository for the result and I will merge it:


Very cool. So if I understood correctly, this can help track the performance of specific functions as they evolve across versions.

Another cool option would have been to somehow compare between machines, or Julia versions, or the interaction between both. In which case, some default test suite along the lines of:

using AcuteBenchmark
# uses default settings
# tests default functions
# plots results
# submits the results to some central repo for comparing 
benchmark() 

FWIW, on the base model MacbookPro 16 (on battery) I get

julia> peakflops(16_000) # with turbo boost
2.7436851959295053e11

julia> peakflops(16_000) # without turbo boost
1.8706332200540894e11

I’m quite pleased with the result as it beats my university Desktop computer with an i5-7600

julia> peakflops(16_000)
2.0120022253158212e11

(I wouldn’t be worthy of a monster computer like @Elrod’s anyways)

UPDATE:

MacBook Pro 16

julia> peakflops(16_000) # 6 threads
2.7436851959295053e11

julia> peakflops(16_000) # 6 threads (without turbo boost)
1.8706332200540894e11

julia> peakflops(16_000) # 4 threads
2.0224534900761673e11

julia> peakflops(16_000) # 1 thread
5.6771273624088234e10

Desktop i5-7600

julia> peakflops(16_000) # 4 threads
2.0120022253158212e11

julia> peakflops(16_000) # 1 thread
5.8240529799739655e10