Show off Julia performance on your PC!

I just built a new PC and, after asking some questions here about performance and getting some interesting answers, I thought it would be fun to start a thread where people can show off their builds and, more importantly (relevantly), how Julia is performing on their hardware. I will (shamelessly) start by flaunting my new build! :blush:

Here's my parts list:

  • CPU: AMD Ryzen 9 3950X 3.5 GHz 16-Core Processor

  • CPU Cooler: ARCTIC Liquid Freezer II 120 56.3 CFM Liquid CPU Cooler

  • Motherboard: Asus ROG Strix X570-I Gaming Mini ITX AM4 Motherboard

  • Memory: Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600 Memory

  • Storage: Sabrent Rocket 4.0 1 TB M.2-2280 NVME Solid State Drive

  • GPU: EVGA GeForce RTX 2080 SUPER 8 GB BLACK GAMING Video Card

  • Case + Power Supply: InWin A1 Plus Mini ITX Tower Case w/ 650 W Power Supply

It's a small-form-factor PC (mini ITX), so it sits nicely atop my desk without being imposing. The post wouldn't be complete without pics, of course (yes, I'm a total Julia fanboy and I put that decal on my brand new case and I'm not ashamed one bit :stuck_out_tongue:):

I struggled a bit to find some "standard" code for benchmarking, but I decided on the following (from the CUDA.jl docs; I just increased the length of the arrays):

using BenchmarkTools
using CuArrays

N = 2^20

x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0
y .+= x   

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)  # reset y to 2.0 before the next benchmark

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Parallelization on the GPU
x_d = CuArrays.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
y_d .+= x_d

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end
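For completeness, a quick correctness check one could add between the benchmarks (my addition, not part of the original snippet):

using Test
# run right after the corresponding `.+=` lines above:
@test all(y .== 3.0f0)             # CPU broadcast result
@test all(Array(y_d) .== 3.0f0)    # GPU result, copied back to the host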

With N = 2^20, I get the following results:

julia> @btime sequential_add!($y, $x)
  72.500 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  37.599 μs (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  70.401 μs (56 allocations: 2.22 KiB)

With N = 2^27, it looks like this:

julia> @btime sequential_add!($y, $x)
  60.721 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  57.512 ms (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  3.754 ms (56 allocations: 2.22 KiB)

Go ahead, Julians, discard your modesty and show us what Julia can do on your PC!!!

16 Likes

Don't know if it is showing off on a scale anyone cares about, but I am currently able to beat (on my PC) professional integrated-circuit CAD software costing 6 figures on Linux, having picked up Julia in roughly the November/December timeframe. Works well in software pricing negotiations… :slight_smile:

14 Likes

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

The overhead of the new threading system means that with 1.1 I get a faster threaded run than you, despite "only" having 6 cores. Could you try with 1.1 to see how it changes? Hopefully 1.4 should help a little bit with that.

julia> @btime sequential_add!($y, $x)
  129.800 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  30.400 μs (1 allocation: 32 bytes)
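(For anyone checking their own setup: the environment variable has to be set before Julia starts, since Julia 1.3 has no command-line flag for it. A minimal check:)

# set in the shell before launching Julia, e.g. JULIA_NUM_THREADS=6 julia
# then verify inside the session:
Threads.nthreads()   # reports how many threads this Julia session actually has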

EE here. Can you elaborate on what task you are using Julia for? Thanks!

Neat. Why didn't you use the fan that came with the CPU? Is the liquid one better, or just more silent?

This is interesting. I am on Julia 1.3.1 (Linux, i7-7700 with 8 virtual / 4 physical cores, GeForce GTX 1080).
My runtimes do not improve in the parallel test case (for N = 2^27). I thought the benchmark might be too small, but surprisingly @jebej sees a significant improvement. I was running top to confirm that Julia indeed uses several threads. Maybe this benchmark is bound by the bandwidth to the main memory? I ran the Linux tool mbw 1000, and it reports 8497.043 MiB/s for memcpy. Maybe you have a higher bandwidth.

for i in 1 2 4 6 8; do JULIA_NUM_THREADS=$i julia ~/Test/benchmark.jl; done
Threads.nthreads() = 1
sequential_add!  55.356 ms (0 allocations: 0 bytes)
parallel_add!  55.420 ms (9 allocations: 816 bytes)
add_broadcast!  6.723 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 2
sequential_add!  55.531 ms (0 allocations: 0 bytes)
parallel_add!  52.917 ms (16 allocations: 1.53 KiB)
add_broadcast!  6.703 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 4
sequential_add!  55.555 ms (0 allocations: 0 bytes)
parallel_add!  53.784 ms (30 allocations: 3.02 KiB)
add_broadcast!  6.669 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 6
sequential_add!  55.684 ms (0 allocations: 0 bytes)
parallel_add!  54.305 ms (44 allocations: 4.50 KiB)
add_broadcast!  6.670 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 8
sequential_add!  55.264 ms (0 allocations: 0 bytes)
parallel_add!  54.934 ms (58 allocations: 5.98 KiB)
add_broadcast!  6.729 ms (61 allocations: 2.34 KiB)
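As a rough sanity check of the bandwidth hypothesis, using the timings above (only a back-of-the-envelope estimate):

# y .+= x with Float32 reads x, reads y, and writes y back:
N = 2^27
traffic = 3 * sizeof(Float32) * N      # ≈ 1.6 GB of memory traffic per call
traffic / 55e-3 / 1e9                  # ≈ 29 GB/s at the ~55 ms measured above

If the loop is already moving data at close to what the memory system can deliver, adding threads cannot make it faster, which would explain the flat timings above.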

With the lower-end CPU (AMD Ryzen 5 3600 6-Core) I get somewhat comparable results (without CUDA):

N = 2^20:

julia> @btime sequential_add!($y, $x)
  79.400 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  86.300 μs (9 allocations: 928 bytes)

N = 2^27:

julia> @btime sequential_add!($y, $x)
  60.593 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.613 ms (9 allocations: 928 bytes)

Board, memory, and SSD are quite similar to your hardware (MSI X570…, DDR4 2x16, NVMe…).

The only noticeable performance gain is for N = 2^20 parallel (about doubled).
More than €800 for the AMD Ryzen 9, but less than about €200 for the Ryzen 5.

The price is quadrupled, but performance is only doubled in a single measurement and about even in all the others.

Clearly, we need another performance measurement :slight_smile:
Julia 1.3.1 here by the way.

1 Like

Thanks for sharing, @oheil. How many threads did you use?

(If the interest is real and there is an actual benefit to knowing these kinds of stats, maybe a good idea would be to have a package that does all of this automatically and then uploads the results somewhere useful. You know, something like Matlab's bench.)
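To make the idea concrete, a minimal sketch of what such a helper could look like (the function name and output format here are made up; nothing like this exists yet):

using BenchmarkTools, InteractiveUtils

function julia_bench()
    versioninfo()                      # Julia version, OS, CPU, thread count
    println("Total RAM: ", round(Sys.total_memory() / 2^30, digits = 1), " GiB")
    N = 2^20
    x = fill(1.0f0, N); y = fill(2.0f0, N)
    t = @belapsed $y .+= $x            # minimum time in seconds
    println("broadcast add, N = 2^20: ", round(t * 1e6, digits = 1), " μs")
end

The results could then be printed (or uploaded) in a copy-pasteable form for threads like this one.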

9 Likes

I didn't change anything in the code from the OP.
So I guess it was a single core and therefore 2 threads.

On very different hardware, a Dell convertible tablet/laptop (Intel(R) Core™ i5-8350U CPU @ 1.70GHz, 4 physical / 8 logical cores, Julia 1.3), I have the same "problem": a minor improvement using 2^20 and slower timings using parallelisation with N = 2^27:

(all tests from the REPL)

*** no env setting, 2^20:

julia> @btime sequential_add!($y, $x)
  408.735 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  427.768 μs (9 allocations: 816 bytes)

*** no env setting, 2^27:
julia> @btime sequential_add!($y, $x)
  83.413 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  80.505 ms (9 allocations: 816 bytes)


*** JULIA_NUM_THREADS=4; 2^20:

julia> @btime sequential_add!($y, $x)
  396.994 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  332.526 μs (29 allocations: 3.00 KiB)


*** JULIA_NUM_THREADS=4; julia 2^27:
julia> @btime sequential_add!($y, $x)
  83.481 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  91.902 ms (30 allocations: 3.02 KiB)

*** JULIA_NUM_THREADS=8; 2^20:

julia> @btime sequential_add!($y, $x)
  402.733 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  285.414 μs (57 allocations: 5.97 KiB)

*** JULIA_NUM_THREADS=8; 2^27:

julia> @btime sequential_add!($y, $x)
  79.492 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  89.712 ms (58 allocations: 5.98 KiB)

Actually wrong guess :wink:

julia> Threads.nthreads()
1

But I guess (again) that this is the right way to compare with the OP.
If the OP changes his number of threads I will check again :wink:

A bunch of stuff, but here I am referencing physical net extraction and ERCing.

2 Likes

I believe the 3950X does not include a Wraith Prism. There are several shops listing them with one, but these are probably copy-and-paste errors.

1 Like

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

@jebej I am on 1.3 and I have JULIA_NUM_THREADS set to 16:

julia> Threads.nthreads()
16

Neat. Why didn't you use the fan that came with the CPU? Is the liquid one better, or just more silent?

@Tamas_Papp The case is very small, so thermals were a big concern given that the AMD Ryzen 3000 series CPUs are notorious for running hot. It may not actually be necessary, though, as my idle temps are less than 40 °C and even under heavy load it's only getting up to around 70 °C. Nonetheless, it's a simple all-in-one cooler (no liquid to add/maintain), so it's not really any more difficult to install than a regular fan but provides much better cooling.

Regarding these performance results, I think @Elrod is on to something in his reply to a question I posted in another thread. He said:

Your CPU was mostly sitting, waiting for data. For every nanosecond it spent computing, there were 40 doing nothing… For memory bound operations, memory performance dominates.

It would be nice to come up with another measurement that's less memory-dependent.
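One ready-made, more compute-bound number that is easy to compare across machines is the BLAS matrix-multiply benchmark in the standard library (just a suggestion; it measures BLAS throughput rather than plain Julia loops):

using LinearAlgebra
LinearAlgebra.peakflops(4096)   # peak flop rate (flops/s) estimated from a 4096×4096 matrix multiply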

1 Like

This works quite well on my machine:

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        for j=1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end

function sequential_add!(y, x)
    for i in eachindex(y, x)
        for j=1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end 

On my computer with 12 threads it goes from one minute to 5 seconds (N = 10^8); I can even see the CPU working at 1000% in the process monitor, nice.
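In case anyone wants to reproduce this: the arrays used are not shown in the post, so here is one possible setup (an assumption on my part):

using BenchmarkTools

N = 10^8
x = rand(Float32, N)    # exact values don't matter much; log(abs(x)) handles negative inputs
y = zeros(Float32, N)

@btime sequential_add!($y, $x)
@btime parallel_add!($y, $x)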

1 Like

My understanding is that there were some issues, but BIOS updates fixed most of these. Recently, I have been seeing reports of 80-85 °C max temperatures under typical loads, with the stock fans.

Of course, I am sure that some special apps can drive this up, e.g.

That said, I fully understand why people get liquid cooling, especially if they overclock.

2 Likes

With JULIA_NUM_THREADS set to 12 for my Ryzen 5, I am on par with the OP:

julia> Threads.nthreads()
12

julia> N = 2^20; x = fill(1.0f0, N);y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  79.701 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  43.800 μs (85 allocations: 10.25 KiB)

julia> N = 2^27; x = fill(1.0f0, N);y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  61.027 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.390 ms (86 allocations: 10.27 KiB)

So it seems clear that we are not measuring the CPU here.

OP, would you like to repeat with @jonathanBieler's functions? They seem more appropriate for comparing the CPUs.

1 Like

No problems here with the boxed cooler (AMD Ryzen 5 3600 with Wraith Stealth).
The recent BIOS update addressed the loading time of the BIOS; it reduced startup time significantly. I wasn't aware of any temperature issues. But I would never overclock.

I would recommend my setup.

1 Like

OP, would you like to repeat with @jonathanBieler's functions? They seem more appropriate for comparing the CPUs.

Whoa :open_mouth: :grin:

I changed the functions to match @jonathanBieler's and got the following (with N = 2^27):

julia> function parallel_add!(y, x)
           Threads.@threads for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
parallel_add! (generic function with 1 method)

julia> function sequential_add!(y, x)
           for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
sequential_add! (generic function with 1 method)

julia> @btime sequential_add!($y, $x)
  10.490 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  926.793 ms (118 allocations: 13.73 KiB)

At N = 10^8 (as @jonathanBieler stated) I get:

julia> @btime sequential_add!($y, $x)
  10.377 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  1.011 s (118 allocations: 13.73 KiB)