Show off Julia performance on your PC!

I just built a new PC and, after asking some questions here about performance and getting some interesting answers, I thought it would be fun to start a thread where people can show off their builds and, more importantly (relevantly), how Julia is performing on their hardware. I will (shamelessly) start by flaunting my new build! :blush:

Here’s my parts list:

  • CPU: AMD Ryzen 9 3950X 3.5 GHz 16-Core Processor

  • CPU Cooler: ARCTIC Liquid Freezer II 120 56.3 CFM Liquid CPU Cooler

  • Motherboard: Asus ROG Strix X570-I Gaming Mini ITX AM4 Motherboard

  • Memory: Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600 Memory

  • Storage: Sabrent Rocket 4.0 1 TB M.2-2280 NVME Solid State Drive

  • GPU: EVGA GeForce RTX 2080 SUPER 8 GB BLACK GAMING Video Card

  • Case + Power Supply: InWin A1 Plus Mini ITX Tower Case w/ 650 W Power Supply

It’s a small-form-factor PC (mini ITX) so it sits nicely atop my desk without being imposing. The post wouldn’t be complete without pics, of course (yes, I’m a total Julia fanboy and I put that decal on my brand new case and I’m not ashamed one bit :stuck_out_tongue:):

I struggled a bit to find some “standard” code for benchmarking but I decided on the following (from the CUDA.jl docs, I just changed the length of the arrays to make them longer):

using BenchmarkTools
using CuArrays

N = 2^20

x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0
y .+= x   

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Parallelization on the GPU
x_d = CuArrays.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
y_d .+= x_d

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end

With N = 2^20, I get the following results:

julia> @btime sequential_add!($y, $x)
  72.500 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  37.599 μs (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  70.401 μs (56 allocations: 2.22 KiB)

With N = 2^27, it looks like this:

julia> @btime sequential_add!($y, $x)
  60.721 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  57.512 ms (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  3.754 ms (56 allocations: 2.22 KiB)

Go ahead, Julians, discard your modesty and show us what Julia can do on your PC!!!

Don’t know if it is showing off on a scale anyone cares about, but I am currently able to beat (on my PC) professional integrated circuit CAD software costing 6 figures on Linux, having picked up Julia in roughly the November/December timeframe. Works well in software pricing negotiations… :slight_smile:

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

The overhead on the new threading system means that with 1.1 I get a faster threaded run than you, despite “only” having 6 cores. Could you try with 1.1 to see how it changes? Hopefully 1.4 should help a little bit with that.

julia> @btime sequential_add!($y, $x)
  129.800 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  30.400 μs (1 allocation: 32 bytes)

EE here. Can you elaborate on what task you are using Julia for? Thanks!

Neat. Why didn’t you use the fan that came with the CPU? Is the liquid one better, or just more silent?

This is interesting. I am on Julia 1.3.1 (Linux, i7-7700 with 8 virtual cores / 4 physical cores, GeForce GTX 1080).
My runtimes do not improve in the parallel test case (for N = 2^27). I thought that the benchmark was too small, but surprisingly @jebej sees a significant improvement. I ran top to confirm that Julia is indeed using several threads. Maybe this benchmark is bound by the bandwidth to main memory? I ran the Linux tool mbw 1000, and I get 8497.043 MiB/s for memcpy. Maybe you have a higher bandwidth.

for i in 1 2 4 6 8; do JULIA_NUM_THREADS=$i julia ~/Test/benchmark.jl; done
Threads.nthreads() = 1
sequential_add!  55.356 ms (0 allocations: 0 bytes)
parallel_add!  55.420 ms (9 allocations: 816 bytes)
add_broadcast!  6.723 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 2
sequential_add!  55.531 ms (0 allocations: 0 bytes)
parallel_add!  52.917 ms (16 allocations: 1.53 KiB)
add_broadcast!  6.703 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 4
sequential_add!  55.555 ms (0 allocations: 0 bytes)
parallel_add!  53.784 ms (30 allocations: 3.02 KiB)
add_broadcast!  6.669 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 6
sequential_add!  55.684 ms (0 allocations: 0 bytes)
parallel_add!  54.305 ms (44 allocations: 4.50 KiB)
add_broadcast!  6.670 ms (61 allocations: 2.34 KiB)
Threads.nthreads() = 8
sequential_add!  55.264 ms (0 allocations: 0 bytes)
parallel_add!  54.934 ms (58 allocations: 5.98 KiB)
add_broadcast!  6.729 ms (61 allocations: 2.34 KiB)
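
A rough way to test the memory-bandwidth idea against the timings above is to assume that each element of the add costs one read of x[i], one read of y[i] and one write of y[i] (12 bytes for Float32), and then divide the total by the measured time. The helper name effective_bandwidth_gib and the three-transfers-per-element traffic model are my own assumptions, not part of the original benchmark; cache effects and any extra write-allocate traffic are ignored.

```julia
# Sketch: effective memory traffic of `y[i] += x[i]` under a simple model.
# Assumption: every element moves 3 values (read x, read y, write y).
effective_bandwidth_gib(N, seconds; T = Float32) =
    3 * N * sizeof(T) / seconds / 2^30

# Single-threaded timing reported above for N = 2^27 (55.356 ms).
bw = effective_bandwidth_gib(2^27, 55.356e-3)
println(round(bw, digits = 1), " GiB/s")
```

If the resulting figure is already close to what the memory system can sustain, adding threads cannot help much, which would be consistent with the flat parallel_add! timings.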

With a lower-end CPU, an AMD Ryzen 5 3600 6-Core, I get somewhat comparable results (without CUDA):

N = 2^20:

julia> @btime sequential_add!($y, $x)
  79.400 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  86.300 μs (9 allocations: 928 bytes)

N = 2^27:

julia> @btime sequential_add!($y, $x)
  60.593 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.613 ms (9 allocations: 928 bytes)

Board, Memory and SSD are quite similar to your hardware (MSI X570…, DDR4 2x16, NVME…)

The only noticeable performance gain is for the parallel run at N = 2^20 (about doubled).
More than 800 € for the AMD Ryzen 9, but less than 200 € for the Ryzen 5.

The price is quadrupled, but performance only doubles in a single measurement and is the same in all the others.

Clearly, we need another performance measurement :slight_smile:
Julia 1.3.1 here by the way.

Thanks for sharing, @oheil. How many threads did you use?

(If the interest is real and there is an actual benefit to knowing these kinds of stats, maybe a good idea would be to have a package that does all of that automatically and then uploads the results somewhere useful. You know, something like Matlab’s bench.)
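
As a very rough sketch of what such a tool could collect, the snippet below bundles the Julia version, CPU model, thread count and a minimum timing into one record. The names add! and bench_report are hypothetical, it uses plain @elapsed rather than BenchmarkTools to stay dependency-free, and the uploading part is left out entirely.

```julia
# Same element-wise add as in the thread, kept local to this sketch.
function add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Hypothetical helper: gather system info plus the best of a few timings.
function bench_report(N = 2^20; samples = 5)
    x = fill(1.0f0, N)
    y = fill(2.0f0, N)
    add!(y, x)  # warm-up call so compilation is not timed
    times = [@elapsed(add!(y, x)) for _ in 1:samples]
    info = Sys.cpu_info()
    return (julia = string(VERSION),
            cpu = isempty(info) ? "unknown" : info[1].model,
            threads = Threads.nthreads(),
            N = N,
            min_seconds = minimum(times))
end

report = bench_report()
println(report)
```

Recording the thread count next to each timing would also have avoided some of the guessing in this thread.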

I didn’t change anything in the OP’s code.
So I guess it is running on a single core, and therefore with 2 threads.

On very different hardware, a Dell convertible tablet/laptop, Intel® Core™ i5-8350U CPU @ 1.70GHz, 4 physical / 8 logical cores, Julia 1.3, I have the same “problem”: a minor improvement using N = 2^20 and slower timings using parallelisation with N = 2^27:

(all tests from the REPL)

*** no env setting, 2^20:

julia> @btime sequential_add!($y, $x)
  408.735 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  427.768 μs (9 allocations: 816 bytes)

*** no env setting, 2^27:
julia> @btime sequential_add!($y, $x)
  83.413 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  80.505 ms (9 allocations: 816 bytes)


*** JULIA_NUM_THREADS=4; 2^20:

julia> @btime sequential_add!($y, $x)
  396.994 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  332.526 μs (29 allocations: 3.00 KiB)


*** JULIA_NUM_THREADS=4; 2^27:
julia> @btime sequential_add!($y, $x)
  83.481 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  91.902 ms (30 allocations: 3.02 KiB)

*** JULIA_NUM_THREADS=8; 2^20:

julia> @btime sequential_add!($y, $x)
  402.733 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  285.414 μs (57 allocations: 5.97 KiB)

*** JULIA_NUM_THREADS=8; 2^27:

julia> @btime sequential_add!($y, $x)
  79.492 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  89.712 ms (58 allocations: 5.98 KiB)

Actually wrong guess :wink:

julia> Threads.nthreads()
1

But I guess (again) that this is the way to compare to the OP.
If OP changes his number of threads I will check again :wink:
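
Since the thread count turned out to be the hidden variable here, a small check before benchmarking may help. This is only a sketch: JULIA_NUM_THREADS has to be set in the shell before Julia starts (for example JULIA_NUM_THREADS=4 julia), and the warning text is my own wording.

```julia
# Report how many threads this session has versus the logical cores available.
nt = Threads.nthreads()
if nt == 1
    @warn "Julia is running with a single thread; Threads.@threads loops will not run in parallel"
end
println("threads = ", nt, ", logical cores = ", Sys.CPU_THREADS)
```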

A bunch of stuff, but here I am referencing physical net extraction and ERCing.

I believe the 3950X does not include a Wraith Prism. There are several shops listing it with one, but these are probably copy-and-paste errors.

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

@jebej I am on 1.3 and I have JULIA_NUM_THREADS set to 16:

julia> Threads.nthreads()
16

Neat. Why didn’t you use the fan that came with the CPU? Is the liquid one better, or just more silent?

@Tamas_Papp The case is very small so thermals were a big concern given that the AMD Ryzen 3000 series CPUs are notorious for running hot. It may not actually be necessary though as my idle temps are less than 40° C and even under heavy load it’s only getting up to around 70°. Nonetheless, it’s a simple all-in-one cooler (no liquid to add/maintain) so it’s not really any more difficult to install than a regular fan but provides much better cooling.

Regarding these performance results, I think @Elrod is on to something in his reply to a question I posted in another thread. He said:

Your CPU was mostly sitting, waiting for data. For every nanosecond it spent computing, there were 40 doing nothing…For memory bound operations, memory performance dominates.

It would be nice to come up with another measurement that’s less memory dependent.

This works quite well on my machine:

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        for j=1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end

function sequential_add!(y, x)
    for i in eachindex(y, x)
        for j=1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end 

On my computer with 12 threads it goes from one minute to 5 seconds (N=10^8). I can even see the CPU working at 1000% in the process monitor, nice.

My understanding is that there were some issues, but BIOS updates fixed most of these. Recently, I am seeing reports of 80-85 °C max temperatures under typical loads, with the stock fans.

Of course, I am sure that some special apps can drive this up, eg

That said, I fully understand why people get liquid cooling, especially if they overclock.

With JULIA_NUM_THREADS set to 12 for my Ryzen 5 I am on par with OP:

julia> Threads.nthreads()
12

julia> N = 2^20; x = fill(1.0f0, N);y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  79.701 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  43.800 μs (85 allocations: 10.25 KiB)

julia> N = 2^27; x = fill(1.0f0, N);y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  61.027 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.390 ms (86 allocations: 10.27 KiB)

So it seems clear that we are not measuring the CPU here.

OP, would you like to repeat this with @jonathanBieler’s functions? They seem to be more appropriate for comparing the CPUs.

No problems here with the boxed cooler (AMD Ryzen 5 3600 with Wraith Stealth).
The recent BIOS update addressed the loading time of the BIOS and reduced startup time significantly. I wasn’t aware of any temperature issues. But I would never overclock.

I would recommend my setup.

OP, would you like to repeat this with @jonathanBieler’s functions? They seem to be more appropriate for comparing the CPUs.

Whoa :open_mouth: :grin:

I changed the functions to match @jonathanBieler’s and got the following (with N = 2^27):

julia> function parallel_add!(y, x)
           Threads.@threads for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
parallel_add! (generic function with 1 method)

julia> function sequential_add!(y, x)
           for i in eachindex(y, x)
               for j=1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
sequential_add! (generic function with 1 method)

julia> @btime sequential_add!($y, $x)
  10.490 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  926.793 ms (118 allocations: 13.73 KiB)

At N = 10^8 (as @jonathanBieler stated) I get:

julia> @btime sequential_add!($y, $x)
  10.377 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  1.011 s (118 allocations: 13.73 KiB)
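
To put these numbers in perspective, here is the speedup and parallel efficiency worked out for the N = 2^27 run. The helper names speedup and efficiency are just for illustration, and I am assuming the 16 threads reported earlier map onto the 16 physical cores of the 3950X.

```julia
# Speedup: sequential time divided by parallel time.
speedup(t_seq, t_par) = t_seq / t_par

# Parallel efficiency: fraction of the ideal linear speedup actually achieved.
efficiency(t_seq, t_par, nthreads) = speedup(t_seq, t_par) / nthreads

s = speedup(10.490, 0.926793)         # timings in seconds, N = 2^27
e = efficiency(10.490, 0.926793, 16)  # assumption: 16 threads
println("speedup = ", round(s, digits = 1), "x, efficiency = ", round(100 * e, digits = 1), "%")
```

That is roughly an 11x speedup, or about 71% of ideal scaling on 16 threads, which is a very different picture from the memory-bound add, where the threaded version barely moved.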