# Show off Julia performance on your PC!

I just built a new PC and, after asking some questions here about performance and getting some interesting answers, I thought it would be fun to start a thread where people can show off their builds and, more importantly (relevantly), how Julia is performing on their hardware. I will (shamelessly) start by flaunting my new build!

Hereās my parts list:

• CPU: AMD Ryzen 9 3950X 3.5 GHz 16-Core Processor

• CPU Cooler: ARCTIC Liquid Freezer II 120 56.3 CFM Liquid CPU Cooler

• Motherboard: Asus ROG Strix X570-I Gaming Mini ITX AM4 Motherboard

• Memory: Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3600 Memory

• Storage: Sabrent Rocket 4.0 1 TB M.2-2280 NVME Solid State Drive

• GPU: EVGA GeForce RTX 2080 SUPER 8 GB BLACK GAMING Video Card

• Case + Power Supply: InWin A1 Plus Mini ITX Tower Case w/ 650 W Power Supply

Itās a small-form-factor PC (mini ITX) so it sits nicely atop my desk without being imposing. The post wouldnāt be complete without pics, of course (yes, Iām a total Julia fanboy and I put that decal on my brand new case and Iām not ashamed one bit ):

I struggled a bit to find some "standard" code for benchmarking, but I decided on the following (from the CUDA.jl docs; I just changed the length of the arrays to make them longer):

```
using BenchmarkTools
using CuArrays

N = 2^20

x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0
y .+= x

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Parallelization on the GPU
x_d = CuArrays.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
y_d .+= x_d

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end
```

With `N = 2^20`, I get the following results:

```
julia> @btime sequential_add!($y, $x)
  72.500 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  37.599 μs (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  70.401 μs (56 allocations: 2.22 KiB)
```

With `N = 2^27`, it looks like this:

```
julia> @btime sequential_add!($y, $x)
  60.721 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  57.512 ms (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  3.754 ms (56 allocations: 2.22 KiB)
```

16 Likes

Donāt know if it is showing off on a scale anyone cares about, but I am currently able to beat (on my PC) professional integrated circuit CAD software costing 6 figures on Linux, having picked up Julia in roughly the November/December timeframe. Works well in software pricing negotiationsā¦

14 Likes

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

The overhead of the new threading system means that with 1.1 I get a faster threaded run than you, despite "only" having 6 cores. Could you try with 1.1 to see how it changes? Hopefully 1.4 should help a little bit with that.

```
julia> @btime sequential_add!($y, $x)
  129.800 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  30.400 μs (1 allocation: 32 bytes)
```

EE here. Can you elaborate on what task you are using Julia for? Thanks!

Neat. Why didnāt you use the fan that came with the CPU? Is the liquid one better, or just more silent?

This is interesting, I am on julia 1.3.1 (Linux, i7-7700 - 8 virtual cores/ 4 physical cores, GeForce GTX 1080).
My runtimes do not improve on the parallel tests case (for N = 2^27). I thought that the benchmark is too small, but surprisingly @jebej sees a significant improvement. I was running top to confirm that julia uses indeed several threads. Maybe this benchmark is bound by the bandwidth to the main memory? I ran the Linux tool `mbw 1000`, and it get 8497.043 MiB/s for memcpy. Maybe you have a higher bandwidth.

```
for i in 1 2 4 6 8; do JULIA_NUM_THREADS=$i julia ~/Test/benchmark.jl; done
sequential_add!  55.356 ms (0 allocations: 0 bytes)
parallel_add!  55.420 ms (9 allocations: 816 bytes)
sequential_add!  55.531 ms (0 allocations: 0 bytes)
parallel_add!  52.917 ms (16 allocations: 1.53 KiB)
sequential_add!  55.555 ms (0 allocations: 0 bytes)
parallel_add!  53.784 ms (30 allocations: 3.02 KiB)
sequential_add!  55.684 ms (0 allocations: 0 bytes)
parallel_add!  54.305 ms (44 allocations: 4.50 KiB)
sequential_add!  55.264 ms (0 allocations: 0 bytes)
parallel_add!  54.934 ms (58 allocations: 5.98 KiB)
```
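A quick back-of-the-envelope check supports the bandwidth hypothesis: each `sequential_add!` pass reads `x[i]` and `y[i]` and writes `y[i]`, so it moves roughly 3 × 4 × N bytes (ignoring write-allocate traffic). A sketch of that arithmetic, using the 55.356 ms timing above:

```
# Rough effective-bandwidth estimate for sequential_add! at N = 2^27.
# Each iteration touches 3 Float32s (read x, read y, write y) = 12 bytes.
N = 2^27
bytes_moved = 3 * sizeof(Float32) * N         # ≈ 1.5 GiB per call
t = 55.356e-3                                 # measured runtime in seconds
bw_gib_s = bytes_moved / t / 2^30
println(round(bw_gib_s, digits = 1), " GiB/s")  # ≈ 27.1 GiB/s
```

That is already in the neighborhood of what dual-channel DDR4 can sustain, so extra threads cannot make the loop much faster. Note that the mbw memcpy figure counts bytes copied, so the actual bus traffic is roughly twice the reported number.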

With the lower CPU
AMD Ryzen 5 3600 6-Core
I get somehow comparable results (without Cuda):

`N = 2^20`:

```
julia> @btime sequential_add!($y, $x)
  79.400 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  86.300 μs (9 allocations: 928 bytes)
```

`N = 2^27`:

```
julia> @btime sequential_add!($y, $x)
  60.593 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.613 ms (9 allocations: 928 bytes)
```

Board, memory, and SSD are quite similar to your hardware (MSI X570…, DDR4 2 x 16 GB, NVMe…).

The only noticeable performance gain is for N = 2^20 parallel (about doubled).
The AMD Ryzen 9 costs >800 €, but the Ryzen 5 only about <200 €.

Price is quadrupled, but performance is only doubled in a single measurement, and roughly even in all the others.

Clearly, we need another performance measurement.
Julia 1.3.1 here, by the way.

1 Like

Thanks for sharing, @oheil. How many threads did you use?

(If the interest is real and there is an actual benefit to knowing these kinds of stats, maybe a good idea would be to have a package that does all of that automatically and then uploads the results somewhere useful. You know, something like Matlab's bench.)
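As far as I know no such package exists yet, but a minimal sketch of what it might collect is below (the function names `machine_info` and `quick_bench` are made up for illustration):

```
# Hypothetical sketch of a Matlab-bench-style helper: gather basic system
# info plus one timing, as NamedTuples that could be serialized and shared.
function machine_info()
    return (cpu     = Sys.cpu_info()[1].model,
            cores   = Sys.CPU_THREADS,
            threads = Threads.nthreads(),
            mem_gib = round(Sys.total_memory() / 2^30, digits = 1),
            julia   = string(VERSION))
end

function quick_bench(; n = 1_000, reps = 5)
    A = rand(n, n)
    B = rand(n, n)
    # Best-of-reps wall time for an n×n matrix multiply, in milliseconds.
    t = minimum(@elapsed(A * B) for _ in 1:reps)
    return (matmul_ms = round(t * 1e3, digits = 2),)
end
```

`merge(machine_info(), quick_bench())` would then give one flat record per machine, ready to upload.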

9 Likes

I didnāt change anything to the code from OP.
So I guess it is a single Core with therefore 2 Threads.

On a very different hardware, a Dell convertible Tablet/Laptop, IntelĀ® Coreā¢ i5-8350U CPU @ 1.70GHz, 4 Physical / 8 Logical cores, Julia 1.3, I have the same āproblemā: minor improvment using 2^20 and slower timing using parallelisation on N=2^27:

(all tests from the REPL)

```
*** no env setting, 2^20:

408.735 μs (0 allocations: 0 bytes)

427.768 μs (9 allocations: 816 bytes)

*** no env setting, 2^27:
83.413 ms (0 allocations: 0 bytes)

80.505 ms (9 allocations: 816 bytes)

396.994 μs (0 allocations: 0 bytes)

332.526 μs (29 allocations: 3.00 KiB)

83.481 ms (0 allocations: 0 bytes)

91.902 ms (30 allocations: 3.02 KiB)

402.733 μs (0 allocations: 0 bytes)

285.414 μs (57 allocations: 5.97 KiB)

79.492 ms (0 allocations: 0 bytes)

89.712 ms (58 allocations: 5.98 KiB)
```

Actually, wrong guess:

```
julia> Threads.nthreads()
1
```

But I guess (again) that this is the right way to compare to the OP.
If the OP changes his number of threads, I will check again.

A bunch of stuff, but here I am referencing physical net extraction and ERCing.

2 Likes

I believe the 3950X does not include a Wraith Prism cooler. There are several shops listing them with it, but these are probably copy-and-paste errors.

1 Like

Are you on 1.2 or 1.3 and are you sure you have JULIA_NUM_THREADS set properly?

@jebej I am on 1.3 and I have `JULIA_NUM_THREADS` set to 16:

```
julia> Threads.nthreads()
16
```

Neat. Why didnāt you use the fan that came with the CPU? Is the liquid one better, or just more silent?

@Tamas_Papp The case is very small, so thermals were a big concern, given that the AMD Ryzen 3000 series CPUs are notorious for running hot. It may not actually be necessary though, as my idle temps are less than 40 °C, and even under heavy load it only gets up to around 70 °C. Nonetheless, it's a simple all-in-one cooler (no liquid to add/maintain), so it's not really any more difficult to install than a regular fan but provides much better cooling.

Regarding these performance results, I think @Elrod is on to something in his reply to a question I posted in another thread. He said:

Your CPU was mostly sitting, waiting for data. For every nanosecond it spent computing, there were 40 doing nothing… For memory-bound operations, memory performance dominates.

It would be nice to come up with another measurement that's less memory-dependent.

1 Like

This works quite well on my machine:

```
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        for j = 1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end

function sequential_add!(y, x)
    for i in eachindex(y, x)
        for j = 1:10
            @inbounds y[i] += log(abs(x[i]))^j
        end
    end
    return nothing
end
```

On my computer with 12 threads it goes from one minute to 5 seconds (N = 10^8); I can even see the CPU working at 1000% in the process monitor. Nice.

1 Like

My understanding is that there were some issues, but BIOS updates fixed most of them. Recently I am seeing reports of 80-85 °C max temperatures under typical loads, with the stock fans.

Of course, I am sure that some special apps can drive this up.

That said, I fully understand why people get liquid cooling, especially if they overclock.

2 Likes

With JULIA_NUM_THREADS set to 12 for my Ryzen 5, I am on par with the OP:

```
julia> Threads.nthreads()
12

julia> N = 2^20; x = fill(1.0f0, N); y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  79.701 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  43.800 μs (85 allocations: 10.25 KiB)

julia> N = 2^27; x = fill(1.0f0, N); y = fill(2.0f0, N);

julia> @btime sequential_add!($y, $x)
  61.027 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  60.390 ms (86 allocations: 10.27 KiB)
```

So it seems clear that we are not measuring the CPU here.

OP, would you like to repeat this with @jonathanBieler's functions? They seem to be more appropriate for comparing the CPUs.

1 Like

No problems here with boxed cooler ( AMD Ryzen 5 3600 with Wraith Stealth).
The recent BIOS update addressed the loading time of the BIOS; it reduced startup time significantly. I wasn't aware of any temperature issues. But I would never overclock.

I would recommend my setup.

1 Like

OP, would you like to repeat this with @jonathanBieler's functions? They seem to be more appropriate for comparing the CPUs.

Whoa

I changed the functions to match @jonathanBieler's and got the following (with `N = 2^27`):

```
julia> function parallel_add!(y, x)
           Threads.@threads for i in eachindex(y, x)
               for j = 1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
parallel_add! (generic function with 1 method)

julia> function sequential_add!(y, x)
           for i in eachindex(y, x)
               for j = 1:10
                   @inbounds y[i] += log(abs(x[i]))^j
               end
           end
           return nothing
       end
sequential_add! (generic function with 1 method)

  10.490 s (0 allocations: 0 bytes)
```

At `N = 10^8` (as @jonathanBieler stated) I get:
```
julia> @btime sequential_add!($y, $x)