Does Julia support Zen 4 CPUs?

I have a laptop with a Ryzen 7 7840U CPU, which, according to https://www.amd.com/en/products/processors/laptop/ryzen/7000-series/amd-ryzen-7-7840u.html, has a Zen 4 architecture.

In Julia 1.10.5, the output of

Sys.CPU_NAME

is "znver3", though.

Any idea why?


The LLVM version we're using probably doesn't know about Zen 4 specifically, but even if it treats your CPU as just Zen 3, you'll get pretty good codegen. There might well be some nice AVX-512 improvements for Zen 4 in Julia 1.11. Definitely worth testing.
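
If you want to experiment, a minimal sketch (not from this thread): Sys.CPU_NAME shows what LLVM detects for the host, and the -C/--cpu-target startup flag requests a specific codegen target, which only helps if the bundled LLVM actually knows that name (so znver4 will likely be rejected on 1.10):

$ julia -C znver3        # "-C znver4" needs a Julia whose LLVM knows that target

julia> Sys.CPU_NAME      # host CPU as detected by LLVM, independent of -C
"znver3"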

It is not only about AVX-512. Zen 4 also has a lot more integer and floating point registers: https://www.servethehome.com/wp-content/uploads/2022/09/AMD-Zen-3-to-Zen-4-Comparison.jpg

This already looks better:

ufechner@framework:~$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-rc3 (2024-08-26)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> Sys.CPU_NAME
"znver4"

Does anyone have a benchmark that makes use of the additional features of Zen 4?


Those aren't architectural registers, are they?! (Otherwise I would be very surprised.) If they're only rename registers, I don't think you can do anything differently in code generation; i.e. it's the microarchitecture doing its best to run the same code faster.

FWIW, my zen5 CPU gets this on Julia 1.10:

julia> Sys.CPU_NAME
"generic"

julia> function mysum(x)
           s = zero(eltype(x))
           for xi = x
               @fastmath s += xi
           end
           s
       end
mysum (generic function with 1 method)

julia> @code_native syntax=:intel debuginfo=:none mysum(Float64[])
	.text
	.file	"mysum"
	.globl	julia_mysum_278                 # -- Begin function julia_mysum_278
	.p2align	4, 0x90
	.type	julia_mysum_278,@function
julia_mysum_278:                        # @julia_mysum_278
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	mov	rax, qword ptr [rdi + 8]
	test	rax, rax
	je	.LBB0_1
# %bb.2:                                # %L17
	mov	rcx, qword ptr [rdi]
	vmovq	xmm0, qword ptr [rcx]           # xmm0 = mem[0],zero
	cmp	rax, 1
	je	.LBB0_15
# %bb.3:                                # %L35.preheader
	lea	r8, [rax - 1]
	cmp	r8, 32
	jae	.LBB0_5
# %bb.4:
	mov	edx, 2
	mov	esi, 1
	jmp	.LBB0_13
.LBB0_1:
	vxorps	xmm0, xmm0, xmm0
	pop	rbp
	ret
.LBB0_5:                                # %vector.ph
	mov	rdx, r8
	and	rdx, -32
	vmovq	xmm0, xmm0                      # xmm0 = xmm0[0],zero
	lea	rsi, [rdx - 32]
	mov	r9, rsi
	shr	r9, 5
	inc	r9
	test	rsi, rsi
	je	.LBB0_6
# %bb.7:                                # %vector.ph.new
	mov	rdi, r9
	and	rdi, -2
	vxorpd	xmm1, xmm1, xmm1
	mov	esi, 1
	vxorpd	xmm2, xmm2, xmm2
	vxorpd	xmm3, xmm3, xmm3
	.p2align	4, 0x90
.LBB0_8:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vaddpd	zmm0, zmm0, zmmword ptr [rcx + 8*rsi]
	vaddpd	zmm1, zmm1, zmmword ptr [rcx + 8*rsi + 64]
	vaddpd	zmm2, zmm2, zmmword ptr [rcx + 8*rsi + 128]
	vaddpd	zmm3, zmm3, zmmword ptr [rcx + 8*rsi + 192]
	vaddpd	zmm0, zmm0, zmmword ptr [rcx + 8*rsi + 256]
	vaddpd	zmm1, zmm1, zmmword ptr [rcx + 8*rsi + 320]
	vaddpd	zmm2, zmm2, zmmword ptr [rcx + 8*rsi + 384]
	vaddpd	zmm3, zmm3, zmmword ptr [rcx + 8*rsi + 448]
	add	rsi, 64
	add	rdi, -2
	jne	.LBB0_8
# %bb.9:                                # %middle.block.unr-lcssa
	test	r9b, 1
	je	.LBB0_11
.LBB0_10:                               # %vector.body.epil.preheader
	vaddpd	zmm0, zmm0, zmmword ptr [rcx + 8*rsi]
	vaddpd	zmm1, zmm1, zmmword ptr [rcx + 8*rsi + 64]
	vaddpd	zmm2, zmm2, zmmword ptr [rcx + 8*rsi + 128]
	vaddpd	zmm3, zmm3, zmmword ptr [rcx + 8*rsi + 192]
.LBB0_11:                               # %middle.block
	vaddpd	zmm1, zmm1, zmm3
	vaddpd	zmm0, zmm0, zmm2
	vaddpd	zmm0, zmm0, zmm1
	vextractf64x4	ymm1, zmm0, 1
	vaddpd	zmm0, zmm0, zmm1
	vextractf128	xmm1, ymm0, 1
	vaddpd	xmm0, xmm0, xmm1
	vpermilpd	xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
	vaddsd	xmm0, xmm0, xmm1
	cmp	r8, rdx
	je	.LBB0_15
# %bb.12:
	lea	rsi, [rdx + 1]
	or	rdx, 2
.LBB0_13:                               # %scalar.ph
	sub	rax, rdx
	inc	rax
	.p2align	4, 0x90
.LBB0_14:                               # %L35
                                        # =>This Inner Loop Header: Depth=1
	vaddsd	xmm0, xmm0, qword ptr [rcx + 8*rsi]
	mov	rsi, rdx
	inc	rdx
	dec	rax
	jne	.LBB0_14
.LBB0_15:                               # %L41
	pop	rbp
	vzeroupper
	ret
.LBB0_6:
	vxorpd	xmm1, xmm1, xmm1
	mov	esi, 1
	vxorpd	xmm2, xmm2, xmm2
	vxorpd	xmm3, xmm3, xmm3
	test	r9b, 1
	je	.LBB0_11
	jmp	.LBB0_10
.Lfunc_end0:
	.size	julia_mysum_278, .Lfunc_end0-julia_mysum_278
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

It calls the CPU generic, but does in fact recognize that it has AVX512 (note the zmm registers).
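
One crude way to check that programmatically (just a sketch that greps the assembly text; uses_zmm is a made-up helper, not an official API):

julia> uses_zmm(f, types) = occursin("zmm", sprint(code_native, f, types));

julia> uses_zmm(mysum, (Vector{Float64},))  # true despite CPU_NAME being "generic"
true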


AVX512 provides twice as many architectural floating point registers.
Without AVX512, you have (x/y)mm0-15; with it, you have (x/y/z)mm0-31.


The numbers pointed to were 224 integer registers, now 32 more, so I don't think that applies. There are also 32 more floating point registers, so yes, I'm seemingly wrong on those because of AVX512.

Anyway, I see here that Golden Cove has even more integer and floating point registers, so those are likely just additional rename registers, not architectural ones(?), or maybe there are even more changes I missed:

I haven't read it in full; it seems full of intriguing info.

I also see this mentioned:

Zen 5 is a ground-up redesign of Zen 4 with a wider front-end, increased floating point throughput and more accurate branch prediction

Intel client processors using Golden Cove don’t have AVX512, so they don’t have as many architectural registers.

All these cores have many times more physical registers than architectural ones, used for register renaming / more out-of-order execution.
Zen5 also doubled the number of vector registers compared to Zen4, but they have the same number of architectural registers (both support AVX512). Zen5 has 384 512-bit vector registers, but only 32 named architectural registers. A shame we don't have zmm(0-63) or so, for heavier microkernels. Probably unnecessary, and could really start bloating the uop cache and icache…

More architectural registers help by enabling larger microkernels and, with heavily unrolled code, by avoiding register spills, but they require compiling for an arch that actually has that many.

That said, a new Intel CPU would also want to be recognized as having AVX2, as that is still much better than baseline x86_64. That won’t increase the number of architectural registers, but it makes the vector registers larger, and adds FMA, among other things.
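
As a rough check of the FMA point (again just grepping the generated assembly; fused is a throwaway definition):

julia> fused(a, b, c) = muladd(a, b, c);

julia> occursin("vfmadd", sprint(code_native, fused, (Float64, Float64, Float64)))
true

On a baseline x86_64 target without FMA this would be false, with muladd falling back to a separate multiply and add.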

Also, the APX extension is kind of exciting, increasing the number of architectural integer registers to 32. I’m not aware of any upcoming CPUs that will have it, yet.
As an aside, AArch64 already has 32 architectural integer and vector registers (but those vector registers are 1/4 the size of AVX512’s).

Kind of off topic, but some big gains for Zen5 over Zen3 for this benchmark:

using TriangularSolve, LinearAlgebra, MKL, BenchmarkTools;
BLAS.set_num_threads(1)
BLAS.get_config().loaded_libs
N = 100
A = rand(N,N); B = rand(N,N); C = similar(A);
@btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A));
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A));

@btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A)');

This is on the twoargs branch of TriangularSolve.

Benchmarks on Zen3:

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
  21.970 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
  62.849 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
  23.110 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
  63.309 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
  24.620 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
  67.919 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
  25.569 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
  57.619 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
  64.879 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
  274.177 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
  65.649 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
  472.915 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
  22.820 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
  211.538 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
  23.190 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
  217.588 μs (0 allocations: 0 bytes)

vs zen5:

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
  7.203 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
  35.570 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
  7.890 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
  36.080 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
  8.090 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
  32.970 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
  8.483 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
  31.150 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
  16.350 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
  120.371 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
  16.520 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
  226.903 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
  7.560 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
  122.021 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
  7.825 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
  118.961 μs (0 allocations: 0 bytes)

TriangularSolve is generally about 3x faster on zen5 for these benchmarks.
Clock speed makes up a big part of that, though (the zen3 system is a server, the zen5 a desktop).

I’ve also apparently not optimized some of the combinations, which are >2x slower than the others…
But most of the combinations aren’t a priority.

EDIT:
For fun, my Cascade Lake CPU (an Intel CPU with AVX512 and 2 FMA units):

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
  12.870 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
  13.329 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
  12.874 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
  13.933 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
  14.468 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
  15.597 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
  15.779 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
  14.961 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
  21.168 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
  397.563 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
  21.118 μs (0 allocations: 0 bytes)

julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
  541.666 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
  12.926 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
  283.166 μs (0 allocations: 0 bytes)

julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
  13.096 μs (0 allocations: 0 bytes)

julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
  280.993 μs (0 allocations: 0 bytes)

“Finally this week that AMD Zen 5 (znver5) support has been submitted for review in upstreaming it for LLVM.” ( 11 September 2024 )
