Intel client processors using Golden Cove don’t have AVX512, so they don’t have as many architectural registers.
All of these cores have many times more physical registers than architectural ones; the extras are used for register renaming, which allows more out-of-order execution.
Zen5 also doubled the number of vector registers compared to Zen4, but they have the same number of architectural registers (both support AVX512). Zen5 has 384 512-bit vector registers, but only 32 named architectural registers. A shame we don’t have zmm(0-63) or so, for heavier microkernels. Probably unnecessary, and it could really start bloating the uop cache and icache…
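To put the Zen5 figures in perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted above (384 physical and 32 architectural 512-bit registers); the variable names are just for illustration.

```julia
# Figures quoted above for Zen5.
physical, architectural, bits = 384, 32, 512

bytes_per_register = bits ÷ 8                                   # 64 bytes per zmm register
physical_kib = physical * bytes_per_register / 1024             # renamed register file
architectural_kib = architectural * bytes_per_register / 1024   # what code can name

println("physical: ", physical_kib, " KiB, architectural: ", architectural_kib, " KiB")
println("physical / architectural: ", physical ÷ architectural, "x")
```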
More architectural registers help by enabling larger microkernels and, in heavily unrolled code, by avoiding register spills, but code must be compiled to actually target an architecture that has that many.
That said, a new Intel CPU would also want to be recognized as having AVX2, as that is still much better than baseline x86_64. That won’t increase the number of architectural registers, but it makes the vector registers larger, and adds FMA, among other things.
Also, the APX extension is kind of exciting, increasing the number of architectural integer registers to 32. I’m not aware of any upcoming CPUs that will have it, yet.
As an aside, AArch64 already has 32 architectural integer and vector registers (but those vector registers are 1/4 the size of AVX512’s).
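As a rough illustration of why more architectural registers allow larger microkernels, here is a minimal sketch of a register budget. It assumes a simplified outer-product kernel model (an mr × nr tile of accumulators, mr vectors loaded from A, one broadcast from B); `registers_needed`, `fma_per_load`, and `best_tile` are hypothetical helpers, not anything from TriangularSolve or a BLAS library.

```julia
# Hypothetical register budget for an outer-product microkernel: an mr × nr tile
# of vector accumulators, mr vectors loaded from A, and one broadcast value of B.
registers_needed(mr, nr) = mr * nr + mr + 1

# FMAs issued per load in each step of this model (mr vector loads + nr broadcasts).
fma_per_load(mr, nr) = (mr * nr) // (mr + nr)

# Tile shape with the best FMA-to-load ratio that fits in `nregs` registers.
function best_tile(nregs)
    best = (1, 1)
    for mr in 1:nregs, nr in 1:nregs
        registers_needed(mr, nr) <= nregs || continue
        if fma_per_load(mr, nr) > fma_per_load(best...)
            best = (mr, nr)
        end
    end
    return best
end

for nregs in (16, 32)   # e.g. AVX2 vs AVX512 architectural vector registers
    mr, nr = best_tile(nregs)
    println(nregs, " registers: ", mr, "×", nr, " tile, ",
            Float64(fma_per_load(mr, nr)), " FMAs per load")
end
```

Under this model, doubling the named registers lets the kernel do noticeably more arithmetic per load. Real kernels have other constraints (load and FMA throughput, latency hiding), so treat the exact shapes as illustrative only.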
Kind of off topic, but some big gains for Zen5 over Zen3 for this benchmark:
using TriangularSolve, LinearAlgebra, MKL, BenchmarkTools;
BLAS.set_num_threads(1)
BLAS.get_config().loaded_libs
N = 100
A = rand(N,N); B = rand(N,N); C = similar(A);
@btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A));
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A));
@btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
This is on the twoargs branch of TriangularSolve.
Benchmarks on Zen3:
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
21.970 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
62.849 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
23.110 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
63.309 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
24.620 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
67.919 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
25.569 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
57.619 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
64.879 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
274.177 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
65.649 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
472.915 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
22.820 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
211.538 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
23.190 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
217.588 μs (0 allocations: 0 bytes)
vs Zen5:
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
7.203 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
35.570 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
7.890 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
36.080 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
8.090 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
32.970 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
8.483 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
31.150 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
16.350 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
120.371 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
16.520 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
226.903 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
7.560 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
122.021 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
7.825 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
118.961 μs (0 allocations: 0 bytes)
TriangularSolve is generally about 3x faster on Zen5 than on Zen3 for these benchmarks.
Clock speed makes up a big part of that, though (the Zen3 system is a server, the Zen5 a desktop).
I’ve also apparently not optimized some of the combinations (the rdiv! calls on an adjoint), which are >2x slower than the others…
But most of the combinations aren’t a priority.
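For anyone who wants to check the "about 3x" figure, here is a small sketch that computes the per-benchmark ratios from the TriangularSolve timings posted above. The vectors are copied by hand from the Zen3 and Zen5 output, in the order the benchmarks were run; the variable names are just for illustration.

```julia
# TriangularSolve timings in μs, copied from the Zen3 and Zen5 results above.
zen3 = [21.970, 23.110, 24.620, 25.569, 64.879, 65.649, 22.820, 23.190]
zen5 = [7.203, 7.890, 8.090, 8.483, 16.350, 16.520, 7.560, 7.825]

speedup = zen3 ./ zen5                               # per-benchmark Zen3 / Zen5 ratio
geomean = exp(sum(log, speedup) / length(speedup))   # geometric mean of the ratios

println(round.(speedup; digits = 2))
println("geometric mean: ", round(geomean; digits = 2))
```

The ratios land between roughly 2.9x and 4.0x, with the two adjoint rdiv! cases at the high end.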
EDIT:
For fun, my Cascade Lake CPU (an Intel CPU with AVX512 and 2 FMA units):
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
12.870 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
13.329 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
12.874 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
13.933 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
14.468 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
15.597 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
15.779 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
14.961 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
21.168 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
397.563 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
21.118 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
541.666 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
12.926 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
283.166 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
13.096 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
280.993 μs (0 allocations: 0 bytes)