Intel client processors using Golden Cove don’t have AVX512, so they only have 16 architectural vector registers instead of 32.
All these cores have many times more physical registers than architectural ones; the extra registers are used for register renaming, which enables deeper out-of-order execution.
Zen5 also doubled the number of physical vector registers compared to Zen4, but both have the same number of architectural registers (both support AVX512). Zen5 has 384 512-bit vector registers, but only 32 named architectural registers. A shame we don’t have zmm(0-63) or so, for heavier microkernels. Probably unnecessary, and it could really start bloating the uop cache and icache…
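To put those numbers in perspective, here is the arithmetic as a tiny sketch. It only uses the figures quoted above (384 physical registers, 32 architectural, 512 bits each); the variable names are just for illustration.

```julia
# Figures quoted above for Zen5's vector register file.
physical_regs = 384        # 512-bit physical vector registers
architectural_regs = 32    # zmm0 through zmm31 under AVX512
bytes_per_reg = 512 ÷ 8    # 64 bytes per register

# Physical registers available per architectural name.
rename_ratio = physical_regs ÷ architectural_regs

# Total size of the vector register file in KiB.
file_size_kib = physical_regs * bytes_per_reg ÷ 1024

println("physical per architectural: ", rename_ratio)
println("vector register file size: ", file_size_kib, " KiB")
```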
More architectural registers help by enabling larger microkernels and, in heavily unrolled code, by avoiding register spills, but only if the code is actually compiled for a target architecture that has that many.
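A rough sketch of why the register count bounds microkernel size. The cost model here is an assumption for illustration, not taken from any particular library: a matmul-style tile of `m` vectors by `n` columns keeps `m*n` accumulators live, plus `m` registers for loaded vectors and 1 for a broadcast value. The helper names `registers_needed`, `fits`, and `fmas_per_load` are hypothetical.

```julia
# Assumed cost model: m*n accumulators + m loaded vectors + 1 broadcast register.
registers_needed(m, n) = m * n + m + 1

# Does an m x n tile fit without spilling, given `nregs` architectural registers?
fits(m, n, nregs) = registers_needed(m, n) <= nregs

# FMAs performed per load (m vector loads + n broadcasts per m*n FMAs).
fmas_per_load(m, n) = m * n / (m + n)

for (m, n) in ((2, 6), (3, 4), (3, 9), (5, 5))
    println(m, "x", n, " tile: ", registers_needed(m, n), " registers, ",
            round(fmas_per_load(m, n); digits = 2), " FMAs per load, ",
            "fits in 16: ", fits(m, n, 16), ", fits in 32: ", fits(m, n, 32))
end
```

Under this model, the tiles that only fit in 32 registers also do more FMAs per load, which is the sense in which more architectural registers allow "heavier" microkernels.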
That said, a new Intel CPU would also want to be recognized as having AVX2, as that is still much better than baseline x86_64. That won’t increase the number of architectural registers, but it makes the vector registers larger, and adds FMA, among other things.
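One way to see what a given machine is recognized as is sketched below. `Sys.CPU_NAME` reports the name picked by the LLVM bundled with Julia; the `cpu_flags` helper is hypothetical and assumes a Linux x86_64 system where `/proc/cpuinfo` has a `flags` line. On other systems it simply returns an empty list.

```julia
# CPU name as detected by Julia's bundled LLVM, e.g. "znver3".
# A CPU that LLVM does not know yet may show up under a more generic name.
println("LLVM CPU name: ", Sys.CPU_NAME)
println("Architecture:  ", Sys.ARCH)

# Hypothetical helper: read the kernel's list of CPU feature flags (Linux x86_64 only).
function cpu_flags(path = "/proc/cpuinfo")
    isfile(path) || return String[]
    for line in eachline(path)
        if startswith(line, "flags")
            # The line looks like "flags : fpu vme ... avx2 ..."
            return split(last(split(line, ':'; limit = 2)))
        end
    end
    return String[]
end

flags = cpu_flags()
if isempty(flags)
    println("no feature flags found (not Linux x86_64?)")
else
    for ext in ("avx2", "fma", "avx512f")
        println(rpad(ext, 8), ext in flags)
    end
end
```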
Also, the APX extension is kind of exciting, increasing the number of architectural integer registers to 32. I’m not aware of any upcoming CPUs that will have it, yet.
As an aside, AArch64 already has 32 architectural integer and vector registers (but those vector registers are 1/4 the size of AVX512’s).
Kind of off topic, but there are some big gains for Zen5 over Zen3 for this benchmark:
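using TriangularSolve, LinearAlgebra, MKL, BenchmarkTools  # BenchmarkTools provides @btime
BLAS.set_num_threads(1)  # keep the BLAS calls single-threaded
BLAS.get_config().loaded_libs  # check which BLAS backend is loaded (MKL here)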
N = 100
A = rand(N,N); B = rand(N,N); C = similar(A);
@btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A), LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A));
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A));
@btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
@btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
@btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
@btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
@btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
@btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
This is on the twoargs branch of TriangularSolve.
Benchmarks on Zen3:
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
21.970 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
62.849 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
23.110 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
63.309 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
24.620 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
67.919 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
25.569 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
57.619 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
64.879 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
274.177 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
65.649 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
472.915 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
22.820 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
211.538 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
23.190 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
217.588 μs (0 allocations: 0 bytes)
vs zen5:
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
7.203 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
35.570 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
7.890 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
36.080 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
8.090 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
32.970 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
8.483 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
31.150 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
16.350 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
120.371 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
16.520 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
226.903 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
7.560 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
122.021 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
7.825 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
118.961 μs (0 allocations: 0 bytes)
TriangularSolve is generally about 3x faster on zen5 for these benchmarks.
Clock speed makes up a big part of that, though (the zen3 system is a server, zen5 desktop).
I’ve also apparently not optimized some of the combinations, which are >2x slower than the others…
But most of the combinations aren’t a priority.
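The ratios behind those statements can be checked with a short sketch. The numbers are the TriangularSolve timings copied from the Zen3 and Zen5 listings above, in the order the benchmarks were run; nothing here is newly measured.

```julia
# TriangularSolve timings in μs, in benchmark order:
# rdiv! Upper, rdiv! Lower, ldiv! Lower, ldiv! Upper,
# then the same four with the adjoint of the copied matrix.
zen3 = [21.970, 23.110, 24.620, 25.569, 64.879, 65.649, 22.820, 23.190]
zen5 = [7.203, 7.890, 8.090, 8.483, 16.350, 16.520, 7.560, 7.825]

# Zen5 speedup over Zen3 for each combination.
speedup = zen3 ./ zen5
println("per-benchmark speedup: ", round.(speedup; digits = 2))
println("min/max speedup: ", round.(extrema(speedup); digits = 2))

# How much slower each Zen5 combination is than the fastest one.
relative = zen5 ./ minimum(zen5)
println("slowdown vs fastest combination: ", round.(relative; digits = 2))
```

The two combinations that stand out as more than 2x slower than the fastest are the `rdiv!` calls on the adjoint.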
EDIT:
For fun, my cascadelake CPU (an Intel CPU with AVX512 and 2 fma units):
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), UpperTriangular($B), Val(false));
12.870 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), UpperTriangular($B));
13.329 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A), LowerTriangular($B), Val(false));
12.874 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A), LowerTriangular($B));
13.933 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A), Val(false));
14.468 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A));
15.597 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A), Val(false));
15.779 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A));
14.961 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', UpperTriangular($B), Val(false));
21.168 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', UpperTriangular($B));
397.563 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.rdiv!(copyto!($C, $A)', LowerTriangular($B), Val(false));
21.118 μs (0 allocations: 0 bytes)
julia> @btime rdiv!(copyto!($C, $A)', LowerTriangular($B));
541.666 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(LowerTriangular($B), copyto!($C, $A)', Val(false));
12.926 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(LowerTriangular($B), copyto!($C, $A)');
283.166 μs (0 allocations: 0 bytes)
julia> @btime TriangularSolve.ldiv!(UpperTriangular($B), copyto!($C, $A)', Val(false));
13.096 μs (0 allocations: 0 bytes)
julia> @btime ldiv!(UpperTriangular($B), copyto!($C, $A)');
280.993 μs (0 allocations: 0 bytes)