Julia 1.7.0-dev, Intel CPU with 4 physical cores; I will try with 1.6.3. What I still find strange is that in my case the fastest code is the one copied from Matlab.
No big change, except that Tullio allocates less on 1.6.3:
1.586 s (20001 allocations: 943.88 KiB)
1.495 s (30001 allocations: 1.07 MiB)
510.505 ms (5 allocations: 6.59 KiB)
228.873 ms (6 allocations: 6.61 KiB)
217.190 ms (104 allocations: 5.67 KiB)
152.197 ms (19 allocations: 197.54 MiB)
166.631 ms (10 allocations: 188.00 MiB)
Did you re-enable @fastmath? I ask because I recently had the impression it can pessimize code.
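For what it's worth, a minimal way to check that on a given machine is to time the same toy kernel with and without the macro. The kernel below is made up purely for illustration and is not one of the functions from this thread:

using BenchmarkTools

# Hypothetical toy kernel (sum of squares), with and without @fastmath,
# only to see what the macro does on this particular machine.
function sumsq(x)
    s = zero(eltype(x))
    @inbounds for xi in x
        s += xi * xi
    end
    return s
end

function sumsq_fastmath(x)
    s = zero(eltype(x))
    @fastmath @inbounds for xi in x
        s += xi * xi
    end
    return s
end

x = rand(10^6);
@btime sumsq($x);
@btime sumsq_fastmath($x);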
What is going on here? I am getting this:
julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B);
3.808 ms (13 allocations: 9.17 MiB)
I'm copying the complete Julia session here, because I think I'm going crazy:
% julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.3 (2021-09-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using LoopVectorization
julia> function normalEquationsMatlabJulia(CC,Cx,A,B,x)
batchSize = 10000;
stencil = ones(1, size(A,2))
# @fastmath @inbounds @views for n = 1:batchSize:size(A,1)
@views for n = 1:batchSize:size(A,1)
idx = min(n+batchSize-1,size(A,1))
@turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
@turbo CC .= CC.+temp'*temp
@turbo Cx .= Cx.+temp'*x[n:idx]
end
return Cx
end
normalEquationsMatlabJulia (generic function with 1 method)
julia> N = 10000;
julia> R = 20;
julia> M = 40;
julia> A = rand(N,M);
julia> B = rand(N,R);
julia> C = rand(M*R,M*R);
julia> x = rand(N,);
julia> Cx = rand(M*R,);
julia> @btime normalEquationsMatlabJulia($C,$Cx,$x,$A,$B);
3.808 ms (13 allocations: 9.17 MiB)
Is the result correct? @goerch had some @assert macros to check if it is correct (it is when I run it). Also, how does the @btime of that function compare to the others on your machine?
Well, the assertions are not correct, because CxNormal just refers to the same array:
normal!(C,Cx,x,A,B_r);
CxNormal = Cx # CxNormal is just the same array as Cx
normalTurbo!(C,Cx,x,A,B);
@assert Cx ≈ CxNormal # this will always be true
It should be CxNormal = copy(Cx).
Also, since the functions modify Cx, each version is not operating on the same values. The original Cx should be copied (or reset) from call to call; I am not sure how that affects performance, but it may, and it is necessary for the assertions to be meaningful.
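A minimal illustration of the aliasing issue, using a toy vector rather than the benchmark data:

Cx = zeros(3)

alias    = Cx        # just another name bound to the same array
snapshot = copy(Cx)  # an independent copy of the current values

Cx .+= 1.0           # mutate Cx in place

alias == Cx          # true: the alias sees the mutation, so @assert Cx ≈ alias would always pass
snapshot == Cx       # false: the copy kept the old values and can serve as a reference result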
Sorry for the mistake. I tried to correct my post above. As far as I understand, we have three different results:
- normal!, normalTurbo! and normalTturbo! compute the same result:
3.404 s (20004 allocations: 5.81 MiB)
1.184 s (20004 allocations: 5.81 MiB)
1.179 s (20004 allocations: 5.81 MiB)
- normalEquations!, normalEquationsP!, normalEquationsMatlab and normalEquationsMatlabJulia compute the same result:
385.193 ms (8 allocations: 4.90 MiB)
479.380 ms (8 allocations: 4.90 MiB)
182.132 ms (19 allocations: 197.54 MiB)
209.968 ms (13 allocations: 192.88 MiB)
- normalEquationsTullio computes a third result:
99.472 ms (156 allocations: 4.90 MiB)
Sorry, I didn't notice that before I replied. FWIW, I think that annotations far from their effect make such misinterpretations likely.
The temp here is transposed in some codes and not in others. That probably explains most of the differences in the results.
Just added more validations: the problem seems to be more systematic; it also occurs with C. I started a thread about the innards of kron here.
Thank you all for the inputs, and apologies again for the sloppy code. @goerch @lmiq I fixed the source of the different results: the results so far were identical up to a permutation, because I was calling kron(A,B) instead of kron(B,A).
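A tiny illustration of that permutation effect, with made-up toy vectors rather than the actual rows of A and B:

using LinearAlgebra

a = [1.0, 2.0, 3.0]   # stands in for A[n, :]
b = [10.0, 20.0]      # stands in for B[n, :]

kron(a, b)            # [10.0, 20.0, 20.0, 40.0, 30.0, 60.0]
kron(b, a)            # [10.0, 20.0, 30.0, 20.0, 40.0, 60.0]

# Same entries in a different order: swapping the arguments permutes temp,
# so the C and Cx built from one ordering are permutations of those built from the other.
sort(kron(a, b)) == sort(kron(b, a))   # true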
Also, the Matlab copy was missing broadcasting, which is actually needed. Below is the modified code from @goerch with @fastmath enabled (it seems to speed up the code for me):
using LoopVectorization
using Tullio
using BenchmarkTools
function normal!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
kron!(temp,B[n,:],A[n,:])
C .= C.+ temp.*temp'
Cx .= Cx .+ temp.*x[n]
end
return Cx
end
function normalTurbo!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
kron!(temp,B[n,:],A[n,:])
@turbo C .= C.+ temp.*temp'
@turbo Cx .= Cx .+ temp.*x[n]
end
return Cx
end
function normalTturbo!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
kron!(temp,B[n,:],A[n,:])
@tturbo C .= C.+ temp.*temp'
@tturbo Cx .= Cx .+ temp.*x[n]
end
return Cx
end
# swapped arguments
function normalEquations!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
C4x = reshape(Cx, size(B,2),size(A,2))
@fastmath @inbounds @turbo for m = 1:size(B,2)
# @inbounds @turbo for m = 1:size(B,2)
for r = 1:size(A,2)
for mm = 1:size(B,2)
for rr = 1:size(A,2)
for n = 1:size(B,1)
temp = A[n,r] * B[n,m]
C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
C4x[m,r] = C4x[m,r] + temp*x[n]
end
end
end
end
end
return Cx
end
# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
C4x = reshape(Cx, size(B,2),size(A,2))
@fastmath @inbounds @tturbo for m = 1:size(B,2)
# @inbounds @tturbo for m = 1:size(B,2)
for r = 1:size(A,2)
for mm = 1:size(B,2)
for rr = 1:size(A,2)
for n = 1:size(B,1)
temp = A[n,r] * B[n,m]
C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
C4x[m,r] = C4x[m,r] + temp*x[n]
end
end
end
end
end
return Cx
end
# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
Cx4 = reshape(Cx, size(B,2),size(A,2))
@tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
@tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
return Cx
end
function normalEquationsMatlab(CC,Cx,A,B,x)
batchSize = 10000;
@inbounds @fastmath for n = 1:batchSize:size(A,1)
idx = min(n+batchSize-1,size(A,1));
temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:], ones(1, size(A,2)));
CC .= CC.+temp'*temp;
Cx .= Cx.+temp'*x[n:idx];
end
return Cx
end
function normalEquationsMatlabJulia(CC,Cx,A,B,x)
batchSize = 10000;
stencil = ones(1, size(A,2))
@fastmath @inbounds @views for n = 1:batchSize:size(A,1)
# @inbounds @views for n = 1:batchSize:size(A,1)
idx = min(n+batchSize-1,size(A,1))
@turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
@turbo CC .= CC.+temp'*temp
@turbo Cx .= Cx.+temp'*x[n:idx]
end
return Cx
end
N = 10000;
R = 20;
M = 40;
A = rand(N,M);
B = rand(N,R);
x = rand(N,);
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))
normal!(C,Cx,x,A,B_r);
CxNormal = copy(Cx)
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTurbo!(C,Cx,x,A,B);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTturbo!(C,Cx,x,A,B);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquations!(C,Cx,x,B_r,A);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsP!(C,Cx,x,B,A);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsTullio!(C,Cx,x,B,A);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlab(C,Cx,A,B,x);
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlabJulia(C,Cx,A,B,x);
@assert Cx ≈ CxNormal
@btime normal!($C,$Cx,$x,$A,$B_r);
@btime normalTurbo!($C,$Cx,$x,$A,$B);
@btime normalTturbo!($C,$Cx,$x,$A,$B);
@btime normalEquations!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsP!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsTullio!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsMatlab($C,$Cx,$A,$B,$x);
@btime normalEquationsMatlabJulia($C,$Cx,$A,$B,$x);
Now all the asserts pass. In my case now the timings are the following:
6.306 s (20001 allocations: 943.88 KiB) #Normal
1.114 s (20001 allocations: 943.88 KiB) #NormalTurbo
1.039 s (30001 allocations: 1.07 MiB) #NormalTturbo
382.551 ms (4 allocations: 224 bytes) #Loop
146.457 ms (5 allocations: 240 bytes) #ParallelLoop
176.421 ms (104 allocations: 5.67 KiB) #Tullio
115.667 ms (16 allocations: 192.65 MiB) #Matlab
139.217 ms (10 allocations: 188.00 MiB) #MatlabJulia
Now the performance is already much better (about 1.5× for me) thanks to the indexing (thanks @Jeff_Emanuel, @goerch, @baggepinnen). However, it is still slower than the Matlab version written in Julia (normalEquationsMatlab) and than the actual Matlab implementation (identical to normalEquationsMatlab except that it uses repmat instead of repeat), which takes 0.0869 seconds on my computer.
Hi @germinator,
thanks for the updated MWE. We still have to discuss the difference in our results. These are my latest numbers for your consolidated MWE:
5.525 s (20001 allocations: 943.88 KiB)
1.204 s (20001 allocations: 943.88 KiB)
898.374 ms (30001 allocations: 1.07 MiB)
483.086 ms (4 allocations: 224 bytes)
105.213 ms (5 allocations: 240 bytes)
97.093 ms (153 allocations: 8.70 KiB)
185.628 ms (16 allocations: 192.65 MiB)
207.268 ms (10 allocations: 188.00 MiB)
Using the naive
function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
for i = 1:size(A, 1)
@views mul!(C[(i-1)*size(B,1)+1:i*size(B,1)], A[i], B)
end
C
end
where applicable I see
3.458 s (1 allocation: 6.38 KiB)
1.160 s (1 allocation: 6.38 KiB)
911.221 ms (10001 allocations: 162.62 KiB)
480.688 ms (4 allocations: 224 bytes)
106.741 ms (5 allocations: 240 bytes)
102.574 ms (153 allocations: 8.70 KiB)
189.446 ms (16 allocations: 192.65 MiB)
218.533 ms (10 allocations: 188.00 MiB)
Hi @goerch, thanks for your updates. I have the same versions of Tullio and LoopVectorization.
(@v1.6) pkg> status Tullio
Status `C:\Users\LocalAdmin\.julia\environments\v1.6\Project.toml`
[bc48ee85] Tullio v0.3.2
(@v1.6) pkg> status LoopVectorization
Status `C:\Users\LocalAdmin\.julia\environments\v1.6\Project.toml`
[bdcacae8] LoopVectorization v0.12.98
I probably have dynamic frequency scaling on as well, although in the power settings the minimum and maximum processor states are both at 100%.
Now, when it comes to the results: could it simply be that you are running Julia on 6 threads, so that you perform better when using @tturbo and therefore also @tullio? (A quick way to check both thread counts is sketched at the end of this post.) This would explain the similar scaling when using kron! and in the Matlab functions, and the faster scaling when using @tturbo and @tullio. One thing that makes me wonder, however, is why in your results @tullio is faster than the loops with @tturbo: @tullio is called twice in the function, and many of the computations for C and Cx are the same and therefore done twice. Would this mean that it is still possible to further improve the performance of the looped version?
Also, do you happen to have a Matlab license? It would be interesting to see how that would run on your machine.
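A quick way to check both thread counts (assuming Julia ≥ 1.6, where BLAS.get_num_threads is available):

using LinearAlgebra

Threads.nthreads()       # Julia threads, used by @tturbo and (by default) by Tullio
BLAS.get_num_threads()   # BLAS threads, used by mul! and the temp'*temp products

# Julia threads are fixed at startup (e.g. `julia -t 6`);
# BLAS threads can be changed at runtime with BLAS.set_num_threads(n).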
As an experiment, I ran the benchmark with the Julia and BLAS threads set to one (the latter via BLAS.set_num_threads(1)) and see:
3.196 s (1 allocation: 6.38 KiB)
1.091 s (1 allocation: 6.38 KiB)
1.083 s (1 allocation: 6.38 KiB)
431.337 ms (4 allocations: 224 bytes)
389.771 ms (4 allocations: 224 bytes)
394.443 ms (4 allocations: 224 bytes)
555.561 ms (10 allocations: 188.00 MiB)
557.053 ms (10 allocations: 188.00 MiB)
I'm curious to hear what you make of this. Interesting stuff could happen if someone runs this on a real workstation…
Sorry, no.
I don't know which of the variants is an interesting candidate for improvement. Both normalEquationsTullio! and normalEquationsMatlab look like clever solutions to the problem.
I'm sorry, I am not following all the details of the thread, but have you tried using MKL instead of the default OpenBLAS in the Julia implementations? That could explain the difference relative to Matlab if the performance depends mostly on matrix multiplication.
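If it helps, a minimal sketch of how one could verify which backend is active, assuming Julia ≥ 1.7, where MKL.jl swaps the BLAS backend at load time via libblastrampoline:

using LinearAlgebra

BLAS.get_config()   # default build: lists libopenblas as the active BLAS/LAPACK library

using MKL           # loading MKL.jl swaps the active backend

BLAS.get_config()   # should now report libmkl_rt instead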
I'd like to propose adding something like
function kr!(C, A, B)
@fastmath @inbounds for n = 1:size(A,1)
for i = 1:size(B,2)
for j = 1:size(A,2)
C[(i-1)*size(A,2)+j, n] = A[n,j] * B[n,i]
end
end
end
end
function _normal!(C,Cx,x,A,B)
temp = Matrix{Float64}(undef,size(A,2)*size(B,2),size(A,1))
kr!(temp, A, B)
mul!(C, temp, temp')
mul!(Cx, temp, x)
return Cx
end
to the comparison.
Here we go again. My latest MWE, switching to MKL,
using LinearAlgebra
using MKL
using LoopVectorization
using Tullio
using BenchmarkTools
# BLAS.set_num_threads(1)
function _kron(A::T1, B::T2) where {T1 <: AbstractVector, T2 <: AbstractVector}
C = zeros(size(A,1) * size(B,1))
@fastmath @views @inbounds for i = 1:size(A, 1)
C[(i-1)*size(B,1)+1:i*size(B,1)] .= A[i] * B
end
C
end
function _kron!(C::T1, A::T2, B::T3) where {T1 <: AbstractVector, T2 <: AbstractVector, T3 <: AbstractVector}
@fastmath @views @inbounds for i = 1:size(A, 1)
mul!(C[(i-1)*size(B,1)+1:i*size(B,1)], A[i], B)
end
C
end
function normal!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
_kron!(temp,B[n,:],A[n,:])
C .+= temp.*temp'
Cx .+= temp.*x[n]
end
return Cx
end
function kr!(C, A, B)
@fastmath @inbounds for n = 1:size(A,1)
# @views C[1:size(A,2)*size(B,2), n:n] .= reshape(A[n,:] * B[n,:]', size(A,2) * size(B,2))
# @views for i = 1:size(B, 2)
# C[(i-1)*size(A,2)+1:i*size(A,2), n:n] .= A[n,:] .* B[n,i]
# end
for i = 1:size(B,2)
for j = 1:size(A,2)
C[(i-1)*size(A,2)+j, n] = A[n,j] * B[n,i]
end
end
end
end
function _normal!(C,Cx,x,A,B)
temp = Matrix{Float64}(undef,size(A,2)*size(B,2),size(A,1))
kr!(temp, A, B)
mul!(C, temp, temp')
mul!(Cx, temp, x)
return Cx
end
function normalTurbo!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
_kron!(temp,B[n,:],A[n,:])
@turbo C .+= temp.*temp'
@turbo Cx .+= temp.*x[n]
end
return Cx
end
function normalTturbo!(C,Cx,x,A,B)
temp = Vector{Float64}(undef,size(A,2)*size(B,2))
@fastmath @views @inbounds for n = 1:size(A,1)
# @views for n = 1:size(A,1)
_kron!(temp,B[n,:],A[n,:])
@tturbo C .+= temp.*temp'
@tturbo Cx .+= temp.*x[n]
end
return Cx
end
# swapped arguments
function normalEquations!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
C4x = reshape(Cx, size(B,2),size(A,2))
@fastmath @inbounds @turbo for m = 1:size(B,2)
# @inbounds @turbo for m = 1:size(B,2)
for r = 1:size(A,2)
for mm = 1:size(B,2)
for rr = 1:size(A,2)
for n = 1:size(B,1)
temp = A[n,r] * B[n,m]
C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
C4x[m,r] = C4x[m,r] + temp*x[n]
end
end
end
end
end
return Cx
end
# swapped arguments
function normalEquationsP!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
C4x = reshape(Cx, size(B,2),size(A,2))
@fastmath @inbounds @tturbo for m = 1:size(B,2)
# @inbounds @tturbo for m = 1:size(B,2)
for r = 1:size(A,2)
for mm = 1:size(B,2)
for rr = 1:size(A,2)
for n = 1:size(B,1)
temp = A[n,r] * B[n,m]
C4[m,r,mm,rr] = C4[m,r,mm,rr] + temp * A[n,rr] * B[n,mm]
C4x[m,r] = C4x[m,r] + temp*x[n]
end
end
end
end
end
return Cx
end
# swapped arguments
function normalEquationsTullio!(C,Cx,x,A,B)
C4 = reshape(C, size(B,2),size(A,2),size(B,2),size(A,2)) # writing into original C
Cx4 = reshape(Cx, size(B,2),size(A,2))
@tullio C4[j,i,l,k] = A[n,i] * B[n,j] * A[n,k] * B[n,l]
@tullio Cx4[j,i] = A[n,i] * B[n,j] * x[n]
return Cx
end
function normalEquationsMatlab(CC,Cx,A,B,x)
batchSize = 10000;
stencil = ones(1, size(A,2))
@inbounds @fastmath @views for n = 1:batchSize:size(A,1)
idx = min(n+batchSize-1,size(A,1));
temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:], stencil)
CC .+= temp'*temp;
Cx .+= temp'*x[n:idx];
end
return Cx
end
function normalEquationsMatlabJulia(CC,Cx,A,B,x)
batchSize = 10000;
stencil = ones(1, size(A,2))
@fastmath @inbounds @views for n = 1:batchSize:size(A,1)
# @inbounds @views for n = 1:batchSize:size(A,1)
idx = min(n+batchSize-1,size(A,1))
@turbo temp = repeat(A[n:idx,:],1,size(B,2)).*kron(B[n:idx,:],stencil )
@turbo CC .+= temp'*temp
@turbo Cx .+= temp'*x[n:idx]
end
return Cx
end
N = 10000;
R = 20;
M = 40;
A = rand(N,M);
B = rand(N,R);
x = rand(N,);
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
A_r = PermutedDimsArray(permutedims(A), (2, 1))
B_r = PermutedDimsArray(permutedims(B), (2, 1))
normal!(C,Cx,x,A,B_r);
CNormal = copy(C)
CxNormal = copy(Cx)
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
_normal!(C,Cx,x,A,B_r);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTurbo!(C,Cx,x,A,B);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalTturbo!(C,Cx,x,A,B);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquations!(C,Cx,x,B_r,A);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsP!(C,Cx,x,B,A);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsTullio!(C,Cx,x,B,A);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlab(C,Cx,A,B,x);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
C = zeros(M*R,M*R);
Cx = zeros(M*R,);
normalEquationsMatlabJulia(C,Cx,A,B,x);
@assert C ≈ CNormal
@assert Cx ≈ CxNormal
@btime normal!($C,$Cx,$x,$A,$B_r);
@btime _normal!($C,$Cx,$x,$A,$B);
@btime normalTurbo!($C,$Cx,$x,$A,$B);
@btime normalTturbo!($C,$Cx,$x,$A,$B);
@btime normalEquations!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsP!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsTullio!($C,$Cx,$x,$B_r,$A);
@btime normalEquationsMatlab($C,$Cx,$A,$B,$x);
@btime normalEquationsMatlabJulia($C,$Cx,$A,$B,$x);
resulting in
3.705 s (1 allocation: 6.38 KiB)
41.172 ms (2 allocations: 61.04 MiB)
1.247 s (1 allocation: 6.38 KiB)
935.492 ms (10001 allocations: 162.62 KiB)
406.754 ms (4 allocations: 224 bytes)
112.410 ms (5 allocations: 240 bytes)
105.382 ms (152 allocations: 8.67 KiB)
118.262 ms (10 allocations: 188.00 MiB)
124.236 ms (10 allocations: 188.00 MiB)
So what is the difference when compared to Matlab now?
Here are my timings, since I'm procrastinating.
2.319 s (1 allocation: 6.38 KiB)
45.789 ms (2 allocations: 61.04 MiB)
626.227 ms (1 allocation: 6.38 KiB)
741.258 ms (10001 allocations: 162.62 KiB)
309.291 ms (4 allocations: 224 bytes)
38.514 ms (5 allocations: 240 bytes)
37.346 ms (860 allocations: 48.42 KiB)
87.815 ms (10 allocations: 188.00 MiB)
100.636 ms (10 allocations: 188.00 MiB)
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, generic)
That was 32 CPU threads, 8 BLAS threads, and without MKL.