Hi, I’m bringing this thread back to life to get some advice.
I’m trying to avoid the ccall to blasfunc and am getting similar
performance (but a higher allocation burden) using
So, here’s a QR code that does this in two ways, one with BLAS calls
and one without. When I do not use BLAS calls I’m getting killed with
allocations on the two lines that update the new column:
qk.-=Qkm*rk
qk.-=Qkm*pk
Is there something I’m doing wrong here? Is there an obvious way to
reduce the allocation burden?
'preciate it,
– Tim
using LinearAlgebra   # provides norm and the BLAS wrappers

function classical2!(A)
    (m,n)=size(A)
    T=eltype(A)
    R=zeros(T,n,n)
    R[1,1]=norm(A[:,1])
    A[:,1]=A[:,1]/R[1,1]
    #
    # Turn on the BLAS calls
    #
    doblas=1
    #
    # Compute the factorization with CGS twice.
    #
    @views for k=2:n
        rk=R[1:k-1,k]
        qk=A[:,k]
        Qkm=A[:,1:k-1]
        pk=zeros(T,k-1)
        if doblas==0
            #
            # no BLAS
            #
            # Orthogonalize
            rk.+=Qkm'*qk
            qk.-=Qkm*rk
            # Orthogonalize again
            pk.=Qkm'*qk
            qk.-=Qkm*pk
            rk.+=pk
        else
            #
            # BLAS
            #
            # Orthogonalize
            BLAS.gemv!('T',1.0,Qkm,qk,1.0,rk)
            BLAS.gemv!('N',-1.0,Qkm,rk,1.0,qk)
            # Orthogonalize again
            BLAS.gemv!('T',1.0,Qkm,qk,0.0,pk)
            BLAS.gemv!('N',-1.0,Qkm,pk,1.0,qk)
            rk.+=pk
            #
        end
        R[k,k]=norm(qk)
        qk./=R[k,k]
    end
    return (Q=A, R=R)
end
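
For what it's worth, my understanding of where the allocations in the no-BLAS branch come from: `*` is an ordinary function call, not a dotted one, so in `qk.-=Qkm*rk` the product `Qkm*rk` is computed into a fresh temporary vector before the broadcast subtraction runs. The same goes for `Qkm'*qk`, and `pk` is also reallocated on every pass through the loop. Below is a rough sketch of one way this might be avoided without calling `BLAS.gemv!` directly, assuming the five-argument `mul!(C, A, B, α, β)` from LinearAlgebra (Julia 1.3 or later) is acceptable. The names `classical2_mul!`, `cgs2_step!` and `pk_buf` are made up for illustration and are not from the code above.

```julia
using LinearAlgebra

# Hypothetical helper: one CGS-twice step for column k, written with
# in-place mul! so that no temporary product vectors are created.
function cgs2_step!(A, R, pk_buf, k)
    rk  = view(R, 1:k-1, k)
    qk  = view(A, :, k)
    Qkm = view(A, :, 1:k-1)
    pk  = view(pk_buf, 1:k-1)      # slice of a preallocated work vector
    # Orthogonalize: rk = Qkm'*qk + rk, then qk = qk - Qkm*rk
    mul!(rk, Qkm', qk, true, true)
    mul!(qk, Qkm, rk, -1, true)
    # Orthogonalize again: pk = Qkm'*qk, then qk = qk - Qkm*pk
    mul!(pk, Qkm', qk)
    mul!(qk, Qkm, pk, -1, true)
    rk .+= pk
    nrm = norm(qk)
    R[k, k] = nrm
    qk ./= nrm
    return nothing
end

# Hypothetical variant of classical2! built on the helper above.
function classical2_mul!(A)
    m, n = size(A)
    T = eltype(A)
    R = zeros(T, n, n)
    pk_buf = zeros(T, n)           # allocated once, reused for every column
    q1 = view(A, :, 1)
    r11 = norm(q1)
    R[1, 1] = r11
    q1 ./= r11
    for k in 2:n
        cgs2_step!(A, R, pk_buf, k)
    end
    return (Q = A, R = R)
end
```

Calling `classical2_mul!(A)` overwrites `A` with Q just like `classical2!` does. For `Float64` matrices I would expect `mul!` to end up in the same BLAS routines anyway, so the main difference from the BLAS branch is readability and that it also works for element types BLAS does not cover. I have not benchmarked this, so it is worth checking with `@allocated` or BenchmarkTools.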