Hi, I’m bringing this thread back to life to get some advice.

I’m trying to avoid the ccall to blasfunc and am getting similar performance (but a higher allocation burden) using

So, here’s a QR code that does this in two ways, one with BLAS calls and one without. When I do not use BLAS calls, I’m getting killed with allocations on the two lines that update the new column:

```
qk.-=Qkm*rk
qk.-=Qkm*pk
```

Is there something I’m doing wrong here? Is there an obvious way to reduce the allocation burden?
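
For reference, here is a minimal sketch of one thing that might help, assuming the allocations come from `Qkm*rk` being materialized as a temporary vector before the broadcast subtracts it. The five-argument `mul!` from `LinearAlgebra` does the update in place. The helper name `update_column!` and the test data are hypothetical, and for `Float32`/`Float64` arrays `mul!` will most likely still dispatch to BLAS internally, so this only avoids the explicit call, not BLAS itself.

```julia
using LinearAlgebra

# Hypothetical helper: qk <- qk - Qkm*rk without forming Qkm*rk as a temporary.
function update_column!(qk, Qkm, rk)
    # mul!(C, A, B, α, β) computes C = α*A*B + β*C in place.
    mul!(qk, Qkm, rk, -1, 1)
    return qk
end

# Made-up data, only to exercise the helper.
A = rand(100, 10)
Qkm = view(A, :, 1:4)
qk = view(A, :, 5)
rk = rand(4)

expected = qk - Qkm * rk      # allocating reference result (a copy)
update_column!(qk, Qkm, rk)   # in-place update of column 5 of A
```

The transposed product can be written the same way, for example `mul!(rk, Qkm', qk, 1, 1)` in place of `rk .+= Qkm'*qk`.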

'preciate it,

– Tim

```
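using LinearAlgebra

function classical2!(A)
    (m, n) = size(A)
    precision = eltype(A)
    R = zeros(precision, n, n)
    # Scalars passed to BLAS.gemv! must match the element type of A.
    o = one(precision)
    z = zero(precision)
    # Normalize the first column.
    R[1, 1] = norm(A[:, 1])
    A[:, 1] = A[:, 1] / R[1, 1]
    #
    # Turn on the BLAS calls
    #
    doblas = true
    #
    # Compute the factorization with CGS twice.
    #
    @views for k = 2:n
        rk = R[1:k-1, k]             # column k of R above the diagonal
        qk = A[:, k]                 # column being orthogonalized
        Qkm = A[:, 1:k-1]            # columns already orthonormal
        pk = zeros(precision, k - 1) # correction from the second pass
        if !doblas
            #
            # no BLAS
            #
            # Orthogonalize
            rk .+= Qkm' * qk
            qk .-= Qkm * rk
            # Orthogonalize again
            pk .= Qkm' * qk
            qk .-= Qkm * pk
            rk .+= pk
        else
            #
            # BLAS
            #
            # Orthogonalize
            BLAS.gemv!('T', o, Qkm, qk, o, rk)
            BLAS.gemv!('N', -o, Qkm, rk, o, qk)
            # Orthogonalize again
            BLAS.gemv!('T', o, Qkm, qk, z, pk)
            BLAS.gemv!('N', -o, Qkm, pk, o, qk)
            rk .+= pk
        end
        R[k, k] = norm(qk)
        qk ./= R[k, k]
    end
    return (Q = A, R = R)
end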
```
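
A rough sketch of how the no-BLAS branch could be written with in-place products throughout, assuming the goal is just to stop allocating inside the loop. The name `cgs2_mul!` and the work vector `work` are hypothetical. Besides the four products, it also reuses one preallocated vector instead of calling `zeros` for `pk` on every pass, since that is another allocation per column. I have not benchmarked this against the `gemv!` version.

```julia
using LinearAlgebra

# Hypothetical variant: classical Gram-Schmidt applied twice, with every
# matrix-vector product done in place via mul!.
function cgs2_mul!(A::AbstractMatrix{T}) where {T<:AbstractFloat}
    m, n = size(A)
    R = zeros(T, n, n)
    work = zeros(T, n)               # reused storage for pk
    q1 = view(A, :, 1)
    R[1, 1] = norm(q1)
    q1 ./= R[1, 1]
    @views for k in 2:n
        rk = R[1:k-1, k]
        qk = A[:, k]
        Qkm = A[:, 1:k-1]
        pk = work[1:k-1]
        # First pass: rk starts at zero, so overwriting is the same as adding.
        mul!(rk, Qkm', qk)           # rk = Qkm' * qk
        mul!(qk, Qkm, rk, -1, 1)     # qk = qk - Qkm * rk
        # Second pass
        mul!(pk, Qkm', qk)           # pk = Qkm' * qk
        mul!(qk, Qkm, pk, -1, 1)     # qk = qk - Qkm * pk
        rk .+= pk
        R[k, k] = norm(qk)
        qk ./= R[k, k]
    end
    return (Q = A, R = R)
end

# Quick sanity check on made-up data.
A0 = rand(50, 8)
F = cgs2_mul!(copy(A0))
residual = norm(F.Q * F.R - A0)      # factorization error
orth_err = norm(F.Q' * F.Q - I)      # loss of orthogonality
```

Both `residual` and `orth_err` should be near machine precision for a well-conditioned input like this one.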