I finally understood your advice and figured out how to do it. This works:

# Y .= alpha*A*X .+ beta*Y via a direct ccall to the BLAS dgemv routine
using LinearAlgebra: BLAS   # for BLAS.@blasfunc

function klgemv!(trans, m, n, alpha, A, lda, X, incx, beta, Y, incy)
    ccall((BLAS.@blasfunc(dgemv_), Base.libblas_name), Cvoid,
          (Ref{UInt8}, Ref{Int64}, Ref{Int64}, Ref{Float64},
           Ptr{Float64}, Ref{Int64}, Ptr{Float64}, Ref{Int64},
           Ref{Float64}, Ptr{Float64}, Ref{Int64}),
          trans, m, n, alpha,
          A, lda, X, incx,
          beta, Y, incy)
    return Y
end
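In case it helps anyone else checking a wrapper like this: the stock `LinearAlgebra.BLAS.gemv!` calls the same underlying routine, so it makes a convenient reference for a sanity test. A minimal sketch (the commented `klgemv!` call shows what the equivalent invocation of the wrapper above would look like, assuming unit strides and `lda = m`):

```julia
using LinearAlgebra

# Sanity check for a dgemv wrapper: compare against the stock
# LinearAlgebra.BLAS.gemv!, which dispatches to the same BLAS routine.
m, n = 5, 3
A = rand(m, n)
x = rand(n)
y = zeros(m)

# y .= 1.0*A*x .+ 0.0*y via the library wrapper
BLAS.gemv!('N', 1.0, A, x, 0.0, y)
@assert y ≈ A * x

# With klgemv! from above in scope, the equivalent call would be:
# klgemv!('N', m, n, 1.0, A, m, x, 1, 0.0, y, 1)   # lda = m, unit strides
```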

I’m using this to implement the Arnoldi factorization with classical Gram-Schmidt applied twice, which is both stable and, if you have four or more cores, faster than modified Gram-Schmidt. I’ve been testing it for correctness against the qr! function in Julia and qr in Matlab.

This is not a good way to compute a QR factorization, and one expects it to be slower than the QR in LAPACK. For small problems the CGS approach is very competitive with QR, but for large ones LAPACK can use BLAS-3 calls and gets a serious speed boost. You see this in both Matlab and Julia. What I did not expect was that the Julia versions were so much faster. For a 20000x800 matrix, qr! in Julia was over 20x faster than qr in Matlab, and my CGS version was over 8x faster in Julia (once I figured out how to reduce the allocation burden with views).
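For concreteness, here is a minimal sketch of the "CGS twice" orthogonalization step, written with stock LinearAlgebra calls rather than the wrapper (`cgs2!` is a made-up helper name, not from any library). The point is that each pass is just two matrix-vector products, i.e. two gemv calls that a threaded BLAS can parallelize, whereas modified Gram-Schmidt works one column at a time:

```julia
using LinearAlgebra

# Classical Gram-Schmidt applied twice: orthogonalize w against the
# orthonormal columns of Q. Each pass is two matrix-vector products
# (BLAS-2 gemv calls), which is why this parallelizes better than
# modified Gram-Schmidt's column-by-column inner loop.
function cgs2!(Q::AbstractMatrix, w::AbstractVector)
    h = Q' * w                  # first projection: one gemv
    mul!(w, Q, h, -1.0, 1.0)    # w .-= Q*h: second gemv, no temporary
    g = Q' * w                  # second pass cleans up the first
    mul!(w, Q, g, -1.0, 1.0)
    h .+= g                     # accumulate the coefficients
    return h
end

# Tiny check: after CGS twice, w is orthogonal to the columns of Q,
# and the pieces reassemble the original vector.
Q = Matrix(qr(rand(8, 3)).Q)    # 8x3 orthonormal basis
w = rand(8)
w0 = copy(w)
h = cgs2!(Q, w)
@assert norm(Q' * w) < 1e-12
@assert Q * h + w ≈ w0
```

In an Arnoldi step, `w` would be `A*q_k`, `h` the new column of the Hessenberg matrix (before the final normalization entry), and the views mentioned above keep the `Q' * w` and `Q * h` products allocation-light.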

I stand impressed.