How did you define `OLS_builtin`?

I wrapped it with:

```
function OLS_builtin{T<:Real}(X::Array{T,2}, y::Array{T,1})
β = X\y
return β
end
```

@roryd

Why not use the QR factorization offered by BLAS?

Try the MKL QR factorization; it's very fast.

Anyway, in my opinion your example could get a huge speed-up by using vectorized multiplication instead of a for loop.
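To illustrate the kind of vectorization being suggested, here is a minimal sketch; the function names (`corr_loop`, `corr_vec`) are illustrative, not taken from the original code:

```julia
# Loop form: the per-column correlation X[:, j]' * r computed manually.
function corr_loop(X, r, j)
    s = zero(eltype(X))
    for i in 1:size(X, 1)
        s += X[i, j] * r[i]
    end
    return s
end

# Vectorized form: a single matrix-vector product computes the
# correlation for every column at once and dispatches to BLAS.
corr_vec(X, r) = X' * r
```

`corr_vec(X, r)[j]` equals `corr_loop(X, r, j)`; the vectorized form replaces the per-column loops in one call.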

This looks pretty good. I’d take out the `+=` and use a `muladd`, but that’s about it. If one uses Julia with `-O3` (which is the default), this should auto-SIMD anyways, so I’d call it close to optimal. Then to finalize it I’d make an in-place version `OLS_cd!` where the user could pass in the vectors instead of allocating them. But that’s if you need to avoid BLAS for some reason. It’s probably best to do

```
@into ρ = X*r
```

and such (this uses InPlaceOps.jl to make the in-place BLAS call nicer; I always forget the actual name of the function).
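As a sketch of the `muladd` suggestion above (the loop body here is illustrative, not the original `OLS_cd` code):

```julia
# Dot-product accumulation with muladd: each iteration fuses the
# multiply and the add into a single fma instruction where the
# hardware supports it, instead of a separate * and +=.
function dot_muladd(x, y)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end
```

The same pattern applies to any accumulation loop of the form `s += a * b`.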

Anyways, you might want to check out IterativeSolvers.jl

In general, it’s more useful if you post complete code, including suitable matrices and vectors in this case.

You live a charmed life if your use cases are well-conditioned enough for simple coordinate descent.

I second @ChrisRackauckas in urging you to check out `lsqr!` in `IterativeSolvers`: it’s a nice implementation of a well-tested iterative scheme which is pretty robust. The algorithm has a scholarly web page which happily mentions the Julia version by Gomez and Holy.
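A minimal `lsqr!` call might look like the following (a sketch assuming IterativeSolvers.jl is installed; the exact signature has shifted somewhat across package versions):

```julia
using IterativeSolvers

A = randn(10_000, 1_000)
b = randn(10_000)

x = zeros(1_000)
lsqr!(x, A, b)   # overwrites x with the least-squares solution

# Sanity check against the dense QR solve: for a well-conditioned A,
# norm(x - A \ b) should be small.
```

The in-place form also lets you reuse `x` across many solves, which matters when solving many problems of the same size.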

Incidentally (attn: @Tecnezio), the left-divide function wrapped in `OLS_builtin` does in fact just call the LAPACK QR code for this case. For big problems the iterative schemes are indeed often faster.

@dpsanders thanks for the reminder – the missing code would be something like

```
n = 10000
p = 1000
A = randn(n,p)
b = randn(n)
```

@Ralph_Smith indeed – my use cases are all artificial! But that’s because I’m working on a modification to L_1 / L_2 regularized regression, and I’m just generating fake data to test with. Thanks for the pointer to some more robust methods. The coordinate descent approach seems like a nice place to live in terms of interpretability (important for novices like me) and accessibility to modifications that make L_1 and L_2 penalties easy to add without making the matrices bigger, but I’ll definitely take a look at those other methods.
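For concreteness, one lasso-style coordinate-descent sweep looks roughly like this; it is a textbook sketch, not the poster's actual code, and `soft`/`cd_pass!` are hypothetical names:

```julia
using LinearAlgebra  # for dot

# Soft-thresholding operator used in lasso coordinate descent.
soft(z, λ) = sign(z) * max(abs(z) - λ, zero(z))

# One sweep of coordinate descent for min ½‖y − Xβ‖² + λ‖β‖₁.
function cd_pass!(β, X, y, λ)
    r = y - X * β                    # current residual
    for j in 1:size(X, 2)
        xj = view(X, :, j)
        r .+= β[j] .* xj             # remove coordinate j's contribution
        β[j] = soft(dot(xj, r), λ) / dot(xj, xj)
        r .-= β[j] .* xj             # restore with the updated coefficient
    end
    return β
end
```

With `λ = 0` repeated sweeps converge to the ordinary least-squares solution for well-conditioned `X`; switching to an L_2 (ridge) penalty only changes the denominator to `dot(xj, xj) + λ`, which is why these penalties slot in without enlarging the matrices.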

How about Julia vs Julia? I was surprised to discover that my 0.6-rc is 10x slower than the official 0.5 version in at least one use case. It’s compiled without GPL libs, but I can’t find where it’s calling into an optimized library. I believe it’s calling `lufact!` in `lu.jl`. Not sure what else is different between the two versions.

```
versioninfo()
Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
a=randn(90,90);
b=randn(90);
@time for i=1:10000; a\b; end
0.864465 seconds (80.00 k allocations: 635.529 MB, 6.92% gc time)
@which lufact(a)
lufact{T<:AbstractFloat}(A::Union{AbstractArray{Complex{T},2},AbstractArray{T,2}}) at linalg\lu.jl:72
versioninfo()
Julia Version 0.6.0-rc1.3
Commit c302d75d95 (2017-05-14 09:56 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
@time for i=1:10000; a\b; end
8.576880 seconds (80.00 k allocations: 635.529 MiB, 0.86% gc time)
@which lufact(a)
lufact(A::Union{AbstractArray{Complex{T},2}, AbstractArray{T,2}}) where T<:AbstractFloat in Base.LinAlg at linalg\lu.jl:81
```

You shouldn’t benchmark in global scope like that: https://docs.julialang.org/en/stable/manual/performance-tips/

julia 0.5:

```
julia> using BenchmarkTools
julia> a = randn(90,90); b = randn(90);
julia> @btime $a \ $b;
103.155 μs (8 allocations: 65.08 KiB)
```

julia 0.6:

```
julia> using BenchmarkTools
julia> a = randn(90,90); b = randn(90);
julia> @btime $a \ $b;
100.378 μs (8 allocations: 65.08 KiB)
```

Thanks for reminding me of the hazards of global scope. I tried it your way, but the relative performance is about the same.

I believe the culprit may be this line.

```
lpt = LAPACK.getrf!(A)
# v0.6
@code_lowered LAPACK.getrf!(a)
...
$(Expr(:foreigncall, :((Core.tuple)(:dgetrf_64_, Base.LinAlg.LAPACK.liblapack)), Void, svec(Ptr{Int64}, Ptr{Int64}, Ptr{Float64}, Ptr{Int64}, Ptr{Int64}, Ptr{Int64}), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(3)))), SSAValue(3), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(4)))), SSAValue(4), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.Float64), SSAValue(5))), SSAValue(5), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(6)))), SSAValue(6), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(7))), SSAValue(7), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(8))), SSAValue(8))) # line 538:
# v0.5
@code_lowered LAPACK.getrf!(a)
...
(Core.ccall)((Core.tuple)(:dgetrf_64_,Base.LinAlg.LAPACK.liblapack),Base.LinAlg.LAPACK.Void,(Core.svec)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.Float64),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt)),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(3))),SSAValue(3),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(4))),SSAValue(4),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.Float64),SSAValue(5)),SSAValue(5),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(6))),SSAValue(6),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(7)),SSAValue(7),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(8)),SSAValue(8)) # line 536:
```

How about for larger matrices?

The performance is 2x better in 0.6 with a 10000×10000 matrix. That doesn’t do me any good, though, as I need to solve many smaller matrix problems.

How does this compare?

```
@btime LinAlg.generic_lufact!(copy(a), Val{true}) \ b
```

Try on nightly or rc2, there was a little issue with ccall lowering that Jeff just fixed.

Doing it this way

```
@btime LinAlg.generic_lufact!(copy(a), Val{true}) \ b
```

v0.5 is 3x slower and v0.6 is 2-3x faster; they have about the same run-time. I never got through the larger matrix test; it was taking too long.

I tried rc2 – no improvement.

The fact that large matrices are faster and “small” matrices are slower (90×90 is far too large for ccall overhead to explain the difference) strongly suggests there is a difference in the build configuration of the LAPACK library.
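One quick way to compare the two builds' BLAS configuration from each REPL is sketched below; note the module paths differ between 0.5/0.6 (`Base.LinAlg.BLAS`) and later versions (`LinearAlgebra.BLAS`), and `openblas_get_config` may not be available in every build:

```julia
# Run in both installations and diff the output.
println(Base.LinAlg.BLAS.vendor())              # which BLAS is loaded, e.g. :openblas64
println(Base.LinAlg.BLAS.openblas_get_config()) # OpenBLAS build flags and kernel target
```

If the two outputs disagree on the kernel target (e.g. a generic kernel versus Haswell/Skylake), that would account for a large constant-factor gap on small problems.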

For some reason, I didn’t get an email notification of your post. For what it’s worth, here’s my Make.user. There are no other configuration changes. I installed a second version of cygwin just for this build that has only the packages required.

```
XC_HOST = x86_64-w64-mingw32
USE_GPL_LIBS = 0
MARCH = native
```

If someone can point me to the relevant file(s), I can poke around and maybe narrow the problem down.