@Ralph_Smith indeed – my use cases are all artificial! But that’s because I’m working on a modification to L_1 / L_2 regularized regression, and I’m just generating fake data to test with. Thanks for the pointer to some more robust methods. The coordinate descent approach seems like a nice place to live in terms interpretability (important for novices like me) and accessibility to modifications that make L_1 and L_2 penalties easy to add without making the matrices bigger, but I’ll definitely take a look at those other methods.
How about Julia vs Julia? I was surprised to discover my 0.6-rc is 10x slower than 0.5 official version in at least one use case. It’s compiled without GPL libs, but I can’t find where it’s calling into an optimized library. I believe it’s calling lufact! in lu.jl. Not sure what else is different between the two versions.
versioninfo()
Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
a=randn(90,90);
b=randn(90);
@time for i=1:10000; a\b; end
0.864465 seconds (80.00 k allocations: 635.529 MB, 6.92% gc time)
@which lufact(a)
lufact{T<:AbstractFloat}(A::Union{AbstractArray{Complex{T},2},AbstractArray{T,2}}) at linalg\lu.jl:72
versioninfo()
Julia Version 0.6.0-rc1.3
Commit c302d75d95 (2017-05-14 09:56 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
@time for i=1:10000; a\b; end
8.576880 seconds (80.00 k allocations: 635.529 MiB, 0.86% gc time)
@which lufact(a)
lufact(A::Union{AbstractArray{Complex{T},2}, AbstractArray{T,2}}) where T<:AbstractFloat in Base.LinAlg at linalg\lu.jl:81
You should’t benchmark in global scope like that https://docs.julialang.org/en/stable/manual/performance-tips/
julia 0.5:
julia> using BenchmarkTools
julia> a = randn(90,90); b = randn(90);
julia> @btime $a \ $ b;
103.155 μs (8 allocations: 65.08 KiB)
julia 0.6:
julia> using BenchmarkTools
julia> a = randn(90,90); b = randn(90);
julia> @btime $a \ $ b;
100.378 μs (8 allocations: 65.08 KiB)
Thanks for reminding me of the hazards of global scope. I tried it your way, but the relative performance is about the same.
I believe the culprit may be this line.
lpt = LAPACK.getrf!(A)
# v0.6
@code_lowered LAPACK.getrf!(a)
...
$(Expr(:foreigncall, :((Core.tuple)(:dgetrf_64_, Base.LinAlg.LAPACK.liblapack)), Void, svec(Ptr{Int64}, Ptr{Int64}, Ptr{Float64}, Ptr{Int64}, Ptr{Int64}, Ptr{Int64}), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(3)))), SSAValue(3), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(4)))), SSAValue(4), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.Float64), SSAValue(5))), SSAValue(5), :(&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(6)))), SSAValue(6), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(7))), SSAValue(7), :((Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr, Base.LinAlg.LAPACK.BlasInt), SSAValue(8))), SSAValue(8))) # line 538:
# v0.5
@code_lowered LAPACK.getrf!(a)
...
(Core.ccall)((Core.tuple)(:dgetrf_64_,Base.LinAlg.LAPACK.liblapack),Base.LinAlg.LAPACK.Void,(Core.svec)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.Float64),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),(Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt)),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(3))),SSAValue(3),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(4))),SSAValue(4),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.Float64),SSAValue(5)),SSAValue(5),&((Base.ptr_arg_unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(6))),SSAValue(6),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(7)),SSAValue(7),(Base.unsafe_convert)((Core.apply_type)(Base.LinAlg.LAPACK.Ptr,Base.LinAlg.LAPACK.BlasInt),SSAValue(8)),SSAValue(8)) # line 536:
How about for larger matrices?
The performance is 2x better in 0.6 with a 10000x10000 sized matrix. That doesn’t do me any good though, as I need to solve many smaller matrix problems.
How does this compare?
@btime LinAlg.generic_lufact!(copy(a), Val{true}) \ b
Try on nightly or rc2, there was a little issue with ccall lowering that Jeff just fixed.
Doing it this way
@btime LinAlg.generic_lufact!(copy(a), Val{true}) \ b
v0.5 is 3x slower and v0.6 is 2-3x faster; they have about the same run-time. I never got through the larger matrix test; it was taking too long.
I tried rc2 – no improvement.
The fact that large matrices are faster and “small” matrices are slower (90 x 90 is way larger than the ccall overhead could explain for) strongly suggests there is a difference in the build configuration of the lapack library.
For some reason, I didn’t get an email notification of your post. For what it’s worth, here’s my Make.user. There are no other configuration changes. I installed a second version of cygwin just for this build that has only the packages required.
XC_HOST = x86_64-w64-mingw32
USE_GPL_LIBS = 0
MARCH = native
If someone can point me to the relevant file(s), I can poke around and maybe narrow the problem down.