Julia 1.3, 1.4 on MacOS and Intel MKL Error

ablaom · March 24, 2020, 10:44pm

Sorry in advance if this is the wrong place to post.

I am looking for any help narrowing down the source of a MacOS only / Julia 1.3, 1.4 only failure for the MLJModels.jl package: “Intel MKL Error”. Copied from the issue raised there:

I have introduced CI on this branch for Julia 1.3 and Julia 1.4, where testing is now failing on macOS.

The error is triggered by testing of the wrapped scikit-learn (Python) clustering models. According to the Travis logs, the conda installations for scikit-learn are the same for Linux and macOS, except that macOS has an additional package, llvm-openmp-4.0.1, installed.

Any help at all on this one would be appreciated. In particular, should I regard this as a Julia 1.3/1.4 error?

Here is the tail of the stack trace

Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.

signal (11): Segmentation fault: 11
in expression starting at /Users/travis/build/alan-turing-institute/MLJModels.jl/test/ScikitLearn/clustering.jl:139
thread_team_ctxt_commit_callback at /Users/travis/.julia/conda/3/lib/libmkl_intel_thread.dylib (unknown line)
mkl_lapack_thread_team_ctxt_commit_task at /Users/travis/.julia/conda/3/lib/libmkl_core.dylib (unknown line)
mkl_lapack_dgetrf at /Users/travis/.julia/conda/3/lib/libmkl_intel_thread.dylib (unknown line)
Allocations: 388377569 (Pool: 388284978; Big: 92591); GC: 274
ERROR: Package MLJModels errored during testing
Stacktrace:
 [1] pkgerror(::String, ::Vararg{String,N} where N) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/Types.jl:113
 [2] #test#131(::Bool, ::Nothing, ::Cmd, ::Cmd, ::typeof(Pkg.Operations.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/Operations.jl:1372
 [3] #test at ./none:0 [inlined]
 [4] #test#62(::Bool, ::Nothing, ::Cmd, ::Cmd, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/API.jl:253
 [5] (::Pkg.API.var"#kw##test")(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at ./none:0
 [6] #test#58 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/API.jl:233 [inlined]
 [7] (::Pkg.API.var"#kw##test")(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test)) at ./none:0
 [8] top-level scope at none:1

Albert_Zevelev · March 24, 2020, 10:59pm

@cstjean have you seen this issue on macs for Julia 1.3/1.4?

cstjean · March 24, 2020, 11:10pm

Not so far, but I am unfortunately not using ScikitLearn in my day-to-day work.

tlienart · March 25, 2020, 8:24am

Looking around it seems MKL throws such issues upon seeing corrupted values (inf, nans…)

Anyway I tried running the tests locally for MLJModels 0.9.1 (latest release) on my machine (a mac without MKL) and they fail as well with:

SpectralClustering: Error During Test at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/clustering.jl:139
  Got exception outside of a @test
  PyError ($(Expr(:escape, :(ccall(#= /Users/tlienart/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'ValueError'>
  ValueError('array must not contain infs or NaNs')

Later on

BayesianRidge: Test Failed at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:29
  Expression: isapprox(norm(predict(m, f, X) .- y) / norm(y), 0.0326918, rtol = 1.0e-5)
   Evaluated: isapprox(15.425555763884265, 0.0326918; rtol = 1.0e-5)
Stacktrace:
 [1] top-level scope at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:29
 [2] top-level scope at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1114
 [3] top-level scope at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:26

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.

~~I don’t think that’s an issue on the Julia’s side.~~

Edit: ok actually I don’t know, I also tested locally MLJModels 0.8.4 (previous minor release) and it also fails on Julia 1.4 and nightly with similar errors.

Edit2: I’ll try with a compat bound on MKL_jll → well I managed to try with the most recent 2020 one but that also failed.
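On the remark that MKL complains when it sees corrupted values: one cheap thing to rule out is non-finite entries in the data on the Julia side, before it is handed to scikit-learn. A minimal sketch, where `has_nonfinite` is a hypothetical helper name and the matrices are made-up examples. Note that it only checks the inputs, so it would not catch values that get corrupted inside a BLAS/LAPACK call.

```julia
# Hypothetical helper: true if any entry of the array is NaN or ±Inf.
has_nonfinite(A::AbstractArray{<:Real}) = any(!isfinite, A)

# Made-up example data, not taken from the MLJModels tests.
X_ok  = [1.0 2.0; 3.0 4.0]
X_bad = [1.0 NaN; Inf 4.0]

has_nonfinite(X_ok)   # false
has_nonfinite(X_bad)  # true
```

If this returns `false` for the test inputs and the error still appears, the bad values are presumably produced further down the stack rather than passed in.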

wolfgang · March 25, 2020, 11:31am

From

one can assume that, between your tested releases, LU factorisation somehow has a problem. The above-mentioned Discourse article from @Tecnezio links to the Intel documentation:

https://software.intel.com/en-us/node/520892

tlienart · April 1, 2020, 7:51am

Thanks @wolfgang, what’s a bit frustrating is that I don’t really see why we’re observing this while you can call sklearn.jl’s models separately without issues as far as I can tell; and to the best of my knowledge we don’t touch or set any linalg setting MKL or otherwise.

The same code runs in 1.2 and doesn’t run in 1.3 / 1.4 / nightly, which shouldn’t be the case. Worse than that, it used to work with 1.3 and 1.4 up until a point maybe 2-3 weeks ago. I tried using one of our earlier releases to see if it was something we introduced, but it also failed.

cc @tkelman with apologies for the ping

tlienart · April 3, 2020, 11:16am

Ok, after much hair pulling I found the issue by comparing the last working commit with the first failing commit and looking at what env was loaded in each. Long story short, the issue is with OpenSpecFun_jll v0.5.3+2 (now +3, but that doesn’t matter), which introduces a CompilerSupportLibraries_jll dependency. I don’t know what is doing what, but basically reverting to OpenSpecFun_jll v0.5.3+1 fixes the issue mentioned here. See also this issue for more details.
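For anyone wanting to repeat that kind of comparison, here is a rough sketch of diffing the package versions recorded for two environments. It assumes you have already extracted `name => version` maps from the two manifests. `changed_versions` is a hypothetical helper, and the version strings other than the OpenSpecFun_jll ones are made up for illustration.

```julia
# Hypothetical helper: given two `name => version` maps (e.g. from the manifest of
# the last working run and of the first failing run), list every package whose
# recorded version differs, or that appears in only one of the two.
function changed_versions(old::AbstractDict, new::AbstractDict)
    changes = Pair{String,Any}[]
    for name in sort(collect(union(keys(old), keys(new))))
        a = get(old, name, nothing)   # `nothing` means "not present"
        b = get(new, name, nothing)
        a == b || push!(changes, name => (a, b))
    end
    return changes
end

# Illustrative data only: the PyCall and CompilerSupportLibraries_jll versions are invented.
working = Dict("OpenSpecFun_jll" => "0.5.3+1", "PyCall" => "1.91.4")
failing = Dict("OpenSpecFun_jll" => "0.5.3+2", "PyCall" => "1.91.4",
               "CompilerSupportLibraries_jll" => "0.3.3+0")

changed_versions(working, failing)
```

Unchanged packages drop out, so only the newly added package and the bumped jll remain as candidates.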

cc @Albert_Zevelev and @staticfloat

Some notes:

I don’t know how to specify a compat version for a jll file (there’s a + in there) would someone know?
What package(s) add(s) this?

I’m surprised we’re the only ones who’ve had issues with this but it should be looked into asap.
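On the question of which package(s) pull this in, a possible way to check, sketched under the assumption that you are on Julia 1.4 or later, where `Pkg.dependencies()` is available. `dependents_of` is a hypothetical helper name.

```julia
using Pkg

# Hypothetical helper: names of the packages in `deps` that list `target` among
# their dependencies. `deps` is expected to look like `values(Pkg.dependencies())`,
# i.e. a collection of objects with a `name` field and a `dependencies`
# dictionary keyed by package name.
function dependents_of(target::AbstractString, deps)
    return sort([String(info.name) for info in deps
                 if haskey(info.dependencies, target)])
end

# Usage against the active environment (Julia >= 1.4):
# dependents_of("OpenSpecFun_jll", values(Pkg.dependencies()))
```

Taking the collection as an argument, rather than calling `Pkg.dependencies()` inside, keeps the helper easy to try on a small hand-made example first.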

kristoffer.carlsson · April 3, 2020, 5:27pm

Seems similar to https://github.com/JuliaPackaging/BinaryBuilder.jl/issues/700 where that package also breaks MKL in some cases.

Albert_Zevelev · April 15, 2020, 10:26pm

@tlienart
Hi Thibaut, has there been any luck incorporating this?

ablaom · April 16, 2020, 4:04am

My understanding is that a fix will not be available until https://github.com/JuliaPackaging/BinaryBuilder.jl/issues/700 is resolved

Albert_Zevelev · April 16, 2020, 4:29am

Got it. A couple of notes I’ll leave here.

Are we sure it’s an Intel MKL Error? The message I get says PyError

using MLJ
X, y =  @load_boston;
train, test = partition(eachindex(y), .7, rng=333);


@load ARDRegressor
mdl  = ARDRegressor()
mach = machine(mdl, X, y)

fit!(mach, rows=train) 

PyError ($(Expr(:escape, :(ccall(#= /Users/AZevelev/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'numpy.linalg.LinAlgError'>
LinAlgError('unrecoverable internal error.')
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/linear_model/_bayes.py", line 577, in fit
    sigma_ = update_sigma(X, alpha_, lambda_, keep_lambda, n_samples)
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/linear_model/_bayes.py", line 562, in update_sigma
    X[:, keep_lambda].T))
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/externals/_scipy_linalg.py", line 99, in pinvh
    s, u = decomp.eigh(a, lower=lower, check_finite=False)
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/scipy/linalg/decomp.py", line 474, in eigh
    raise LinAlgError("unrecoverable internal error.")

pyerr_check at exception.jl:60 [inlined]
pyerr_check at exception.jl:64 [inlined]
_handle_error(::String) at exception.jl:81
macro expansion at exception.jl:95 [inlined]
#110 at pyfncall.jl:43 [inlined]
disable_sigint at c.jl:446 [inlined]
__pycall! at pyfncall.jl:42 [inlined]
_pycall!(::PyCall.PyObject, ::PyCall.PyObject, ::Tuple{Array{Float64,2},Array{Float64,1}}, ::Int64, ::Ptr{Nothing}) at pyfncall.jl:29
_pycall!(::PyCall.PyObject, ::PyCall.PyObject, ::Tuple{Array{Float64,2},Array{Float64,1}}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at pyfncall.jl:11
(::PyCall.PyObject)(::Array{Float64,2}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at pyfncall.jl:86
(::PyCall.PyObject)(::Array{Float64,2}, ::Vararg{Any,N} where N) at pyfncall.jl:86
fit!(::PyCall.PyObject, ::Array{Float64,2}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at Skcore.jl:100
fit!(::PyCall.PyObject, ::Array{Float64,2}, ::Array{Float64,1}) at Skcore.jl:100
fit(::ARDRegressor, ::Int64, ::NamedTuple{(:Crim, :Zn, :Indus, :NOx, :Rm, :Age, :Dis, :Rad, :Tax, :PTRatio, :Black, :LStat),NTuple{12,Array{Float64,1}}}, ::Array{Float64,1}) at ScikitLearn.jl:157
fit!(::Machine{ARDRegressor}; rows::Array{Int64,1}, verbosity::Int64, force::Bool) at machines.jl:183
(::StatsBase.var"#fit!##kw")(::NamedTuple{(:rows,),Tuple{Array{Int64,1}}}, ::typeof(fit!), ::Machine{ARDRegressor}) at machines.jl:146
top-level scope at untitled-69d5ad375bc84d57ff345abae9b40d8f:15

In the issue raised by @kristoffer.carlsson, @staticfloat says using MKL_jll is a workaround.
I don’t think it works for me.
Does using MKL_jll work for anyone else?

tlienart · April 16, 2020, 5:46am

I’m sure that downgrading the version of openspecfun works. As I said earlier, I don’t know how to set a compat bound for a _jll file given that the version number is not standard (it ends with +k). I’ve not tried the workaround suggested.

Imo the more robust workaround for now, while we wait for the core issue to be fixed, is to have people install the old version of openspecfun, provided there’s a simple way that doesn’t involve having to clone the old image of the repo etc.

In short: does anyone know how to set a compat bound for a jll file, even if it’s not part of the Project.toml, in a way that’s easily reproducible?

tlienart · April 16, 2020, 5:35pm

@Albert_Zevelev would you mind trying the following?

using Pkg
Pkg.develop(PackageSpec(url="https://github.com/tlienart/OpenSpecFun_jll.jl"))

Then restart a session and try testing MLJModels?

Albert_Zevelev · April 16, 2020, 5:57pm

@tlienart no luck. Not sure if I did it correctly:

using Pkg
Pkg.develop(PackageSpec(url="https://github.com/tlienart/OpenSpecFun_jll.jl"))

using MLJ
X, y =  @load_boston;
train, test = partition(eachindex(y), .7, rng=333);

@load LGBMRegressor
mdl  = LGBMRegressor()
mach = machine(mdl, X, y)
fit!(mach, rows=train)


@load ARDRegressor
mdl  = ARDRegressor()
mach = machine(mdl, X, y)
fit!(mach, rows=train)

PyError ($(Expr(:escape, :(ccall(#= /Users/AZevelev/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'numpy.linalg.LinAlgError'>
LinAlgError('unrecoverable internal error.')
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/linear_model/_bayes.py", line 577, in fit
    sigma_ = update_sigma(X, alpha_, lambda_, keep_lambda, n_samples)
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/linear_model/_bayes.py", line 562, in update_sigma
    X[:, keep_lambda].T))
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/sklearn/externals/_scipy_linalg.py", line 99, in pinvh
    s, u = decomp.eigh(a, lower=lower, check_finite=False)
  File "/Users/AZevelev/.julia/conda/3/lib/python3.7/site-packages/scipy/linalg/decomp.py", line 474, in eigh
    raise LinAlgError("unrecoverable internal error.")

pyerr_check at exception.jl:60 [inlined]
pyerr_check at exception.jl:64 [inlined]
_handle_error(::String) at exception.jl:81
macro expansion at exception.jl:95 [inlined]
#110 at pyfncall.jl:43 [inlined]
disable_sigint at c.jl:446 [inlined]
__pycall! at pyfncall.jl:42 [inlined]
_pycall!(::PyCall.PyObject, ::PyCall.PyObject, ::Tuple{Array{Float64,2},Array{Float64,1}}, ::Int64, ::Ptr{Nothing}) at pyfncall.jl:29
_pycall!(::PyCall.PyObject, ::PyCall.PyObject, ::Tuple{Array{Float64,2},Array{Float64,1}}, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at pyfncall.jl:11
(::PyCall.PyObject)(::Array{Float64,2}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at pyfncall.jl:86
(::PyCall.PyObject)(::Array{Float64,2}, ::Vararg{Any,N} where N) at pyfncall.jl:86
fit!(::PyCall.PyObject, ::Array{Float64,2}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at Skcore.jl:100
fit!(::PyCall.PyObject, ::Array{Float64,2}, ::Array{Float64,1}) at Skcore.jl:100
fit(::ARDRegressor, ::Int64, ::NamedTuple{(:Crim, :Zn, :Indus, :NOx, :Rm, :Age, :Dis, :Rad, :Tax, :PTRatio, :Black, :LStat),NTuple{12,Array{Float64,1}}}, ::Array{Float64,1}) at ScikitLearn.jl:157
fit!(::Machine{ARDRegressor}; rows::Array{Int64,1}, verbosity::Int64, force::Bool) at machines.jl:183
(::StatsBase.var"#fit!##kw")(::NamedTuple{(:rows,),Tuple{Array{Int64,1}}}, ::typeof(fit!), ::Machine{ARDRegressor}) at machines.jl:146
top-level scope at MLJ_FitMachine_Error.jl:18

tlienart · April 16, 2020, 6:19pm

Ok (I think) that error is unrelated to the current problem and actually may be due to just not using the ARDRegression properly (can you open an issue separately so that we can look into this?)

For instance

@load TheilSenRegressor
mdl = TheilSenRegressor()
mach = machine(mdl, X, y)
fit!(mach, rows=train)

should work

Would you mind trying your script where you go over all models and see if any of them still throws an MKL error? thanks!!

PS: does the same stuff run with plain ScikitLearn.jl ?

thanks!

Edit: interestingly, it does work with ScikitLearn.jl, but very badly: if you fit on the full data and then predict on the same data, you get answers in the 1e8 - 1e9 range when the ground truth is around 20. So I would think the problem is with the ARDRegressor rather than MLJ. We might not be passing the right default hyperparameters, but either way it’s probably not a very good model here.

Albert_Zevelev · April 16, 2020, 7:00pm

In my original error post (Feb 18) the following models did not work:

        [ "ARDRegressor", "BayesianRidgeRegressor", "ElasticNetCVRegressor",
        "GaussianProcessRegressor", "LarsCVRegressor", "LarsRegressor",
        "LassoCVRegressor", "LassoLarsCVRegressor", "LassoLarsICRegressor",
        "LassoLarsRegressor", "OrthogonalMatchingPursuitCVRegressor",
        "OrthogonalMatchingPursuitRegressor" ]

I reran all regressions (continuous Y). The following don’t work

[ "ARDRegressor",  #PyError
"RANSACRegressor", #PyError
"BayesianRidgeRegressor",#Takes Forever
"RidgeCVRegressor" #worked 5min ago. now Takes Forever. 
]

I can live w/o these. But I’m concerned about the stability of MLJ & the overall ecosystem.
ScikitLearn.jl always worked

tlienart · April 16, 2020, 7:04pm

ok great that’s an improvement at least

I think those last 3 models potentially need another analysis. I’m not surprised they’re crap though but I should double check the default hyperparameters that they get.

Albert_Zevelev · April 16, 2020, 7:15pm

@tlienart
Can you check if you get the same on your machine?

#load packages
using MLJ, RDatasets, TableView, DataFrames
################################################################################
#OHE/AZ(X)/load_m/train_m
#OHE
function one_hot_encode(d::DataFrame)
    encoded = DataFrame()
    for col in names(d), val in unique(d[!, col])
        lab = string(col) * "_" * string(val)
        encoded[!, Symbol(lab) ] = ifelse.(d[!, col] .== val, 1, 0)
    end
    return encoded
end
#AZ: convert Strings & Count to OHE.
function AZ(X)
    sch = schema(X);
    #ty = [CategoricalString{UInt8}, CategoricalString{UInt32}, CategoricalValue{Int64,UInt32}]
    tn = [Int, Float16, Float32, Float64]
    vs = [];
    for (name, type) in zip(sch.names, sch.types)
        if type ∉ tn  #∈ ty #∉ [Int32, Int64, Float64]
            #println(:($name) , "  ", type)
            push!(vs, :($name) )
            #global X = coerce(X, :($name) =>Continuous);
        end
    end
    #
    Xd= DataFrame(X);
    X_ohe = one_hot_encode( Xd[:, vs]  )
    Xd = hcat( X_ohe, select(Xd, Not( vs )) )
    Xd = coerce(Xd, autotype(Xd, :discrete_to_continuous))
    #sch= schema(Xd);
    #@show sch.scitypes;
    #
    X=Xd
    return X
end
#Load & make model list.
@inline function load_m(model_list)
    model_names = Vector{String}(undef, length(model_list))
    @inbounds for (i, model) in enumerate(model_list)
        load(model.name, pkg=model.package_name, verbosity=0) #
        model_names[i] = model.name
    end
    return model_names
end
#Train & Score.
#NOTE: if we do target engineering we need to transform Y back to compare score.
@inline function train_m(m::String, X, y, train, test, pr, meas; invtrans=identity)
    t1 = time_ns()
    println(m)
    if m =="XGBoostRegressor"
        mdl  = eval(Meta.parse("$(m)(num_round=500)"))
    elseif m=="LGBMRegressor"
        mdl  = eval(Meta.parse("$(m)(num_iterations = 1_000, min_data_in_leaf=10)"))
    elseif m=="EvoTreeRegressor"
        mdl  = eval(Meta.parse("$(m)(nrounds = 1500)"))
    else
        mdl  = eval(Meta.parse("$(m)()"))
    end
    #
    mach = machine(mdl, X, y)
    fit!(mach, rows=train, verbosity=0) #, verbosity=0
    #ŷ = MLJ.pr(mach, rows=test)
    ŷ = pr(mach, rows=test)
    ŷ = invtrans.(ŷ)
    y = invtrans.(y)
    #AZ Custom oos-R2
    if meas==rmsl
        s = meas(abs.(ŷ), abs.(y[test]) )  #abs.() for rmsl AMES.
    else
        s = meas(ŷ, y[test])
    end
    t2 = time_ns()
    return [round(s, sigdigits=5), round((t2-t1)/1.0e9, sigdigits=5)]
end
#


################################################################################
#Boston 50 models
################################################################################
X, y =  @load_boston;
train, test = partition(eachindex(y), .7, rng=333);
X = AZ(X)
m_match = models(matching(X, y), x -> x.prediction_type == :deterministic);
m_names = load_m(m_match);

dropm = [ "ARDRegressor", "RANSACRegressor", 
    "BayesianRidgeRegressor", #takes forever
    "RidgeCVRegressor" #worked 5min ago. not anymore. 
    ]
filter!(m -> m ∉ dropm, m_names)

sc = [train_m(m, X, y, train, test, predict, rms) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, X, log.(y), train, test, predict, rms, invtrans=exp) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, log.(X.+1), y, train, test, predict, rms) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, log.(X.+1), log.(y), train, test, predict, rms, invtrans=exp) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#

tlienart · April 16, 2020, 9:09pm

lgtm, would be worth a tutorial with explanations

But the key thing here is that my fork of OpenSpecFun works and can be indicated as a workaround on MLJ for now.

Albert_Zevelev · April 16, 2020, 9:14pm

The reason I mention it is that, as I go through the examples (particularly the classification ones), I find a few other SKLearn packages causing problems…
I feel a little reluctant to work on the tutorial until MLJ for Mac is cleaned up.
