Julia 1.3, 1.4 on MacOS and Intel MKL Error

Sorry in advance if this is the wrong place to post.

I am looking for any help narrowing down the source of a MacOS only / Julia 1.3, 1.4 only failure for the MLJModels.jl package: “Intel MKL Error”. Copied from the issue raised there:


I have introduced CI on this branch for Julia 1.3 and Julia 1.4, where testing is now failing on macOS.

The error is triggered by the tests of the wrapped scikit-learn (Python) clustering models. According to the Travis logs, the conda installations for scikit-learn are the same on Linux and macOS, except that macOS has an additional package, llvm-openmp-4.0.1, installed.

Any help at all on this one would be appreciated. In particular, should I regard this as a Julia 1.3/1.4 error?

Here is the tail of the stack trace:

Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.
Intel MKL ERROR: Parameter 6 was incorrect on entry to DLASWP.

signal (11): Segmentation fault: 11
in expression starting at /Users/travis/build/alan-turing-institute/MLJModels.jl/test/ScikitLearn/clustering.jl:139
thread_team_ctxt_commit_callback at /Users/travis/.julia/conda/3/lib/libmkl_intel_thread.dylib (unknown line)
mkl_lapack_thread_team_ctxt_commit_task at /Users/travis/.julia/conda/3/lib/libmkl_core.dylib (unknown line)
mkl_lapack_dgetrf at /Users/travis/.julia/conda/3/lib/libmkl_intel_thread.dylib (unknown line)
Allocations: 388377569 (Pool: 388284978; Big: 92591); GC: 274
ERROR: Package MLJModels errored during testing
Stacktrace:
 [1] pkgerror(::String, ::Vararg{String,N} where N) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/Types.jl:113
 [2] #test#131(::Bool, ::Nothing, ::Cmd, ::Cmd, ::typeof(Pkg.Operations.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/Operations.jl:1372
 [3] #test at ./none:0 [inlined]
 [4] #test#62(::Bool, ::Nothing, ::Cmd, ::Cmd, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Pkg.API.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/API.jl:253
 [5] (::Pkg.API.var"#kw##test")(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test), ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at ./none:0
 [6] #test#58 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Pkg/src/API.jl:233 [inlined]
 [7] (::Pkg.API.var"#kw##test")(::NamedTuple{(:coverage,),Tuple{Bool}}, ::typeof(Pkg.API.test)) at ./none:0
 [8] top-level scope at none:1

@cstjean have you seen this issue on macs for Julia 1.3/1.4?

Not so far, but I am unfortunately not using ScikitLearn in my day-to-day work.

Looking around, it seems MKL throws such errors when it is handed corrupted values (infs, NaNs…)
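If corrupted values are indeed the trigger, one way to narrow it down would be to scan the data for non-finite entries before it reaches the linear algebra routines. The following is only a sketch in plain Python: `find_non_finite` and the `affinity` matrix are hypothetical names made up for illustration, not part of MLJModels, ScikitLearn.jl or scikit-learn.

```python
import math


def find_non_finite(matrix):
    """Return (row, col, value) for every entry that is inf or NaN."""
    return [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if not math.isfinite(v)
    ]


# Hypothetical input with one corrupted entry, standing in for whatever
# matrix ends up being factorised inside the clustering model.
affinity = [
    [1.0, 0.5, 0.0],
    [0.5, float("nan"), 0.2],
    [0.0, 0.2, 1.0],
]

bad = find_non_finite(affinity)
for i, j, v in bad:
    print(f"non-finite value {v!r} at row {i}, column {j}")
```

An empty result would suggest the input itself is clean and the bad values only appear further down, inside the numerical routines.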

Anyway I tried running the tests locally for MLJModels 0.9.1 (latest release) on my machine (a mac without MKL) and they fail as well with:

SpectralClustering: Error During Test at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/clustering.jl:139
  Got exception outside of a @test
  PyError ($(Expr(:escape, :(ccall(#= /Users/tlienart/.julia/packages/PyCall/zqDXB/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'ValueError'>
  ValueError('array must not contain infs or NaNs')

Later on

BayesianRidge: Test Failed at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:29
  Expression: isapprox(norm(predict(m, f, X) .- y) / norm(y), 0.0326918, rtol = 1.0e-5)
   Evaluated: isapprox(15.425555763884265, 0.0326918; rtol = 1.0e-5)
Stacktrace:
 [1] top-level scope at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:29
 [2] top-level scope at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1114
 [3] top-level scope at /Users/tlienart/.julia/packages/MLJModels/8gw1p/test/ScikitLearn/linear-regressors.jl:26

Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
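To get a sense of scale for the BayesianRidge failure above: for scalars and with no absolute tolerance, the relative-tolerance check amounts to |a - b| <= rtol * max(|a|, |b|). The sketch below redoes that arithmetic in Python with `math.isclose`, which applies the same formula. The two numbers are copied from the test output; the variable names are mine.

```python
import math

evaluated = 15.425555763884265  # relative error reported in the failed test
expected = 0.0326918            # reference value hard-coded in the test
rtol = 1.0e-5

# Largest gap the relative tolerance allows, and the gap actually observed.
allowed = rtol * max(abs(evaluated), abs(expected))
gap = abs(evaluated - expected)

passes = math.isclose(evaluated, expected, rel_tol=rtol)
ratio = evaluated / expected

print(f"allowed gap: {allowed:.3e}, observed gap: {gap:.3e}")
print(f"passes: {passes}, evaluated/expected: {ratio:.0f}x")
```

The observed value is several hundred times larger than the reference, so this looks like a broken fit rather than a borderline tolerance issue.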

I don’t think that’s an issue on Julia’s side.

Edit: ok actually I don’t know, I also tested locally MLJModels 0.8.4 (previous minor release) and it also fails on Julia 1.4 and nightly with similar errors.

Edit2: I’ll try with a compat bound on MKL_jll --> well I managed to try with the most recent 2020 one but that also failed. :man_shrugging:

From

one can assume that, between your tested releases, something went wrong in the LU factorisation. The Discourse post from @Tecnezio mentioned above points to the Intel documentation:

Thanks @wolfgang, what’s a bit frustrating is that I don’t really see why we’re observing this while you can call sklearn.jl’s models separately without issues as far as I can tell; and to the best of my knowledge we don’t touch or set any linalg setting MKL or otherwise.

The same code runs on 1.2 and doesn’t run on 1.3 / 1.4 / nightly, which shouldn’t be the case. Worse than that, it used to work with 1.3 and 1.4 until maybe 2-3 weeks ago. I tried using one of our earlier releases to see if it was something we introduced, but it also failed.

cc @tkelman with apologies for the ping