'inv' Causes Stack Overflow on Julia 1.7.0 and macOS

I am running into an odd error when using Julia 1.7.0 on my 2017 MacBook Pro (Intel i7) on macOS 12.01. In particular, executing the following fails:

n = 1000;
A = randn(Float32, n, n);
inv(A)
   ERROR: StackOverflowError:

No further stack trace is printed afterwards.

By extending the inv function, I was able to identify that the culprit appears to be the lu function.
Further, the same error occurs with ComplexF32 and ComplexF64 types, but not Float64.

On my machine, I tested both Julia 1.6.3 and a build of the Julia master branch, and the issue does not occur on either. At the same time, the issue does not appear to be entirely constrained to my machine, since I originally became aware of it through a CI test of one of my packages failing with the same message on Julia 1.6.

More detailed info: On my machine, inv works for all element types up to a certain size. For Float32, the failure size is 514 by 514 and for the complex types, it’s 258 by 258. Below these cutoffs, everything works as usual.
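For anyone who wants to check the cutoffs on their own machine, here is a minimal sketch of a size scan. The helper names inv_ok and first_failure are made up for this post, and the scanned sizes are just an assumption based on the cutoffs I reported above. On an unaffected build it should simply print nothing for every type.

```julia
using LinearAlgebra

# Hypothetical helper: true if inv succeeds on a random n-by-n matrix of type T,
# false if it throws a StackOverflowError. Any other error is rethrown.
function inv_ok(::Type{T}, n::Integer) where {T}
    A = randn(T, n, n)
    try
        inv(A)
        return true
    catch err
        err isa StackOverflowError || rethrow()
        return false
    end
end

# Hypothetical helper: first size in `sizes` at which inv fails, or nothing.
function first_failure(::Type{T}, sizes) where {T}
    for n in sizes
        inv_ok(T, n) || return n
    end
    return nothing
end

# Sizes around the reported cutoffs (258 and 514); adjust as needed.
sizes = [256:260; 512:516]
for T in (Float32, Float64, ComplexF32, ComplexF64)
    println(T, " => ", first_failure(T, sizes))
end
```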

Can anyone reproduce this on their machine or, ideally, does anyone know how to fix it?

To my surprise, I can confirm this (or something very similar). I’ve seen the same stack overflow error triggered by inv / lu / getrf, however, on a Linux machine and for Float64. I’m surprised because I’ve been using a release candidate (need to check which) for many weeks and have never seen this. But it appears right after upgrading to the stable release today. Note that I could “fix” the issue by setting BLAS.set_num_threads(1) at the beginning of my code. But, of course, that’s not really a solution.

(Btw, the Documenter.jl documentation of the same code even showed a segfault which could also be “solved” by setting the number of BLAS threads. Not sure whether this is related but it feels like it.)

I am somewhat relieved I am not alone in this :slight_smile:

I will try to find the time to create an MWE for Linux / Float64 etc. But until then, I can safely confirm the StackOverflowError on (Intel) macOS with your code example.

Note that things work after BLAS.set_num_threads(1).

UPDATE: Things also work when using MKL. The issue is related to OpenBLAS.
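To see which BLAS backend and how many BLAS threads a session is actually using, something like the following sketch may help. It assumes that BLAS.get_config() is available on Julia 1.7 and later and falls back to BLAS.vendor() on 1.6; the output format differs between versions, so none is shown here.

```julia
using LinearAlgebra

# Report the BLAS backend in use. BLAS.get_config() is assumed to exist on
# Julia >= 1.7; on 1.6 the older BLAS.vendor() is used instead.
if VERSION >= v"1.7"
    println(BLAS.get_config())
else
    println(BLAS.vendor())
end

# Number of threads the BLAS library is currently configured to use.
println("BLAS threads: ", BLAS.get_num_threads())
```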

Reproduced the behavior with Float32 on a 2019 MacBook Pro, Intel i9, macOS 11.6.1.

But I didn’t reproduce the stack overflow on a 2015 Thinkpad (Intel i7) running Fedora 32, either with Float64 or Float32.

I just tried the example of the OP on macOS (Intel) with all 1.7 release candidates. The issue doesn’t appear in rc1 but only in rc2 and rc3. So it seems to have been introduced between rc1 and rc2. (And as has been mentioned above, the issue is also absent on 1.8.)

UPDATE: Interestingly, the issue is also present in 1.6.4 (but not 1.6.3, as mentioned by the OP). So it also seems to be part of a backport. @kristoffer.carlsson

It works fine on Linux for me:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> n = 1000; A = randn(Float32, n, n); B = inv(A);

julia> size(B)
(1000, 1000)

julia> using LinearAlgebra; BLAS.get_num_threads()
4

To be clear, I never said that the example of the OP doesn’t work for me on Linux. However, I’ve very likely seen the same issue as part of a larger codebase on Linux.

Why do I think it’s the same issue?

  • StackOverflowError that can be traced back to inv / lu / getrf
  • Goes away with BLAS.set_num_threads(1)
  • Only occurs on 1.7.0 and (just tested) 1.6.4 but works fine on 1.6.3 and 1.7.0-rc1.

I don’t have an MWE yet and will report back once I have one.

Understood. Just trying to help, but that may best be done by letting you dig into it!

Just to add a little more information / evidence for the Linux case:

I was just able to produce a segfault (due to this):

signal (11): Segmentation fault
in expression starting at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/test/runtests.jl:19
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_64_ at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
getrf! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:575
#lu!#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
lu!##kw at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
#lu#153 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:279 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
inv at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/dense.jl:876
macro expansion at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/submatrix.jl:59 [inlined]
macro expansion at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/debugging.jl:15 [inlined]
submatrix_computation! at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/submatrix.jl:58 [inlined]
#7 at ./threadingconstructs.jl:178
unknown function (ip: 0x1555280d1c3f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
Allocations: 44277944 (Pool: 44172053; Big: 105891); GC: 294

And here is the stack trace printed below the StackOverflowError, as generated while testing a package (] test):

 Test threw exception
  Expression: maximum(abs.(R .- inv(Matrix(A)))) ≤ 1.0e-7
  StackOverflowError:
  Stacktrace:
    [1] getrf!(A::Matrix{Float64})
      @ LinearAlgebra.LAPACK /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:575
    [2] #lu!#146
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
    [3] #lu#153
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:279 [inlined]
    [4] lu (repeats 2 times)
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
    [5] inv(A::Matrix{Float64})
      @ LinearAlgebra /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/LinearAlgebra/src/dense.jl:876
    [6] macro expansion
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/Test/src/Test.jl:445 [inlined]
    [7] macro expansion
      @ /scratch/pc2-mitarbeiter/bauerc/CI-jacamar/data/bauerc/builds/yEdgJCGQ/000/pc2/julia/submatrixmethod.jl/test/runtests.jl:85 [inlined]
    [8] macro expansion
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
    [9] macro expansion
      @ /scratch/pc2-mitarbeiter/bauerc/CI-jacamar/data/bauerc/builds/yEdgJCGQ/000/pc2/julia/submatrixmethod.jl/test/runtests.jl:73 [inlined]
   [10] macro expansion
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
   [11] macro expansion
      @ /scratch/pc2-mitarbeiter/bauerc/CI-jacamar/data/bauerc/builds/yEdgJCGQ/000/pc2/julia/submatrixmethod.jl/test/runtests.jl:72 [inlined]
   [12] macro expansion
      @ /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
   [13] top-level scope
      @ /scratch/pc2-mitarbeiter/bauerc/CI-jacamar/data/bauerc/builds/yEdgJCGQ/000/pc2/julia/submatrixmethod.jl/test/runtests.jl:18

@SEA I think it would be good to open a GitHub issue for this. Would you mind creating one?

Btw, this reported issue seems very similar. Might have the same cause.

Can anyone reproduce the error on Linux, and if so, how? I’ve been trying to test https://github.com/JuliaPackaging/Yggdrasil/pull/3999 locally, but completely idiotic macOS security policies prevent me from doing anything. Being able to reproduce the error on Linux would save my sanity. Never mind, I was eventually able to solve my macOS problems.

FWIW, this is not precisely the same error message, but the following segfaults consistently on a Linux machine with Julia 1.7.0 and is related to OpenBLAS and getrf.

(Julia started with 8 threads, i.e. julia -t 8)

julia> n = 1000;

julia> Threads.@threads for i in 1:5
           A = randn(Float64, n, n); inv(A);
       end

signal (11): Segmentation fault
in expression starting at REPL[2]:1
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_64_ at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
getrf! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:575
#lu!#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
lu!##kw at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
#lu#153 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:279 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
inv at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/dense.jl:876
macro expansion at ./REPL[2]:2 [inlined]
#40#threadsfor_fun at ./threadingconstructs.jl:85
#40#threadsfor_fun at ./threadingconstructs.jl:52
unknown function (ip: 0x1554f0112d5f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
Allocations: 4321396 (Pool: 4319432; Big: 1964); GC: 5
Segmentation fault (core dumped)

Again, this goes away after BLAS.set_num_threads(1).

UPDATE: On another machine I needed to replace 1:5 with 1:20 to trigger the segfault.
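For completeness, here is a sketch of the same threaded loop with the workaround applied, i.e. Julia threads on the outside and a single BLAS thread on the inside. The function name threaded_inv and its arguments are made up for illustration; the iteration count can be raised as described above.

```julia
using LinearAlgebra

# Workaround: restrict OpenBLAS to one thread before calling inv from Julia threads.
BLAS.set_num_threads(1)

# Hypothetical helper: invert `iters` random n-by-n Float64 matrices in parallel.
function threaded_inv(n::Integer, iters::Integer)
    results = Vector{Matrix{Float64}}(undef, iters)
    Threads.@threads for i in 1:iters
        results[i] = inv(randn(Float64, n, n))
    end
    return results
end

Rs = threaded_inv(1000, 5)
println(length(Rs), " inverses of size ", size(Rs[1]))
```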

TLDR: it’s this Julia issue (also reported here), caused by a problem with the OpenBLAS package where an experimental threading feature was accidentally enabled in the package used for the Julia release. Seems like it will require a rapid Julia 1.7.1 bugfix release.

Workaround is to call BLAS.set_num_threads(1) for now :frowning:, or downgrade to Julia 1.6.3 until a fix is released.
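If you would rather not change the BLAS thread count globally, one option is to scope the workaround to the affected calls. This is only a sketch: with_single_blas_thread is a made-up helper name, and it assumes BLAS.get_num_threads() is available (Julia 1.6 and later).

```julia
using LinearAlgebra

# Hypothetical helper: run f() with one BLAS thread, then restore the old setting.
function with_single_blas_thread(f)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

A = randn(Float32, 1000, 1000)
B = with_single_blas_thread() do
    inv(A)
end
println(size(B))
```

Note that the thread count is a global setting, so other tasks calling into BLAS at the same time are also affected while f runs.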

Will that also fix https://github.com/JuliaLang/julia/issues/43242?

This isn’t reproducible for me, neither with 5 nor with 50 iterations.

Hm, that’s curious. I tried 3 different machines (JUWELS, Noctua and a local machine) with fresh Julia 1.7.0 installs, and I could make it segfault on all of them just by varying the upper iteration bound.

I’m not alone with the IJulia problem. I’m surprised you could not see it.

I created this issue referencing the Yggdrasil issue.