I am running into an odd error when using Julia 1.7.0 on my 2017 MacBook Pro (Intel i7) running macOS 12.0.1. In particular, executing the following throws a StackOverflowError:
n = 1000;
A = randn(Float32, n, n);
inv(A)
ERROR: StackOverflowError:
No further stack trace is printed afterwards.
By extending the inv function, I was able to identify that the culprit appears to be the lu function.
Further, the same error occurs with ComplexF32 and ComplexF64 types, but not Float64.
On my machine, I tested both Julia 1.6.3 and a build of the Julia master branch; the issue does not occur in either. At the same time, the issue does not appear to be entirely constrained to my machine, since I originally became aware of it through a CI test of one of my packages failing with the same message on Julia 1.6.
More detailed info: On my machine, inv works for the affected element types up to a certain size. For Float32, the smallest failing size is 514 by 514, and for the complex types it’s 258 by 258. Below these cutoffs, everything works as usual.
Can anyone reproduce this on their machine or, ideally, does anyone know how to fix it?
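For anyone who wants to look for the cutoff on their own machine, here is a minimal sketch. The helper name first_failing_size and the default size range are made up for illustration, and it assumes the failure surfaces as a catchable StackOverflowError (as in the REPL output above) rather than a hard crash.

```julia
using LinearAlgebra

# Hypothetical helper (not part of LinearAlgebra): return the first n in `sizes`
# for which inv throws a StackOverflowError on a random n-by-n matrix with
# element type T, or `nothing` if every size works.
function first_failing_size(::Type{T}; sizes = 256:2:516) where {T}
    for n in sizes
        A = randn(T, n, n)
        try
            inv(A)
        catch err
            # Only treat the stack overflow as "the bug"; pass anything else on.
            err isa StackOverflowError || rethrow()
            return n
        end
    end
    return nothing
end

for T in (Float32, Float64, ComplexF32, ComplexF64)
    println(T, " => ", first_failing_size(T))
end
```

On an unaffected Julia version this should print nothing for every element type; widen the range if your cutoff lies outside it.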
To my surprise, I can confirm this (or something very similar). I’ve seen the same stack overflow error triggered by inv / lu / getrf, although on a Linux machine and for Float64. I’m surprised because I’ve been using a release candidate (I need to check which one) for many weeks and never saw this; it appeared right after I upgraded to the stable release today. Note that I could “fix” the issue by calling BLAS.set_num_threads(1) at the beginning of my code. But, of course, that’s not really a solution.
(Btw, the Documenter.jl documentation of the same code even showed a segfault which could also be “solved” by setting the number of BLAS threads. Not sure whether this is related but it feels like it.)
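If you would rather not switch BLAS threading off globally, one option is to limit it only around the call that fails. The following is a rough sketch; with_single_blas_thread is a hypothetical helper name, and I am assuming BLAS.get_num_threads is available (Julia 1.6 and later), where on some setups it may return nothing.

```julia
using LinearAlgebra

# Hypothetical helper: run f() with BLAS limited to one thread, then restore
# the previous thread count.
function with_single_blas_thread(f)
    old = BLAS.get_num_threads()   # may be `nothing` on some 1.6 setups
    BLAS.set_num_threads(1)
    try
        return f()
    finally
        # Restore the previous setting only if it was known.
        old === nothing || BLAS.set_num_threads(old)
    end
end

A = randn(Float32, 1000, 1000)
Ainv = with_single_blas_thread() do
    inv(A)
end
```

This is still just the workaround in a narrower scope, not a fix, and the thread count is process-wide, so other tasks calling BLAS at the same time are affected too.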
I will try to find the time to create a MWE for Linux / Float64 etc. But until then, I can safely confirm the StackOverflowError on (Intel) macOS with your code example:
I just tried the example of the OP on macOS (Intel) with all 1.7 release candidates. The issue doesn’t appear in rc1, only in rc2 and rc3. So it seems to have been introduced between rc1 and rc2. (And as has been mentioned above, the issue is also absent on 1.8.)
UPDATE: Interestingly, the issue is also present in 1.6.4 (but not 1.6.3, as mentioned by the OP). So it also seems to be part of a backport. @kristoffer.carlsson
To be clear, I never said that the example of the OP doesn’t work for me on Linux. However, I have (very likely) seen the same issue as part of a larger codebase on Linux.
Why do I think it’s the same issue?
StackOverflowError that can be traced back to inv / lu / getrf
Goes away with BLAS.set_num_threads(1)
Only occurs on 1.7.0 and (just tested) 1.6.4 but works fine on 1.6.3 and 1.7.0-rc1.
I don’t have a MWE yet and will report back once I have one.
Just to add a little more information / evidence for the Linux case:
I was just able to produce a segfault (due to this):
signal (11): Segmentation fault
in expression starting at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/test/runtests.jl:19
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_64_ at /upb/departments/pc2/groups/pc2-mitarbeiter/bauerc/easybuild/software/JuliaHPC/1.7.0-intelcuda-2020b/bin/../lib/julia/libopenblas64_.so (unknown line)
getrf! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:575
#lu!#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
lu!##kw at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
#lu#153 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:279 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
inv at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/dense.jl:876
macro expansion at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/submatrix.jl:59 [inlined]
macro expansion at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/debugging.jl:15 [inlined]
submatrix_computation! at /scratch/pc2-mitarbeiter/bauerc/devel/SubmatrixMethod.jl/src/submatrix.jl:58 [inlined]
#7 at ./threadingconstructs.jl:178
unknown function (ip: 0x1555280d1c3f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
Allocations: 44277944 (Pool: 44172053; Big: 105891); GC: 294
And here is the stack trace below the StackOverflowError as generated while testing a package (] test):
Can anyone reproduce the error on Linux, and if so, how? I’ve been trying to test https://github.com/JuliaPackaging/Yggdrasil/pull/3999 locally, but completely idiotic macOS security policies prevent me from doing anything. Being able to reproduce the error on Linux would save my sanity. Never mind, I was eventually able to solve my macOS problems.
FWIW, this is not precisely the same error message, but the following segfaults consistently on a Linux machine with Julia 1.7.0 and is related to OpenBLAS and getrf.
(Julia started with 8 threads, i.e. julia -t 8)
julia> n = 1000;
julia> Threads.@threads for i in 1:5
           A = randn(Float64, n, n); inv(A);
       end
signal (11): Segmentation fault
in expression starting at REPL[2]:1
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_parallel at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
dgetrf_64_ at /cm/shared/apps/pc2/EB-SW/software/Julia/1.7.0-linux-x86_64/bin/../lib/julia/libopenblas64_.so (unknown line)
getrf! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lapack.jl:575
#lu!#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
lu!##kw at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:81 [inlined]
#lu#153 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:279 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
lu at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/lu.jl:278 [inlined]
inv at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/LinearAlgebra/src/dense.jl:876
macro expansion at ./REPL[2]:2 [inlined]
#40#threadsfor_fun at ./threadingconstructs.jl:85
#40#threadsfor_fun at ./threadingconstructs.jl:52
unknown function (ip: 0x1554f0112d5f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
Allocations: 4321396 (Pool: 4319432; Big: 1964); GC: 5
Segmentation fault (core dumped)
Again, it goes away with BLAS.set_num_threads(1).
UPDATE: On another machine I needed to replace 1:5 with 1:20 to trigger the segfault.
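Since the number of iterations needed seems to vary between machines, a small wrapper makes it easier to try different bounds. This is only a sketch: stress_inv is a hypothetical helper name, and on an affected build the expected outcome is a segfault of the whole process rather than a return value.

```julia
using LinearAlgebra

# Hypothetical helper: invert `iters` random n-by-n Float64 matrices from
# Julia threads and return how many inversions completed.
function stress_inv(iters::Integer; n::Integer = 1000)
    done = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:iters
        A = randn(Float64, n, n)
        inv(A)
        Threads.atomic_add!(done, 1)
    end
    return done[]
end

# Report the threading setup alongside the result, since both seem to matter.
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())
println("completed:     ", stress_inv(5))
```

Start Julia with several threads (for example julia -t 8) and raise the argument, e.g. stress_inv(20), if 5 iterations are not enough to trigger the crash.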
TLDR: it’s this Julia issue (also reported here), caused by a problem with the OpenBLAS package where an experimental threading feature was accidentally enabled in the package used for the Julia release. It seems this will require a rapid Julia 1.7.1 bugfix release.
The workaround for now is to call BLAS.set_num_threads(1), or to downgrade to Julia 1.6.3 until a fix is released.
Hm, that’s curious. I tried three different machines (JUWELS, Noctua, and a local machine) with fresh Julia 1.7.0 installs, and I could make it segfault on all of them just by varying the upper iteration bound.
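If the same code has to run on several Julia versions (e.g. in CI), the workaround could be applied only where it is needed. A minimal sketch follows, assuming the set of affected versions is exactly the ones reported in this thread; AFFECTED_VERSIONS and is_affected are made-up names and the list may well be incomplete.

```julia
using LinearAlgebra

# Versions reported as affected in this thread (assumption: not exhaustive).
const AFFECTED_VERSIONS = (v"1.6.4", v"1.7.0-rc2", v"1.7.0-rc3", v"1.7.0")

# Hypothetical helper: is the given (default: running) Julia version affected?
is_affected(v::VersionNumber = VERSION) = v in AFFECTED_VERSIONS

if is_affected()
    # Workaround from this thread: disable multithreading in BLAS.
    BLAS.set_num_threads(1)
end
```

That way unaffected versions keep multithreaded BLAS, and the check can be dropped once a fixed release is out.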