How to diagnose a terminal crash?

SortofDamocles · March 19, 2022, 1:16pm

I am getting unpredictable and non-reproducible terminal crashes in Julia 1.7, both within VSCode and from the command line using PowerShell on Windows 11, and I don’t know how to diagnose the root issue. I’m sorry in advance for how long this is, so I will start with a summary question. If more context is necessary to answer it, then stick around.

Summary: I have a bug that I don’t know how to diagnose, because it causes an unpredictable terminal crash on Windows. I suspect it’s related to multithreading. I would like to know what tools exist to find the root cause.

Context:

The crash itself: Usually happens between minutes and hours after starting a long computation. I am running many (~1000) batches consisting of many (~1000) iterations of the same function, using Threads.@threads for each batch, and it is not the first batch which crashes. When it does crash, VSCode gives exit code 0xC0000005, which I am reliably told is a memory access error. I have 11,000,000 successful iterations (11,000 batches) complete at this point. It went several days, and many millions of iterations, without crashing, but now for the last two days I cannot reliably complete more than a few batches.
Circumstances: This is multithreaded code operating on a shared database, in dataframe form, plus two auxiliary dicts. There is also a large amount of shared data that is currently defined as global consts, but I am using that read-only. I believe I’m using locks correctly to protect the data, creating ReentrantLocks in my calling function and then sharing them across all tasks, with a lock(L) do ... end wrapper around any code that touches the shared write-able resources. I can set the thing running, and it will be great for 10 minutes, so I leave it be, but when I come back an hour later it has crashed in the meantime. And because it’s a terminal crash, I don’t get any useful Julia debug info. Windows’ Resource Monitor shows nothing weird; memory usage is stable, number of threads is stable, etc.
Other context: When I was first developing this program, I ran into these crashes occasionally. They seemed to go away after a while, but not due to anything intentional on my part. For a few days the code seemed stable, and it did not crash across something like 30 clock hours of computation. In an effort to optimize, I recently moved the shared database from a global to a function, and now it crashes every time I run it. I changed nothing else other than loading the DB inside a function call and passing it as an arg to the functions that use it. This did give a large speedup. Anecdotally, back when crashes were rare, starting Julia with more threads seemed to make them more likely. These facts suggest to me that it’s some kind of data race, but one involving pointers? I’m in over my head on that. For what it’s worth, making the batches smaller seems to help.
Extenuating circumstances: I have a very large global resource that is shared among the threads in a read-only manner. It takes about 15 minutes to load it into memory and preprocess it, so debugging is very expensive. I don’t know how to produce a MWE because the crash was (until recently) very rare and part of a relatively complex program that relies on that data. So the purpose of this question is not really getting help fixing the problem, but just finding out how to diagnose it so I can fix it myself.
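The locking pattern described under Circumstances can be sketched roughly as follows. This is a minimal illustration, not the actual program: `run_batch` and `results` are hypothetical names, and a plain `Dict` stands in for the shared DataFrame and auxiliary dicts.

```julia
using Base.Threads: @threads

# Hypothetical stand-in for one batch: each iteration does thread-local work,
# then touches the shared writable state only while holding the lock.
function run_batch(n::Int)
    results = Dict{Int,Int}()   # assumed shared, writable resource
    L = ReentrantLock()         # created in the calling function, shared by all tasks
    @threads for i in 1:n
        value = i^2             # thread-local computation, no lock needed
        lock(L) do
            results[i] = value  # every access to shared mutable state is guarded
        end
    end
    return results
end

results = run_batch(1000)
println(length(results))
```

Reads of `results` from inside the loop would need the same `lock(L) do ... end` wrapper, since a `Dict` can be resized by a concurrent writer.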

Thanks for any info!

oheil · March 19, 2022, 1:24pm

What kind of hardware are you running your calculations on?
I am asking because, while reading, my suspicion went to memory faults caused by hardware without ECC RAM, like consumer PCs.

oheil · March 19, 2022, 1:27pm

Another approach could be to build Julia with debug symbols from source. It’s quite easy in a WSL2 Debian under Windows; I did it recently. With that you can run Julia in a gdb session, and maybe you can find out more. Still, this kind of crash is probably not easy to debug…

SortofDamocles · March 19, 2022, 1:42pm

It is indeed non error correcting memory.

oheil · March 19, 2022, 1:53pm

Do you have access to a compute workstation? That would be the easiest check: if the crashes don’t happen on such a machine, you know more.

I just checked scaleway.com, because I recently needed a virtual macOS desktop to help someone developing with Xcode/Swift, and 24h (the minimum licence period, because Apple) cost only 2€ something.
I checked now for a virtual compute workstation there. It’s called Elastic Metal, and it’s very cheap:
You can get 2 Xeons with 192 GB RAM and a 2 TB SSD for 0.25€ (25 cents) per hour. Ideal for checking out difficult things like your problem.
I don’t want to advertise too much for a single provider, and Amazon (AWS) also came to mind, but Scaleway is the only one where I can easily check prices because I am a customer.

That’s how I would tackle your issue: 1) check on a workstation, 2) the gdb approach (surely very time-consuming).

goerch · March 19, 2022, 2:55pm

I would never underestimate the power of @inbounds ;), which I’d assume is used extensively…

More questions: if you run the REPL in a shell, I’d expect to see a Julia trace back in case of a crash? Which Julia version are you using?

SortofDamocles · March 19, 2022, 3:00pm

I do not use @inbounds anywhere.

There is no traceback because it crashes the entire terminal. This is Julia 1.7:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12th Gen Intel(R) Core(TM) i5-12600K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, goldmont)

Is there a way to configure Julia to dump tracebacks to a file? I suspect there is one, it’s just not visible because the terminal closes too quickly.
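One way to get Julia-level tracebacks into a file is to wrap the top-level call and write the exception plus backtrace yourself. A minimal sketch, assuming a hypothetical helper named `log_errors`; it only captures Julia exceptions, so a native crash such as an access violation would still leave the file empty.

```julia
# Hypothetical helper: run `f`, and if it throws, append the error and its
# backtrace to `path` before rethrowing. The file is closed (and therefore
# flushed) before control leaves the catch block.
function log_errors(f, path::AbstractString)
    try
        return f()
    catch err
        bt = catch_backtrace()
        open(path, "a") do io
            showerror(io, err, bt)
            println(io)
        end
        rethrow()
    end
end

logfile = tempname()
try
    log_errors(logfile) do
        error("boom")   # stand-in for the long-running computation
    end
catch
    # swallowed here only so the demo keeps running
end
println(read(logfile, String))
```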

goerch · March 19, 2022, 3:04pm

Maybe a stupid question, but just to be sure: in VS Code? What about trying cmd.exe then (I have yet to see Julia crash cmd.exe)?

Edit: I seem to recall I saw it once…

SortofDamocles · March 19, 2022, 3:06pm

Yes, the same thing happens in PowerShell.

oheil · March 19, 2022, 3:07pm

Another easy idea: just try another Julia version, such as 1.6, 1.8, or nightly. If one of them isn’t crashing, you have a solution. It can always be a strange bug in some version.

goerch · March 19, 2022, 3:26pm

OK, in such a scenario I’d try to run Julia with stdout and stderr redirection to a file. Don’t know if there is a better way in PowerShell.

Edit: another idea: can you exclude OOM?
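To help rule out OOM after the fact, memory figures can be appended to a file once per batch and flushed immediately, so the last line written before a crash survives. A minimal sketch, assuming a hypothetical helper `log_memory`; `Base.gc_live_bytes` is an unexported Base function, and the batch loop here is only a placeholder.

```julia
# Hypothetical helper: write one line of memory statistics and flush right away.
function log_memory(io::IO, batch::Integer)
    gib = 2.0^30
    free  = round(Sys.free_memory() / gib; digits=2)    # free system memory
    total = round(Sys.total_memory() / gib; digits=2)   # total system memory
    live  = round(Base.gc_live_bytes() / gib; digits=2) # bytes held by Julia's GC
    println(io, "batch=$batch free_GiB=$free total_GiB=$total gc_live_GiB=$live")
    flush(io)
end

memlog = tempname()
open(memlog, "a") do io
    for batch in 1:3          # placeholder for the real batch loop
        log_memory(io, batch)
    end
end
print(read(memlog, String))
```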

SortofDamocles · March 19, 2022, 3:37pm

Memory usage is stable. I’m trying again in Windows Terminal right now. Just passed 20,000 iterations. Using 22.6 GB out of 64 GB system memory.

Do you know off the top of your head how to redirect stderr? I have had no luck with the docs on that.

EDIT: i see that’s a powershell thing. never mind!

goerch · March 19, 2022, 3:39pm

For cmd.exe

echo 1.txt 1>error.log 2>&1
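The same redirection can also be done from a small parent Julia process, which has the side benefit that the parent survives and can report the child's exit code. A minimal sketch, assuming a hypothetical function `run_supervised`; the `-e` one-liner stands in for the real script, and a child that dies from a native crash may still write nothing to stderr.

```julia
# Hypothetical supervisor: start a child Julia process with stderr redirected
# to `errfile`, wait for it, and return its exit code instead of throwing.
function run_supervised(args::Vector{String}, errfile::AbstractString)
    cmd = `$(Base.julia_cmd()) $args`
    proc = run(pipeline(ignorestatus(cmd); stderr=errfile))
    return proc.exitcode
end

errfile = tempname()
# Stand-in for something like ["-t", "8", "main.jl"].
code = run_supervised(["-e", "println(stderr, \"child failed\"); exit(3)"], errfile)
println("exit code: ", code)
println(read(errfile, String))
```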

SortofDamocles · March 19, 2022, 3:40pm

Thanks for this gracious response to a pretty bad question

goerch · March 19, 2022, 3:51pm

This one is probably the most difficult to diagnose in your application. Maybe you could fork your code to make use of ConcurrentCollections.jl (JuliaConcurrent/ConcurrentCollections.jl on GitHub, concurrent data structures for Julia)?

Do you guard write access or all access to update-able resources?

SortofDamocles · March 19, 2022, 4:02pm

All access. I am running again with stderr redirected to a file; we’ll see what that looks like.

SortofDamocles · March 19, 2022, 4:26pm

Redirecting stderr didn’t work. I ran my file using julia -t 8 2>> "errors.txt", where the 2>> means redirect stderr. The file errors.txt was created but is empty even though it crashed again on the 35th batch. Thanks for everyone’s suggestions. I will try the other suggestions, but the timeline for them is quite long. This is a weekend project and I’ve already used most of my day today on this.

oheil · March 19, 2022, 4:50pm

I don’t think stderr will give anything meaningful in the case of memory corruption, because the process has already crashed.
And I don’t think a race condition is the reason, because that would cause deadlocks or false results, but not a crash like this.

To check for a possible hardware (RAM) failure, you can also try this: your 64 GB are probably 4x16 GB modules, so remove two of them from your system and see if it crashes. If yes, put them back, remove the other two, and check again whether it crashes. Of course this might not be possible if you need the full 64 GB to run your task at all, so it’s just an idea. Perhaps you can run it with less data.

A weekend project for a big data (home version) task like this? Sounds stressful.

SortofDamocles · March 19, 2022, 9:37pm

It’s usually quite relaxing!

Anyway, the suggestion to try a different version of Julia seems to be the winner. I installed 1.7.2, and the first time I ran it, it failed again – but this time with a stack trace! And mercifully there were no breaking changes in Serialization. Now I think it may have been an interaction with OneDrive. The exit code led me to suspect something was wrong in memory, but maybe not! Anyway, they fixed the problem from 1.7.1 → 1.7.2, so nothing to do here except chalk it up to experience. Thanks again for the excellent suggestions! And here is the offending stack trace for posterity.

SystemError: opening file "db.jls": Permission denied
Stacktrace:
  [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
    @ Base .\error.jl:174
  [2] #systemerror#68
    @ .\error.jl:173 [inlined]
  [3] systemerror
    @ .\error.jl:173 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing)
    @ Base .\iostream.jl:293
  [5] open(fname::String, mode::String; lock::Bool)
    @ Base .\iostream.jl:355
  [6] open(fname::String, mode::String)
    @ Base .\iostream.jl:355
  [7] open(::Serialization.var"#1#2"{DataFrame}, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base .\io.jl:328
  [8] open(::Function, ::String, ::String)
    @ Base .\io.jl:328
  [9] serialize
    @ C:\Users\rNr\AppData\Local\Programs\Julia-1.7.2\share\julia\stdlib\v1.7\Serialization\src\Serialization.jl:775 [inlined]
 [10] train(nbatches::Int64, batchsize::Int64)
    @ Main c:\users...    \main.jl:341
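If the `Permission denied` error on `db.jls` comes from another process briefly holding the file (an assumption, not something the trace proves), one defensive option is to retry the write a few times before giving up. A minimal sketch with a hypothetical helper `serialize_with_retry`; the attempt count and delay are arbitrary.

```julia
using Serialization

# Hypothetical helper: retry `serialize` when opening the file fails with a
# SystemError, and rethrow once the attempts are exhausted.
function serialize_with_retry(path::AbstractString, value; attempts::Int=5, delay::Real=0.5)
    for attempt in 1:attempts
        try
            serialize(path, value)
            return attempt                  # number of attempts that were needed
        catch err
            (err isa SystemError && attempt < attempts) || rethrow()
            sleep(delay)                    # give the other process time to release the file
        end
    end
end

path = tempname()
db = Dict("a" => 1, "b" => 2)               # stand-in for the DataFrame
used = serialize_with_retry(path, db)
println(deserialize(path) == db)
```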

SortofDamocles · March 20, 2022, 4:12pm

The crash is back! 1.7.2 did not fix the problem, and it was not an interaction with OneDrive. I just got lucky and completed a full run of 1,000,000 iterations before it crashed again.
