I am getting unpredictable, non-reproducible terminal crashes in Julia 1.7, both within VSCode and from the command line using PowerShell on Windows 11, and I don’t know how to diagnose the root issue. I’m sorry in advance for how long this is, so I will start with a summary question. If more context is necessary to answer it, then stick around.
Summary: I have a bug that I don’t know how to diagnose, because it causes an unpredictable terminal crash in Windows. I suspect it’s related to multithreading. I would like to know what tools exist to find the root cause.
Context:
The crash itself: It usually happens between minutes and hours after starting a long computation. I am running many (~1000) batches, each consisting of many (~1000) iterations of the same function, using Threads.@threads for each batch, and it is never the first batch that crashes. When it does crash, VSCode gives exit code 0xC0000005, which I am reliably told is a memory access error. I have 11,000,000 successful iterations (11,000 batches) complete at this point. It ran for several days, and many millions of iterations, without crashing, but for the last two days I have not been able to reliably complete more than a few batches.
Circumstances: This is multithreaded code operating on a shared database, in dataframe form, plus two auxiliary dicts. There is also a large amount of shared data that is currently defined as global consts, but I am using that read-only. I believe I’m using locks correctly to protect the data, creating ReentrantLocks in my calling function and then sharing them across all tasks, with a lock(L) do ... end wrapper around any code that touches the shared write-able resources. I can set the thing running, and it will be great for 10 minutes, so I leave it be, but when I come back an hour later it has crashed in the meantime. And because it’s a terminal crash, I don’t get any useful Julia debug info. Windows’ Resource Monitor shows nothing weird; memory usage is stable, number of threads is stable, etc.
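For readers following along, the locking pattern described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the original program: `work`, `run_batch`, and the `Dict` standing in for the shared database are hypothetical names.

```julia
using Base.Threads

# Hypothetical stand-in for one iteration of the real computation.
work(i) = i^2

function run_batch(n)
    results = Dict{Int,Int}()   # shared, writable resource (assumed stand-in)
    L = ReentrantLock()         # created once in the caller, shared by all tasks
    @threads for i in 1:n
        r = work(i)             # purely local computation, no lock needed
        lock(L) do
            results[i] = r      # every access to shared writable state is locked
        end
    end
    return results
end

res = run_batch(1000)
```

One thing worth checking in a pattern like this: a `Dict` is not safe to read while another task may be writing to it, so reads of the shared writable structures need to go through the same lock as the writes.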
Other context: When I was first developing this program, I ran into these crashes occasionally. They seemed to go away after a while, but not due to anything intentional on my part. For a few days the code seemed stable, and it did not crash across something like 30 clock hours of computation. In an effort to optimize, I recently moved the shared database from a global to a function, and now it crashes every time I run it. I changed nothing else other than loading the DB inside a function call and passing it as an arg to the functions that use it. This did give a large speedup. Anecdotally, it seemed like starting Julia with more threads made crashing more likely back when it was rare. These facts suggest to me that it’s some kind of data race issue, but for pointers? I’m in over my head on that. For what it’s worth, making the batches smaller seems to help.
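The change from a global database to a function argument might look something like the sketch below. All names here (`load_db`, `sum_global`, `sum_arg`, `main`) are hypothetical, and a small `Dict` stands in for the real DataFrame. It illustrates only why passing the object as an argument tends to be faster; it does not explain the crash.

```julia
# Hypothetical stand-in for loading the shared database.
load_db() = Dict(i => float(i) for i in 1:1000)

# Variant 1: non-const global. Its type may change at any time, so every
# access from inside a function is dynamically typed.
db_global = load_db()
sum_global() = sum(values(db_global))

# Variant 2: load inside a function and pass the object as an argument.
# The callee is compiled for the concrete type of its argument.
sum_arg(db) = sum(values(db))

function main()
    db = load_db()
    return sum_arg(db)
end

total = main()
```

Both variants compute the same result; only the way the compiler sees the database differs.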
Extenuating circumstances: I have a very large global resource that is shared among the threads in a read-only manner. It takes about 15 minutes to load it into memory and preprocess it, so debugging is very expensive. I don’t know how to produce a MWE because the crash was (until recently) very rare and part of a relatively complex program that relies on that data. So the purpose of this question is not really getting help fixing the problem, but just finding out how to diagnose it so I can fix it myself.
What kind of hardware are you running your calculations on?
I am asking because, while reading, I began to suspect memory faults caused by hardware without ECC RAM, such as consumer PCs.
Another approach could be to build Julia from source with debug symbols. It’s quite easy in a WSL2 Debian under Windows; I did it recently. With that you can run Julia in a gdb session, and maybe you can find out more. Still, this kind of crash is probably not easy to debug…
Do you have access to a compute workstation? That would be the easiest thing to check: if the crashes don’t happen on such a machine, you know more.
I just checked scaleway.com because I recently needed access to a virtual Mac desktop to help someone develop with Xcode/Swift, and 24 hours (the minimum lease, because of Apple’s licensing) cost only a bit over 2€.
I have now checked for a compute workstation there; it’s called Elastic Metal, and it’s very cheap:
You can get 2 Xeons with 192 GB of RAM and a 2 TB SSD for 0.25€ (25 cents) per hour. Ideal for checking out difficult things like your problem.
I don’t want to advertise too much for a single provider, and Amazon (AWS) also came to mind, but Scaleway is the only one where I can easily check prices, because I am a customer.
That’s how I would tackle your issue: 1) check on a workstation, 2) the gdb approach (surely very time-consuming).
Another easy idea: just try another Julia version, such as 1.6, 1.8, or nightly. If one of them doesn’t crash, you have a solution. It can always be a strange bug in one particular version.
Redirecting stderr didn’t work. I ran my file using julia -t 8 2>> "errors.txt", where the 2>> means redirect stderr. The file errors.txt was created but is empty even though it crashed again on the 35th batch. Thanks for everyone’s suggestions. I will try the other suggestions, but the timeline for them is quite long. This is a weekend project and I’ve already used most of my day today on this.
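Since a hard crash may leave stderr empty, one alternative is to have the program write its own progress markers to a file and flush after each one, so that the last marker shows roughly where it died. The following is a minimal sketch under assumptions: `log_progress` is a hypothetical helper, and the log location and the three-batch loop are placeholders.

```julia
# Hypothetical helper: append one line to a log file and flush it before
# returning, so the line is not left sitting in Julia's own buffer.
function log_progress(path, msg)
    open(path, "a") do io
        println(io, msg)
        flush(io)
    end
end

logfile = joinpath(mktempdir(), "progress.log")   # placeholder location

for batch in 1:3
    log_progress(logfile, "starting batch $batch")
    # ... run the batch here ...
    log_progress(logfile, "finished batch $batch")
end
```

If the process dies mid-batch, the file should end with a "starting batch" line that has no matching "finished" line. This narrows down when the crash happened, not why.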
I don’t think stderr will give you anything meaningful in the case of memory corruption, because by then the process has already crashed.
And I don’t think a race condition is the reason, because that would cause deadlocks or wrong results, but not a crash like this.
What you can also try, to address a possible hardware (RAM) failure: your 64 GB are probably 4x16 GB, so remove two of the modules from your system and see if it still crashes. If it does, put them back, remove the other two, and check again. Of course, this might not be possible if you need the full 64 GB to run your task at all, so it’s just an idea. Perhaps you can run it with less data.
for a big data (home version) like this? Sounds stressful.
Anyway, the suggestion to try a different version of Julia seems to be the winner. I installed 1.7.2, and the first time I ran it, it failed again – but this time with a stack trace! And mercifully there were no breaking changes in Serialization. Now I think it may have been an interaction with OneDrive. The exit code led me to suspect something was wrong in memory, but maybe not! Anyway, they fixed the problem from 1.7.1 → 1.7.2, so nothing to do here except chalk it up to experience. Thanks again for the excellent suggestions! And here is the offending stack trace for posterity.
The crash is back! 1.7.2 did not fix the problem, and it was not an interaction with OneDrive. I just got lucky and completed a full run of 1,000,000 iterations before it crashed again.