I’m posting this in case others run into it, because it looks like something that could be quite hard to pin down.
I have a Windows on ARM system (Surface Pro 11), and so need to run Julia in WSL to use the version compiled for ARM. So I have VSCode set up to run connected to WSL, and to use Julia inside WSL.
When I tried to precompile a package from scratch inside VSCode’s REPL (its dependencies include Enzyme and Flux, among others), the entire system hung, display and all, at a point that varied from run to run while Enzyme was being precompiled. Eventually the system threw a DPC_WATCHDOG_VIOLATION blue screen.
However, if I repeatedly precompile, it eventually gets through everything.
I also don’t seem to see the issue if I precompile from scratch in a plain WSL Terminal window. CORRECTION: Not true; it happens there as well.
Turns out that the WSL team wants BSODs to be reported to Microsoft’s Security team, so I’ve sent them four minidumps from the crashes.
And this is what I sent them as a reproducer:
# 1. install Julia
curl -fsSL https://install.julialang.org | sh
# 2. Execute a long-running process in Julia
julia -e "import Pkg; Pkg.add([\"Flux\", \"Enzyme\", \"Symbolics\", \"Plots\"])"
Microsoft determined that the problem is resource exhaustion. They recommended using .wslconfig (see “Advanced settings configuration in WSL” on Microsoft Learn) to set the number of processors available to the VM (the default is all of them), reserving 1 or 2 for Windows. On my 12-core machine, I set it to 10 cores and Julia worked perfectly. (I didn’t try 11.)
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Sets the VM to use ten virtual processors
processors=10
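For anyone adapting this to a machine with a different core count, here is a minimal shell sketch of the "leave 1 or 2 processors for Windows" arithmetic. The helper name `wslconfig_fragment` is hypothetical (it is not part of WSL); it only prints the fragment, and you would still paste the result into `.wslconfig` yourself.

```shell
#!/bin/sh
# Hypothetical helper: print a [wsl2] fragment that leaves some logical
# processors to Windows.
#   $1 = total logical processors on the host
#   $2 = number to reserve for Windows (defaults to 2)
wslconfig_fragment() {
    total=$1
    reserve=${2:-2}
    procs=$((total - reserve))
    # Never give the VM fewer than one processor.
    if [ "$procs" -lt 1 ]; then
        procs=1
    fi
    printf '[wsl2]\nprocessors=%s\n' "$procs"
}

# Example: a 12-core host with 2 processors reserved for Windows.
# Prints:
#   [wsl2]
#   processors=10
wslconfig_fragment 12 2
```

The file lives at `%UserProfile%\.wslconfig` on the Windows side, and the setting only takes effect after the VM restarts (`wsl --shutdown` from Windows). Running `nproc` inside WSL afterwards is a quick way to confirm the new limit.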
To me this just sounds like a hardware problem, not something specific to Julia.
Did you try running a CPU stress test (e.g. Prime95) to test this hypothesis?
Should WSL hanging also crash the host OS, if that was the only problem? Though I could try that. My system RAM usage is already near max though, with 16 GB, so not a lot of room.
I’ll try that, though Microsoft’s security people told me it looked like resource exhaustion. Julia does spin up a lot of processes to do the precompilation steps. Maybe there’s another resource being exhausted… like file system operations or handles?
I’ve run a long-running C compiler in WSL and nothing hung there.
None of the other settings seems to work. Setting ENV["JULIA_NUM_PRECOMPILE_TASKS"] = 4, or restricting WSL to 4 virtual cores or fewer, seems to work more reliably… at least I haven’t seen a BSOD with it yet.
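The same cap can also be applied from the shell before Julia starts, which covers non-interactive runs like the reproducer. This is only a sketch: the helper `precompile_tasks` and the cap of 4 are assumptions taken from the numbers in this thread, not anything Julia ships.

```shell
#!/bin/sh
# Hypothetical helper: pick a precompile task count, the smaller of the
# available core count ($1) and a cap ($2, defaults to 4).
precompile_tasks() {
    cores=$1
    cap=${2:-4}
    if [ "$cores" -lt "$cap" ]; then
        echo "$cores"
    else
        echo "$cap"
    fi
}

# Fall back to 1 if nproc is not available on this system.
cores=$(nproc 2>/dev/null || echo 1)

# Julia's package manager reads this variable when it precompiles.
JULIA_NUM_PRECOMPILE_TASKS=$(precompile_tasks "$cores" 4)
export JULIA_NUM_PRECOMPILE_TASKS

# Then run the precompilation as usual, for example:
# julia -e 'import Pkg; Pkg.precompile()'
```

Exporting the variable in the shell has the same effect as setting `ENV["JULIA_NUM_PRECOMPILE_TASKS"]` inside the session, as long as it is set before precompilation starts.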
Maybe I’ll try a stress test.
EDIT: 20 min of Prime95 on Windows directly - no issue. Also tried stress-ng on WSL - no issue.
I’m trying to see if I can get it to happen with anything other than Enzyme. So far no luck.
So far I’ve seen a handful of vulnerable points in the process: during Enzyme precompilation and during precompilation of an Enzyme extension.
Since precompilation actually executes code, I wonder if it’s a problem in Enzyme proper… EDIT: nope, Images (either ImageCore or TiffImages?) seems to have triggered it too.
I suggest a PR to limit the number of precompile tasks to 4, or even 3, or 1. 1 is safest, but we could start with 4 and lower it later if needed. It doesn’t matter too much what is chosen; people can always override it with JULIA_NUM_PRECOMPILE_TASKS.
A small modification to:
To implement that PR, I needed isWSL() and I started with that:
Strictly speaking, it could be merged separately (though I need to fix a minor issue with it; feel free to suggest a change for that, or for combining the two into one PR), with a second PR building on it. Or could I get away with doing both in one PR?
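For reference, one common way to detect WSL from inside the Linux side is to look for "microsoft" in the kernel release string. The sketch below is a shell illustration of that idea only; it is not the code from the PR, and the function name `is_wsl` and its optional path argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed if the kernel release string mentions
# "microsoft", which WSL kernels typically do.
#   $1 = file to inspect (defaults to /proc/sys/kernel/osrelease);
#        the argument exists only to make the function easy to test.
is_wsl() {
    osrelease=${1:-/proc/sys/kernel/osrelease}
    [ -r "$osrelease" ] && grep -qi 'microsoft' "$osrelease"
}

if is_wsl; then
    echo "running under WSL"
fi
```

A check like this could gate a lower default task count so that only WSL users are affected, which is the idea behind the proposal above.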
In your PR I see a second definition of Sys.iswindows rather than a definition of isWSL?
Edit: It now seems to be corrected. Thanks!
Maybe two further comments: 1) possibly add a bullet in the 1.12 Readme? 2) is there any interest in backporting it to 1.11 (or 1.10)?