Various BSODs when precompiling many packages in WSL on ARM64

I figured I should post this in case others run into it, because it seems like something that may be quite complicated to pin down.

I have a Windows on ARM system (Surface Pro 11), and so need to run Julia in WSL to use the version compiled for ARM. So I have VSCode set up to run connected to WSL, and to use Julia inside WSL.

When I tried to precompile a package from scratch inside VSCode’s REPL (it depends on Enzyme and Flux, among others), the entire system hung (display and all) at a point that varied from run to run while Enzyme was being precompiled. Eventually the system threw up a DPC_WATCHDOG_VIOLATION blue screen.

However, if I repeatedly precompile, it eventually gets through everything.

I also don’t seem to see the issue if I precompile from scratch just within a WSL Terminal window. CORRECTION: Not true

I’ve now seen a whole family of BSODs, and they reproduce very reliably:

IPI_WATCHDOG_TIMEOUT
CLOCK_WATCHDOG_TIMEOUT
DPC_WATCHDOG_VIOLATION

Turns out that the WSL team wants BSODs to be reported to Microsoft’s Security team, so I’ve sent them four minidumps from the crashes.

And this is what I sent them as a reproducer:

# 1. install Julia
curl -fsSL https://install.julialang.org | sh
# 2. Execute a long-running process in Julia
julia -e "import Pkg; Pkg.add([\"Flux\", \"Enzyme\", \"Symbolics\", \"Plots\"])"

Microsoft determined that the problem is resource exhaustion. They recommended using .wslconfig (see "Advanced settings configuration in WSL" on Microsoft Learn) to limit the number of processors available to the VM (the default is all of them), leaving 1 or 2 for Windows. On my 12-core machine, I set it to 10 cores and Julia worked perfectly. (I didn’t try 11.)

# Settings apply across all Linux distros running on WSL 2
[wsl2]

# Sets the VM to use ten virtual processors
processors=10

I take that back: it suddenly started happening again, even after I reduced the CPUs to 6.

Next attempt: setting autoMemoryReclaim to dropcache

I think this just sounds like a hardware problem to me, not something specific to Julia.
Did you try running a CPU stress test (e.g. Prime95) to test this hypothesis?

You can try to give more RAM to WSL. I think the default is 4 GB and no swap.

This helped me before when a huge package hung during precompilation in WSL:

[wsl2]
memory=6GB
swap=4GB

Should WSL hanging also crash the host OS, if that were the only problem? I could try that, though. My system RAM usage is already near the maximum with 16 GB, so there is not a lot of room.

I’ll try that, though Microsoft’s security people told me it looked like resource exhaustion. Julia does spin up a lot of processes to do the precompilation steps. Maybe there’s another resource being exhausted… like file system operations or handles?

I’ve run a long-running C compiler in WSL and nothing hung there.

"Should WSL hanging also crash the host OS, if that was the only problem?"

On my laptop (Windows 10, x86_64) the host does not crash when WSL runs out of memory. WSL hangs and does not restart successfully until the host is rebooted.

None of the other settings seems to work. Setting ENV["JULIA_NUM_PRECOMPILE_TASKS"] = 4, or restricting WSL to 4 virtual cores or fewer, seems to work more reliably… at least I haven’t seen a BSOD with it yet.
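For anyone who wants to try the same cap from the shell instead of from inside Julia, here is a minimal sketch. It assumes Julia picks up JULIA_NUM_PRECOMPILE_TASKS from the environment it is started in; the helper name cap_precompile_tasks is made up for illustration, and the limit of 4 is just the value that seemed to help above, not a known safe number.

```shell
# Hypothetical helper (not part of Julia or WSL): print the smaller of
# the online core count and the requested limit.
cap_precompile_tasks() {
    limit="$1"
    # nproc is from coreutils; fall back to getconf if it is missing.
    cores="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
    if [ "$cores" -lt "$limit" ]; then
        echo "$cores"
    else
        echo "$limit"
    fi
}

# Cap parallel precompilation at 4 tasks (assumed value, taken from the post above).
export JULIA_NUM_PRECOMPILE_TASKS="$(cap_precompile_tasks 4)"
echo "JULIA_NUM_PRECOMPILE_TASKS=$JULIA_NUM_PRECOMPILE_TASKS"

# Then start Julia from this same shell, for example:
#   julia -e 'import Pkg; Pkg.precompile()'
```

Setting it in the shell (or in a shell profile) means the cap is already in place before any package operation starts, which may be easier than remembering to set ENV inside each REPL session.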

Maybe I’ll try a stress test.

EDIT: 20 min of Prime95 on Windows directly - no issue. Also tried stress-ng on WSL - no issue.

I’m trying to see if I can get it to happen with anything other than Enzyme. So far no luck.

So far I’ve seen a handful of vulnerable points in the process: during Enzyme precompilation and during precompilation of an Enzyme extension.

Since precompilation actually executes code, I wonder if it’s a problem in Enzyme proper… EDIT: nope, Images (either ImageCore or TiffImages?) seems to have triggered it too.

I suggest a PR to limit the number of precompile tasks to 4, or even 3, or 1. 1 is safest, but we could start with 4 and lower it later if needed. It doesn’t matter too much what is chosen; people could always override it with JULIA_NUM_PRECOMPILE_TASKS.

A small modification to:

To implement that PR, I needed isWSL() and I started with that:

Strictly speaking, it could be merged separately (though I need to fix a minor issue with it; feel free to suggest a change for that, or for combining the two into one PR), followed by a PR building on it. Or, I was thinking, could I get away with doing both in one PR?
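For readers wondering what a WSL check can look like, here is a minimal shell sketch. To be clear, this is not the isWSL() from the PR. It assumes that WSL kernels include the string "microsoft" in /proc/sys/kernel/osrelease, which is a common heuristic rather than a guaranteed interface; the function name is_wsl and its optional file argument are hypothetical.

```shell
# Hypothetical helper, NOT the isWSL() from the PR.
# Assumption: WSL kernels report "microsoft" (any case) in their release string.
is_wsl() {
    # Optional argument lets you point at a different file for testing.
    release_file="${1:-/proc/sys/kernel/osrelease}"
    [ -r "$release_file" ] && grep -qi 'microsoft' "$release_file"
}

if is_wsl; then
    echo "running under WSL"
else
    echo "not running under WSL"
fi
```

A check like this could be used to lower the precompile task count only when running under WSL, leaving native Linux untouched.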

I appreciate the suggestions for a PR. It would be nice to know whether others on Windows ARM systems experience the same thing.

I’d like to understand better what the real cause is here if possible…

Might be related to this Super User question: "windows 10 - WSL process causing DPC_WATCHDOG_VIOLATION"

Maybe WSL generally has trouble with lots of threads doing filesystem operations?
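One way to probe that guess without Julia in the loop would be a small parallel file-churn script. The sketch below is only an assumption about what might be worth testing: the function name fs_stress, the worker count, and the file count are all made up, and it is not known whether this kind of load triggers the BSOD at all.

```shell
# Hypothetical stress sketch: N parallel workers each create, read,
# and delete small files in a shared temporary directory.
fs_stress() {
    workers="${1:-8}"
    files="${2:-200}"
    dir="$(mktemp -d)"
    i=0
    while [ "$i" -lt "$workers" ]; do
        (
            j=0
            while [ "$j" -lt "$files" ]; do
                f="$dir/w${i}_$j"
                echo "payload $i $j" > "$f"
                cat "$f" > /dev/null
                rm -f "$f"
                j=$((j + 1))
            done
        ) &
        i=$((i + 1))
    done
    wait    # block until every background worker has finished
    remaining="$(ls -A "$dir" | wc -l)"
    rmdir "$dir"
    echo "workers=$workers files_per_worker=$files leftover=$((remaining))"
}

fs_stress 8 200
```

If a run like this (with larger numbers) also hangs the host, that would point at WSL file system load in general; if it never does, the trigger is probably something more specific to what precompilation does.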

In your PR I see a second definition for Sys.iswindows rather than a definition for isWSL?

Edit: It seems to be corrected now. Thanks!
Maybe two further comments: 1) Possibly add a bullet in the 1.12 Readme? 2) Is there interest in backporting it to 1.11 (or 1.10)?

Another thought I had is that it could be related to ARM64 systems not having any concept of hyperthreading