Hard crash using @spawn instead of @async

I am trying to modify a function that currently works fine with @async. So far I have been feeding it batches manually (4 hours each, i.e. 24 time steps), which generates about as many processes as I have available threads (32 on this Linux server). I am changing it to take a full 24 hours (144 time steps) and work out the batch size using ChunkSplitters.jl, but that needs @spawn.
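For readers following along, the batching arithmetic can be sketched with Base alone. This is only an illustration: it uses Iterators.partition as a stand-in and is not the ChunkSplitters.jl API, and starthour, endhour and ntasks are assumed values (0, 23, and the number of Julia threads).

```julia
# Build the 144 ten-minute time-step labels (HHMMSS) for one full day.
starthour, endhour = 0, 23   # assumed values for a 24-hour run
hourminutesec = sort(vec([string(h, pad=2) * string(m * 10, pad=2) * "00"
                          for h in starthour:endhour, m in 0:5]))

# Split the labels into roughly as many batches as there are threads.
# ntasks is a hypothetical name; on the server in question it would be 32.
ntasks = max(Threads.nthreads(), 1)
batchsize = cld(length(hourminutesec), ntasks)   # ceiling division
batches = collect(Iterators.partition(hourminutesec, batchsize))

println(length(hourminutesec), " time steps in ", length(batches), " batches")
```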
This old way works (@async version without chunks):

runbash(str) = run(`bash -c $str`)
hourminutesec = sort(vec([string(h,pad=2)*string(m*10,pad=2)*"00" for h in starthour:endhour, m in 0:5]))
for hms in hourminutesec
    !isfile("EbroLMA_"*chosendate*"_"*hms*"_0600.dat.gz") && @async begin
        status, best_stations = checkqua(chosendate, hms, foo, bar)
        if status == true
            inputfiles = "/home/oscar/lma/input/$chosendate/L[$best_stations]*$hms*"
            proc_command = "/home/oscar/lma/lma_analysis_new --options $inputfiles"
            runbash(proc_command)
        end
    end
end

However, using @spawn here instead of @async (with or without @sync) crashes Julia (1.10.2). It starts the threads; as they progressively complete, a julia.exe process appears with high CPU usage, and suddenly the REPL gets killed before I can see the error.

What could be the issue?

That generally indicates an out-of-memory error, but it's hard to say without more information.

That seems to be the case, but it happens for the same number of started threads as the @async version. I am now able to run it for a smaller batch. I note that with @spawn, when I supply 5 hours (30 threads) and it does not crash, a long-lived julia.exe thread persists alongside the actual lma_analysis_new threads, with a lot of virtual memory and >2500% CPU but only 2.5% mem: 11.7g 3.2g 489140 R 2672 2.5 15:44.58 julia
It seems that @spawn generates way more overhead than @async.

Edit: the threads completed and the julia process reached 3200% CPU, with the REPL frozen.
Edit 2: it also generated only 7 output files instead of 30.

I’ve done more testing. First, I ran it as a serial loop. This revealed a crash caused by a bug in the function. Now my serial loop runs fine and no errors occur, but the crash still occurs when I add @sync and @spawn.
Removing the map loop and feeding it 24 time steps sometimes produces a crash without killing the REPL. It displays:

double free or corruption (!prev)
[221604] signal (6.-6): Aborted
in expression starting at (line calling the function)

I then disabled the figure plotting of each time step (with CairoMakie). This was the culprit. Without it, everything runs as expected. I am not sure why. When I was still using @async, it could handle plotting without a problem.
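One plausible explanation for the difference, offered as a sketch rather than a confirmed diagnosis: tasks created with @async stay on the thread that created them, so code that is not thread-safe never runs on two threads at once, whereas @spawn tasks may run on any thread. The snippet below only records thread ids and does not involve CairoMakie; start Julia with several threads (for example `julia -t 4`) to see the difference.

```julia
using Base.Threads: @spawn, threadid, nthreads

# Thread on which the current (parent) task is running.
parent_thread = threadid()

# @async tasks are "sticky": they run on the thread that created them.
async_ids = fetch.([(@async threadid()) for _ in 1:8])

# @spawn tasks may be scheduled on any available thread.
spawn_ids = fetch.([(@spawn threadid()) for _ in 1:8])

println("parent thread: ", parent_thread)
println("@async ran on: ", unique(async_ids))
println("@spawn ran on: ", unique(spawn_ids), " (of ", nthreads(), " threads)")
```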

I think this might be a thread-safety bug in CairoMakie. Does it work if you keep everything in parallel other than plotting? If so, you should probably file an issue on the Makie repository.
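A minimal sketch of that test, assuming the per-time-step work can be separated from the plotting. The names process_step and plot_step are hypothetical stand-ins (plot_step only returns a file name here instead of calling CairoMakie):

```julia
using Base.Threads: @spawn

# Hypothetical stand-ins for the real computation and plotting routines.
process_step(hms) = (hms = hms, value = length(hms))   # thread-safe work
plot_step(result) = "plot_" * result.hms * ".png"      # placeholder for the real plotting call

timesteps = ["000000", "001000", "002000", "003000"]

# Parallel phase: one task per time step, with no plotting inside the tasks.
tasks = [(@spawn process_step(hms)) for hms in timesteps]
results = fetch.(tasks)

# Serial phase: plotting runs on a single task, one figure at a time.
plots = [plot_step(r) for r in results]
```

If this version runs cleanly while plotting inside the spawned tasks crashes, that points at the plotting code rather than at the parallel loop itself.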

CairoMakie is not thread-safe, for example when accessing FreeType objects, because FreeType is also not thread-safe. The FreeType documentation says:

An FT_Face object can only be safely used from one thread at a time. Similarly, creation and destruction of FT_Face with the same FT_Library object can only be done from one thread at a time. On the other hand, functions like FT_Load_Glyph and its siblings are thread-safe and do not need the lock to be held as long as the same FT_Face object is not used from multiple threads at the same time

And

In multi-threaded applications it is easiest to use one FT_Library object per thread. In case this is too cumbersome, a single FT_Library object across threads is possible also, as long as a mutex lock is used around FT_New_Face and FT_Done_Face.


Thanks. Could I perhaps take out the plot section as its own function and place a ReentrantLock around this function?

That could work, yes. We have not conducted a thorough analysis of the thread-safety problems in Makie. I think for CairoMakie the font stuff is the biggest problem, but there might also be issues in Makie with globals like themes, etc. In GLMakie it might be OpenGL global state. I think it would be totally reasonable to achieve thread safety for CairoMakie if someone were to put in the work.
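The locking idea discussed above could look roughly like this. It is a sketch under assumptions: plot_step is a hypothetical stand-in that appends to a shared vector (unsafe without the lock, much like the real plotting), and whether a single lock is enough for CairoMakie has not been verified in this thread.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the plotting code: mutating a shared vector
# is not thread-safe by itself, so it must only run while holding the lock.
plot_step(plotted, hms) = push!(plotted, hms)

function process_step(hms, plot_lock, plotted)
    value = length(hms)        # thread-safe work stays outside the lock
    lock(plot_lock) do         # only one task at a time may plot
        plot_step(plotted, hms)
    end
    return value
end

plot_lock = ReentrantLock()    # one lock shared by all tasks
plotted = String[]
timesteps = [string(h, pad=2) * "0000" for h in 0:23]

# @sync waits until every spawned task has finished.
@sync for hms in timesteps
    @spawn process_step(hms, plot_lock, plotted)
end

println(length(plotted), " time steps plotted")
```

The lock serializes only the plotting section, so the rest of each task can still run in parallel.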