I can’t get threads to work for me despite the sense I’ve picked up here that it should be easy.
I have a large number (hundreds) of Excel files to write out. I construct them from a largeish dataset with XLSX.jl, using a simple template, and then write them out sequentially. This takes a bit of time, so I thought I’d try using multiple threads, more as an exercise in learning than in expectation of a significant speed-up. (I know writing to disk is slow.)
I can’t get it to work!
I have two arrays, one containing the intended file names and one containing the XLSXfiles.
@time begin
    for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
        Threads.@spawn begin
            for i in idcs
                o = @view(outfiles[i])
                f = @view(filled_templates[i])
                XLSX.writexlsx(o, f, overwrite=true)
            end
        end
    end
end
The values of idcs, outfiles and filled_templates are what I’d expect. When I run this with a set of around 150 Excel templates and 4 threads, it takes between one and three seconds to run. No error is generated and my code runs on, but no Excel files are created.
I’ve tried many variations using @spawn, @threads, OhMyThreads @tasks, etc. To no avail.
I expect I’m falling into one of several traps multi-threading offers, but I can’t figure it out.
Do you have any (simple) pointers for me to follow?
I don’t know if this is your main problem, but this code is missing a @sync (or other means of synchronization) to wait until all the spawned tasks have finished. Try replacing for idcs in ... with @sync for idcs in ... on the outer loop.
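To illustrate the suggestion above, here is a minimal sketch of the `@sync` plus `@spawn` pattern. It is not the original script: plain text files stand in for the Excel files, `write` stands in for `XLSX.writexlsx`, `contents` is a hypothetical replacement for `filled_templates`, and `Iterators.partition` from Base is used in place of `chunks` so the example needs no packages.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-ins for the thread's `outfiles` and `filled_templates`.
dir = mktempdir()
outfiles = [joinpath(dir, "file_$i.txt") for i in 1:20]
contents = ["contents of file $i" for i in 1:20]

# Roughly one chunk of indices per thread (stand-in for ChunkSplitters' `chunks`).
chunksize = cld(length(outfiles), nthreads())

# The outer @sync waits for every task spawned inside the loop.
@sync for idcs in Iterators.partition(eachindex(outfiles), chunksize)
    @spawn for i in idcs
        write(outfiles[i], contents[i])  # placeholder for XLSX.writexlsx
    end
end

# Execution only reaches this point after all spawned tasks have finished.
println(count(isfile, outfiles), " files written")
```

Without the `@sync`, the loop would return as soon as the tasks were scheduled, and the script could carry on (or exit) before any of them had done their work.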
for crow in eachrow(CSVrows)
    # Do preparatory stuff (define filename, etc)
    # Open XLSX template
    for sn in XLSX.sheetnames(template)
        writetemplate(template, crow)
    end
    # push!(outfiles, out_file)
    # push!(filled_templates, template)
    XLSX.writexlsx(out_file, template, overwrite=true)
end
This works. I did try to make this whole loop work with threads but there is too much going on and I couldn’t get it working. I tried to focus just on the slow bit instead.
This loop is also part of an outer loop that iterates over multiple batches of data in different folders, processing each folder in turn. It all works but takes many minutes.
The commented push! commands are where I was trying to create the arrays of filenames and filled templates to use in threads.
That code is synchronized through the sum(fetch, tasks) call, which needs to call fetch on each task and thus will not return until all tasks are finished. This is one of the “other means of synchronization” I mentioned.
Unfortunately, I don’t have any further insight regarding your problem.
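For reference, a small sketch of how `fetch` acts as a synchronization point, in the spirit of the `sum(fetch, tasks)` call mentioned above. The file names and payloads are made up for illustration; `write` returns the number of bytes written, which gives each task a value to fetch.

```julia
using Base.Threads: @spawn

dir = mktempdir()

# Each task writes one (hypothetical) file and returns the byte count.
tasks = map(1:8) do i
    @spawn write(joinpath(dir, "file_$i.txt"), "payload $i")
end

# fetch blocks until its task has finished, so this line cannot
# complete before every task is done.
total_bytes = sum(fetch, tasks)
println("wrote ", total_bytes, " bytes in ", length(tasks), " tasks")
```

This is handy when the tasks produce results anyway. When they are run only for their side effects, as with writing files, an outer `@sync` is usually the simpler choice.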
@time begin
    for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
        Threads.@sync Threads.@spawn begin
            for i in idcs
                o = @view(outfiles[i])
                f = @view(filled_templates[i])
                XLSX.writexlsx(o, f, overwrite=true)
            end
        end
    end
end
This appears to work - all the (correctly named) Excel files are created. Unfortunately, their content is all just repeats of a small subset of the templates. The one to one relationship between the two arrays has been lost.
Edit: This was me in a separate change, now fixed.
This makes your code sequential—each iteration spawns a task and then waits until it finishes before moving to the next iteration. So with this code, you might as well skip the spawning and syncing entirely. Does removing the spawn and sync macros from this code change the behavior/output at all? If so, that is very surprising.
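A small sketch of why the per-iteration `@sync` serializes the work. The `events` log and the `sleep` call are illustrative stand-ins, not part of the original script.

```julia
using Base.Threads: @spawn

events = String[]  # log of what happened, in order

for chunk in 1:3
    @sync @spawn begin
        push!(events, "start $chunk")
        sleep(0.05)                    # placeholder for the file-writing work
        push!(events, "done $chunk")
    end
    # @sync returns only once the task above has finished, so the
    # next iteration cannot begin before this point.
end

foreach(println, events)
```

Every "start" is followed by its own "done" before the next chunk begins, so at most one task is ever running. Moving `@sync` outside the `for` loop lets all the tasks be spawned first and waited on together.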
The script runs to completion and all the files are correctly created in all the right places. To all intents and purposes, everything works. However, on exit, I get terminated with exit code: -1073740940.
* The terminal process "C:\Users\TGebbels\.julia\juliaup\julia-1.10.5+0.x64.w64.mingw32\bin\julia.exe '--color=yes', '--startup-file=no', '--history-file=no', 'c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.120.2\scripts\debugger\run_debugger.jl', '\\.\pipe\vsc-jl-dbg-8456b44a-84e1-4207-bbe6-e4ef97a4444c', '\\.\pipe\vsc-jl-dbg-a9faa164-10bc-449a-b912-b6604e377fb2', '\\.\pipe\vsc-jl-cr-298a8501-1dec-4ec6-a109-c8a5c411e93e'" terminated with exit code: -1073740940.
If I take out the @spawn and @sync, I get the same results but a clean exit. What does the exit code mean, and does it matter?
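On the meaning of the number itself: Windows exit codes of this size are usually NTSTATUS values shown as signed 32-bit integers, so reinterpreting the bits as unsigned gives the familiar hexadecimal form. A quick way to check:

```julia
exit_code = Int32(-1073740940)

# Same 32 bits, read as an unsigned integer.
ntstatus = reinterpret(UInt32, exit_code)

println("0x", string(ntstatus, base = 16))
```

The result is 0xC0000374, which the Windows NTSTATUS tables list as STATUS_HEAP_CORRUPTION. That only says what the operating system reported at exit. It does not by itself show which part of the script, the packages, or the debugger session was responsible.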
There you have a single process being spawned, it’s not parallel.
(Or, more likely, I don’t get what you did last)
What if you just use
@time begin
    Threads.@threads for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
        for i in idcs
            o = @view(outfiles[i])
            f = @view(filled_templates[i])
            XLSX.writexlsx(o, f, overwrite=true)
        end
    end
end
Once again, you’re not waiting for the tasks to finish, so the completion of your script does not mean the tasks are completed. The inner @sync is doing nothing here. From your error message, it looks like you’re running your script in the debugger, which I don’t have experience with. Maybe there’s an issue with multitasking in that context?
I think there are still multiple concurrent tasks being spawned. The outer loop was just omitted from the quoted code.
Perhaps, in the VSCode extension, the debugger is always responsible for executing scripts whether or not the debugging functionality is enabled. In that case, it seems unlikely that such a basic feature as asynchronous tasks would be unsupported. But I don’t use VSCode, so can’t easily try it out myself.
Thanks @lmiq. I tried this and it works, but still with the improper termination at the end of the script.
I tried this before (I’m sure I did!) but without the chunking. I had understood from the docs that @threads did this itself and that is what differentiated it from @spawn. Anyway, whatever I tried before didn’t work.
Just tried again
Threads.@sync Threads.@threads for i in eachindex(outfiles)
    XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
end
and (now) it works! (but still with the improper termination).
Quick note: you don’t need @sync when using Threads.@threads. The latter takes care of synchronization for you. @sync is only needed when manually spawning tasks with Threads.@spawn or @async.
Correct, Threads.@threads divides the iteration range into chunks similar to what ChunkSplitters.jl does, although the latter in combination with @spawn provides a lot more flexibility. (Actually, Julia 1.11 will add a non-chunking mode Threads.@threads :greedy, but the default mode still chunks).
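As a sketch of the two points above: `Threads.@threads` on its own both splits the range and waits for completion, so no `@sync` and no manual chunking are needed. Text files and `write` are again stand-ins for the Excel files and `XLSX.writexlsx`, and `contents` is a hypothetical replacement for `filled_templates`.

```julia
using Base.Threads: @threads

dir = mktempdir()
outfiles = [joinpath(dir, "file_$i.txt") for i in 1:20]
contents = ["contents of file $i" for i in 1:20]

# @threads divides the index range among the available threads and
# blocks until every iteration has run.
@threads for i in eachindex(outfiles)
    write(outfiles[i], contents[i])  # placeholder for XLSX.writexlsx
end

# Safe to rely on the files here; the loop above has fully completed.
println(count(isfile, outfiles), " files written")
```

The `:greedy` mode mentioned above is not used here because it requires Julia 1.11, while the thread's error message shows Julia 1.10.5.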
Threads.@sync for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
    Threads.@spawn begin
        for i in idcs
            println(Threads.threadid(), " ", first(basename(outfiles[i]), 12))
            XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
        end
    end
end
What if you remove the XLSX line? The point is to see if the termination code still appears when you remove the actual work and only do trivial things in the loop.