Help needed getting started with threads!

TimG · September 1, 2024, 7:16pm

I can’t get threads to work for me despite the sense I’ve picked up here that it should be easy.

I have a large number (hundreds) of Excel files to write out. I construct them from a largeish dataset using a simple template using XLSX.jl and then write them out sequentially. This takes a bit of time, so I thought I’d try using multiple threads - more as an exercise in learning than in expectation of a significant speed-up. (I know writing to disk is slow).

I can’t get it to work!

I have two arrays, one containing the intended file names and one containing the XLSXfiles.

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

The values of idcs, outfiles and filled_templates are what I’d expect. When I run this with a set of around 150 Excel templates and 4 threads, it takes between one and three seconds to run. No error is generated and my code runs on, but no Excel files are created.

I’ve tried many variations using @spawn, @threads, OhMyThreads @tasks, etc. To no avail.

I expect I’m falling into one of several traps multi-threading offers, but I can’t figure it out.

Do you have any (simple) pointers for me to follow?

Thanks!
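A possible explanation for the "no error, no files" behaviour described above, offered as a sketch rather than a diagnosis: an exception thrown inside a task started with Threads.@spawn is stored in the task and stays silent until something calls wait or fetch on it (or an enclosing @sync finishes). The error("boom") call below is a hypothetical stand-in for whatever fails inside the real loop.

```julia
# Hypothetical failing task; error("boom") stands in for a failing call
# inside the real loop.
t = Threads.@spawn error("boom")   # nothing is thrown or printed here

# The failure only surfaces once the task is waited on.
caught = try
    wait(t)
    nothing
catch err
    err
end

println(typeof(caught))      # the wrapper exception type
println(istaskfailed(t))     # whether the task ended with an error
```

If nothing ever waits on the tasks, the script can carry on (or exit) without any sign that the work failed.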

gdalle · September 1, 2024, 7:31pm

Hi there! What does the sequential for loop look like, and does it work?

danielwe · September 1, 2024, 7:46pm

TimG:

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

I don’t know if this is your main problem, but this code is missing a @sync (or other means of synchronization) to wait until all the spawned tasks have finished. Try replacing for idcs in ... with @sync for idcs in ... on the outer loop.
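A minimal sketch of the suggested placement, with the @sync wrapping the outer loop so that it waits for every spawned task. To keep it self-contained, Iterators.partition from Base stands in for chunks from ChunkSplitters.jl, and filling an array stands in for writing files; both substitutions are assumptions made for illustration.

```julia
# Stand-in workload: fill an array instead of writing Excel files.
results = zeros(Int, 20)

# Roughly one chunk per thread (Base-only substitute for `chunks`).
chunk_len = cld(length(results), Threads.nthreads())

# @sync on the OUTER loop: it returns only after all tasks spawned
# inside the loop body have finished.
@sync for idcs in Iterators.partition(eachindex(results), chunk_len)
    Threads.@spawn begin
        for i in idcs
            results[i] = i^2
        end
    end
end

println(results)
```

Each task writes to its own indices, so no locking is needed in this sketch.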

TimG · September 1, 2024, 8:04pm

Grossly simplified,

    for crow in eachrow(CSVrows)
        # Do preparatory stuff (define filename, etc.)
        # Open XLSX template
        for sn in XLSX.sheetnames(template)
            writetemplate(template, crow)
        end
        # push!(outfiles, out_file)
        # push!(filled_templates, template)
        XLSX.writexlsx(out_file, template, overwrite=true)
    end

This works. I did try to make this whole loop work with threads but there is too much going on and I couldn’t get it working. I tried to focus just on the slow bit instead.

This loop is also part of an outer loop that iterates over multiple batches of data in different folders, processing each folder in turn. It all works but takes many minutes.

The commented push! commands are where I was trying to create the arrays of filenames and filled templates to use in threads.

TimG · September 1, 2024, 8:15pm

Thank you for the suggestion @danielwe. Unfortunately, adding @sync doesn’t seem to have made a difference. The files still aren’t being written.

I was trying to model my approach off this advice, which also doesn’t use @sync.

danielwe · September 1, 2024, 8:29pm

That code is synchronized through the sum(fetch, tasks) call, which needs to call fetch on each task and thus will not return until all tasks are finished. This is one of the “other means of synchronization” I mentioned.
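A small sketch of that pattern, assuming a plain sum over integers as the workload: each chunk gets its own task, and fetch both waits for the task and returns its result, so no @sync is needed.

```julia
data = collect(1:100)
chunk_len = cld(length(data), Threads.nthreads())

# Spawn one task per chunk; each task returns a partial sum.
tasks = map(Iterators.partition(data, chunk_len)) do chunk
    Threads.@spawn sum(chunk)
end

# fetch blocks until the corresponding task is done, so this line
# cannot complete before every task has finished.
total = sum(fetch, tasks)
println(total)
```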

Unfortunately, I don’t have any further insight regarding your problem.

TimG · September 1, 2024, 8:42pm

I put the @sync command one level higher:

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@sync Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

This did better, but showed:

ERROR: LoadError: TaskFailedException

    nested task error: MethodError: no method matching writexlsx(::SubArray{String, 0, Vector{String}, Tuple{Int64}, true}, ::SubArray{XLSX.XLSXFile, 0, Vector{XLSX.XLSXFile}, Tuple{Int64}, true}; overwrite::Bool)

With that hint, I changed to:

     XLSX.writexlsx(o[1], f[1], overwrite=true)

This appears to work - all the (correctly named) Excel files are created. ~~Unfortunately, their content is all just repeats of a small subset of the templates. The one to one relationship between the two arrays has been lost.~~
Edit: This was me in a separate change, now fixed.
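The MethodError above can be reproduced without XLSX.jl. As a sketch with made-up file names: @view applied to a single index returns a zero-dimensional SubArray wrapping the element, not the element itself, which is why indexing with [1] (or the empty index []) was needed to get the String back.

```julia
names = ["a.xlsx", "b.xlsx"]   # hypothetical file names

o = @view(names[1])
println(typeof(o))     # a 0-dimensional SubArray, not a String
println(ndims(o))

# Ways to unwrap the element from the 0-dimensional view:
println(o[])           # empty index
println(o[1])          # linear index, as used in the post

# Plain indexing avoids the wrapper altogether:
println(names[1] isa String)
```

For single elements, plain indexing such as outfiles[i] gives the value directly, so the views add nothing here.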

danielwe · September 1, 2024, 9:05pm

TimG:

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@sync Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

This makes your code sequential—each iteration spawns a task and then waits until it finishes before moving to the next iteration. So with this code, you might as well skip the spawning and syncing entirely. Does removing the spawn and sync macros from this code change the behavior/output at all? If so, that is very surprising.
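The difference can be seen in a small timing sketch, assuming sleep as a stand-in for slow work (sleep yields to other tasks, so the effect shows even with a single thread). With @sync inside the loop, each task is awaited before the next one starts; with @sync around the loop, the tasks overlap.

```julia
work() = sleep(0.2)   # hypothetical stand-in for a slow file write

# @sync inside the loop: spawn one task, wait for it, repeat.
t_inner = @elapsed for _ in 1:4
    @sync Threads.@spawn work()
end

# @sync around the loop: spawn all tasks, then wait once.
t_outer = @elapsed @sync for _ in 1:4
    Threads.@spawn work()
end

println("per-iteration @sync: ", round(t_inner, digits=2), " s")
println("outer @sync:         ", round(t_outer, digits=2), " s")
```

The first timing should come out near four times the sleep duration and the second near one, though exact values depend on the machine and on compilation overhead.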

lmiq · September 1, 2024, 10:04pm

What happens if instead of writing the file you just print the thread id and the file name?

TimG · September 1, 2024, 10:23pm

If I use

   Threads.@spawn begin
        Threads.@sync for i in idcs

The script runs to completion and all the files are correctly created in all the right places. To all intents and purposes, everything works. However, on exit, I get terminated with exit code: -1073740940.

 *  The terminal process "C:\Users\TGebbels\.julia\juliaup\julia-1.10.5+0.x64.w64.mingw32\bin\julia.exe '--color=yes', '--startup-file=no', '--history-file=no', 'c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.120.2\scripts\debugger\run_debugger.jl', '\\.\pipe\vsc-jl-dbg-8456b44a-84e1-4207-bbe6-e4ef97a4444c', '\\.\pipe\vsc-jl-dbg-a9faa164-10bc-449a-b912-b6604e377fb2', '\\.\pipe\vsc-jl-cr-298a8501-1dec-4ec6-a109-c8a5c411e93e'" terminated with exit code: -1073740940.

If I take out the @spawn and @sync, I get the same results but a clean exit. What does the exit code mean, and does it matter?
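One way to start decoding the exit code: Windows reports it as a signed 32-bit integer, and reinterpreting it as unsigned gives the usual hexadecimal NTSTATUS form.

```julia
exit_code = Int32(-1073740940)

# Same 32 bits, viewed as an unsigned value.
ntstatus = reinterpret(UInt32, exit_code)
hex = string(ntstatus, base = 16)

println("0x", uppercase(hex))
```

The result, 0xC0000374, is listed in Microsoft's NTSTATUS documentation as STATUS_HEAP_CORRUPTION. That would point towards a crash in native code while the process shuts down rather than a logic error in the loop itself, but this reading is an inference and is not confirmed anywhere in the thread.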

lmiq · September 1, 2024, 10:39pm

There you have a single task being spawned, so it’s not parallel.

(Or, more likely, I don’t get what you did last)

What if you just use

    @time begin
        Threads.@threads for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            for i in idcs
                o = @view(outfiles[i])
                f = @view(filled_templates[i])
                XLSX.writexlsx(o, f, overwrite=true)
            end
        end
    end

(ps: those views are unnecessary there)

danielwe · September 1, 2024, 10:59pm

Once again, you’re not waiting for the tasks to finish, so the completion of your script does not mean the tasks are completed. The inner @sync is doing nothing here. From your error message, it looks like you’re running your script in the debugger, which I don’t have experience with. Maybe there’s an issue with multitasking in that context?

I think there are still multiple concurrent tasks being spawned. The outer loop was just omitted from the quoted code.

TimG · September 1, 2024, 11:12pm

Ah, yes. I’ve put the sync on the wrong loop! Thanks.

danielwe:

it looks like you’re running your script in the debugger

Not my intention. I always choose Run Without Debugging from the menu.

danielwe:

I think there are still multiple concurrent tasks being spawned. The outer loop was just omitted from the quoted code.

That was my intention.

danielwe · September 1, 2024, 11:15pm

How about you try just running a completely trivial multithreaded loop, like @lmiq suggested further up the thread? Something like this

    @time begin
        Threads.@threads for i in 1:10
            @show Threads.threadid()
            @show current_task()
        end
    end

Does that work?

danielwe · September 1, 2024, 11:22pm

Perhaps, in the VSCode extension, the debugger is always responsible for executing scripts whether or not the debugging functionality is enabled. In that case, it seems unlikely that such a basic feature as asynchronous tasks would be unsupported. But I don’t use VSCode, so I can’t easily try it out myself.

TimG · September 1, 2024, 11:26pm

Thanks @lmiq. I tried this and it works, but still with the improper termination at the end of the script.

I tried this before (I’m sure I did!) but without the chunking. I had understood from the docs that @threads did this itself and that is what differentiated it from @spawn. Anyway, whatever I tried before didn’t work.

Just tried again

    Threads.@sync Threads.@threads for i in eachindex(outfiles)
        XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
    end

and (now) it works! (but still with the improper termination).

danielwe · September 1, 2024, 11:28pm

Quick note: you don’t need @sync when using Threads.@threads. The latter takes care of synchronization for you. @sync is only needed when manually spawning tasks with Threads.@spawn or @async.
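As a minimal sketch with a made-up workload, the loop below has no @sync, yet the array is fully populated by the time the next line runs, because Threads.@threads itself blocks until all iterations are done.

```julia
squares = zeros(Int, 50)

# No @sync needed: the macro returns only after every iteration has run.
Threads.@threads for i in eachindex(squares)
    squares[i] = i^2
end

println(squares[end])
```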

danielwe · September 1, 2024, 11:37pm

Correct, Threads.@threads divides the iteration range into chunks similar to what ChunkSplitters.jl does, although the latter in combination with @spawn provides a lot more flexibility. (Actually, Julia 1.11 will add a non-chunking mode Threads.@threads :greedy, but the default mode still chunks).
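To see the chunking in action, one option is to record which thread handled each index. This sketch uses the :static schedule, which creates one task per thread and pins it there, so threadid() is a stable label; under the default schedule tasks may migrate between threads, making threadid() a less reliable indicator. The array name owner is made up for the illustration.

```julia
n = 12
owner = zeros(Int, n)   # owner[i] = id of the thread that ran iteration i

Threads.@threads :static for i in 1:n
    owner[i] = Threads.threadid()
end

# Group the indices by the thread that handled them.
for tid in unique(owner)
    println("thread ", tid, " handled indices ", findall(==(tid), owner))
end
```

Run with several threads (for example julia -t 4), each thread would be expected to report a block of indices rather than scattered ones.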

TimG · September 1, 2024, 11:38pm

So, I tried

    Threads.@sync for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
       Threads.@spawn begin
          for i in idcs
              println(Threads.threadid(), "  ", first(basename(outfiles[i]), 12))
              XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
          end
       end
    end

and got

1  GCS-00000001
2  GCS-00000011
3  GCS-00000020
4  GCS-00000029
1  GCS-00000002
1  GCS-00000003
2  GCS-00000012
3  GCS-00000021
4  GCS-00000030
1  GCS-00000004
1  GCS-00000005
4  GCS-00000031
3  GCS-00000022
2  GCS-00000013
1  GCS-00000006
3  GCS-00000023
4  GCS-00000032
2  GCS-00000014
1  GCS-00000007
1  GCS-00000008
4  GCS-00000033
2  GCS-00000015
3  GCS-00000024
1  GCS-00000009
4  GCS-00000034
1  GCS-00000010
2  GCS-00000016
3  GCS-00000025
4  GCS-00000035
3  GCS-00000026
2  GCS-00000017
4  GCS-00000036
2  GCS-00000018
3  GCS-00000027
4  GCS-00000037
2  GCS-00000019
3  GCS-00000028

But still got an improper termination code!

danielwe · September 1, 2024, 11:41pm

What if you remove the XLSX line? The point is to see if the termination code still appears when you remove the actual work and only do trivial things in the loop.
