Help needed getting started with threads!

I can’t get threads to work for me despite the sense I’ve picked up here that it should be easy.

I have a large number (hundreds) of Excel files to write out. I build them from a largeish dataset by filling in a simple template with XLSX.jl and then write them out sequentially. This takes a bit of time, so I thought I’d try using multiple threads, more as a learning exercise than in expectation of a significant speed-up (I know writing to disk is slow).

I can’t get it to work! :confused:

I have two arrays, one containing the intended file names and one containing the corresponding XLSXFile objects.

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

The values of idcs, outfiles and filled_templates are what I’d expect. When I run this with a set of around 150 Excel templates and 4 threads, it takes between one and three seconds to run. No error is generated and my code runs on, but no Excel files are created.

I’ve tried many variations using @spawn, @threads, OhMyThreads’ @tasks, etc., all to no avail.

I expect I’m falling into one of several traps multi-threading offers, but I can’t figure it out.

Do you have any (simple) pointers for me to follow?

Thanks!

Hi there! What does the sequential for loop look like, and does it work?

I don’t know if this is your main problem, but this code is missing a @sync (or other means of synchronization) to wait until all the spawned tasks have finished. Try replacing for idcs in ... with @sync for idcs in ... on the outer loop.
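
To illustrate the point about @sync, here is a minimal sketch that uses only Base. The `results` array and the `work!` function are hypothetical stand-ins for the real file writing, and `Iterators.partition` is used in place of `chunks` from ChunkSplitters.jl so the snippet has no dependencies.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for the real work (e.g. writing one file).
results = zeros(Int, 100)
work!(i) = (results[i] = i^2)

# Split the indices into roughly one chunk per thread using only Base.
chunksize = cld(length(results), nthreads())

# @sync on the outer loop waits for every task spawned inside it.
@sync for idcs in Iterators.partition(eachindex(results), chunksize)
    @spawn for i in idcs
        work!(i)
    end
end

# All tasks have finished by the time we get here.
println(all(!iszero, results))
```

Without the @sync, the loop would return as soon as the tasks are scheduled, and the script could carry on (or exit) before any of them has done its work.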

Grossly simplified,

for crow in eachrow(CSVrows)
    # Do preparatory stuff (define filename, etc)
    # Open XLSX template
    for sn in XLSX.sheetnames(template)
        writetemplate(template, crow)
    end
    # push!(outfiles, out_file)
    # push!(filled_templates, template)
    XLSX.writexlsx(out_file, template, overwrite=true)
end

This works. I did try to make this whole loop work with threads but there is too much going on and I couldn’t get it working. I tried to focus just on the slow bit instead.

This loop is also part of an outer loop that iterates over multiple batches of data in different folders, processing each folder in turn. It all works but takes many minutes.

The commented push! commands are where I was trying to create the arrays of filenames and filled templates to use in threads.

Thank you for the suggestion @danielwe. Unfortunately, adding @sync doesn’t seem to have made a difference. The files still aren’t being written.

I was trying to model my approach off this advice, which also doesn’t use @sync.

That code is synchronized through the sum(fetch, tasks) call, which needs to call fetch on each task and thus will not return until all tasks are finished. This is one of the “other means of synchronization” I mentioned.
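
A small sketch of that pattern, assuming nothing beyond Base (the `data` vector is made up for illustration): each spawned task returns a partial result, and calling fetch on every task is what makes the caller wait.

```julia
using Base.Threads: @spawn, nthreads

data = collect(1:1_000)                      # hypothetical input
chunksize = cld(length(data), nthreads())

# One task per chunk; each task returns its partial sum.
tasks = map(Iterators.partition(data, chunksize)) do chunk
    @spawn sum(chunk)
end

# fetch blocks until its task is done, so this line is also the synchronization point.
total = sum(fetch, tasks)
println(total)
```

If the tasks only perform side effects (such as writing files) and return nothing useful, `foreach(wait, tasks)` or an enclosing @sync serves the same purpose.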

Unfortunately, I don’t have any further insight regarding your problem.

I put the @sync command one level higher:

    @time begin
        for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            Threads.@sync Threads.@spawn begin
                for i in idcs
                    o = @view(outfiles[i])
                    f = @view(filled_templates[i])
                    XLSX.writexlsx(o, f, overwrite=true)
                end
            end
        end
    end

This did better, but showed:

ERROR: LoadError: TaskFailedException

    nested task error: MethodError: no method matching writexlsx(::SubArray{String, 0, Vector{String}, Tuple{Int64}, true}, ::SubArray{XLSX.XLSXFile, 0, Vector{XLSX.XLSXFile}, Tuple{Int64}, true}; overwrite::Bool)

With that hint, I changed to:

     XLSX.writexlsx(o[1], f[1], overwrite=true)

This appears to work - all the (correctly named) Excel files are created. Unfortunately, their content is all just repeats of a small subset of the templates. The one to one relationship between the two arrays has been lost.
Edit: This was me in a separate change, now fixed.
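
For anyone hitting the same MethodError: a view with a scalar index is a zero-dimensional SubArray, not the element itself. A minimal sketch with made-up file names:

```julia
names = ["a.xlsx", "b.xlsx"]   # hypothetical file names

v = @view names[2]             # scalar index: a zero-dimensional SubArray, not a String
println(typeof(v))
println(v isa AbstractString)  # false, so methods expecting a String do not match

println(v[])                   # unwrap the single element
println(names[2])              # plain indexing returns the String directly
```

So `o[1]` works because it unwraps the view, but plain `outfiles[i]` gets the same element without the detour.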

This makes your code sequential—each iteration spawns a task and then waits until it finishes before moving to the next iteration. So with this code, you might as well skip the spawning and syncing entirely. Does removing the spawn and sync macros from this code change the behavior/output at all? If so, that is very surprising.
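
The difference can be made visible with a sketch that logs the order of events. Everything here is hypothetical: `run_jobs` and `job` are illustration names, and `sleep` stands in for the real work.

```julia
using Base.Threads: @spawn

# Run `n` dummy jobs and record the order of start/finish events.
function run_jobs(n; sync_each::Bool)
    events = Channel{String}(Inf)          # thread-safe event log
    job(i) = (put!(events, "start $i"); sleep(0.2); put!(events, "end $i"))
    if sync_each
        for i in 1:n
            @sync @spawn job(i)            # waits here before the next iteration
        end
    else
        @sync for i in 1:n
            @spawn job(i)                  # all tasks are launched, then awaited together
        end
    end
    close(events)
    return collect(events)
end

println(run_jobs(3; sync_each=true))
println(run_jobs(3; sync_each=false))
```

With @sync inside the loop, each job should start only after the previous one has ended. With @sync around the loop, the jobs should all start before any of them finishes.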

What happens if instead of writing the file you just print the thread id and the file name?

If I use

   Threads.@spawn begin
        Threads.@sync for i in idcs

The script runs to completion and all the files are correctly created in all the right places. To all intents and purposes, everything works. However, on exit, I get terminated with exit code: -1073740940.

 *  The terminal process "C:\Users\TGebbels\.julia\juliaup\julia-1.10.5+0.x64.w64.mingw32\bin\julia.exe '--color=yes', '--startup-file=no', '--history-file=no', 'c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.120.2\scripts\debugger\run_debugger.jl', '\\.\pipe\vsc-jl-dbg-8456b44a-84e1-4207-bbe6-e4ef97a4444c', '\\.\pipe\vsc-jl-dbg-a9faa164-10bc-449a-b912-b6604e377fb2', '\\.\pipe\vsc-jl-cr-298a8501-1dec-4ec6-a109-c8a5c411e93e'" terminated with exit code: -1073740940. 

If I take out the @spawn and @sync, I get the same results but a clean exit. What does the exit code mean, and does it matter?
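
On what the number means: Windows exit codes like this are usually NTSTATUS values printed as signed 32-bit integers, so the first step is to look at the hexadecimal form. A quick sketch of the conversion:

```julia
code = -1073740940                      # exit code reported by the terminal

# Reinterpret the signed 32-bit value as unsigned and print it in hexadecimal.
status = reinterpret(UInt32, Int32(code))
println("0x", string(status, base=16))
```

The value 0xC0000374 is listed in Microsoft's NTSTATUS table as STATUS_HEAP_CORRUPTION. That only names the symptom; nothing in this thread establishes which part of the code is responsible.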

There you have a single task being spawned, so it’s not parallel.

(Or, more likely, I don’t get what you did last)

What if you just use

    @time begin
        Threads.@threads for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
            for i in idcs
                o = @view(outfiles[i])
                f = @view(filled_templates[i])
                XLSX.writexlsx(o, f, overwrite=true)
            end
        end
    end

(ps: those views are unnecessary there)

Once again, you’re not waiting for the tasks to finish, so the completion of your script does not mean the tasks are completed. The inner @sync is doing nothing here. From your error message, it looks like you’re running your script in the debugger, which I don’t have experience with. Maybe there’s an issue with multitasking in that context?
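
A minimal sketch of why an @sync placed inside the spawned block does not help, with `sleep` as a hypothetical stand-in for the work. @sync only waits for tasks created within its own block, and here none are.

```julia
using Base.Threads: @spawn

done = Threads.Atomic{Bool}(false)

t = @spawn begin
    @sync for i in 1:3      # no tasks are created inside, so this @sync waits for nothing
        sleep(0.1)
    end
    done[] = true
end

before = done[]             # the caller did not wait for the spawned task
println("right after @spawn: done = ", before)

wait(t)                     # explicit synchronization on the spawned task
println("after wait(t):      done = ", done[])
```

The caller has to wait on the spawned task itself, via `wait`, `fetch`, or an @sync that encloses the @spawn.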

I think there are still multiple concurrent tasks being spawned. The outer loop was just omitted from the quoted code.

Ah, yes. I’ve put the sync on the wrong loop! Thanks.

“it looks like you’re running your script in the debugger”

Not my intention. I always choose Run Without Debugging from the menu.

“I think there are still multiple concurrent tasks being spawned. The outer loop was just omitted from the quoted code.”

That was my intention.

How about you try just running a completely trivial multithreaded loop, like @lmiq suggested further up the thread? Something like this

@time begin
    Threads.@threads for i in 1:10
        @show Threads.threadid()
        @show current_task()
    end
end

Does that work?

Perhaps, in the VSCode extension, the debugger is always responsible for executing scripts whether or not the debugging functionality is enabled. In that case, it seems unlikely that such a basic feature as asynchronous tasks would be unsupported. But I don’t use VSCode, so I can’t easily try it out myself.

Thanks @lmiq. I tried this and it works, but still with the improper termination at the end of the script.

I tried this before (I’m sure I did! :thinking:) but without the chunking. I had understood from the docs that @threads did this itself and that is what differentiated it from @spawn. Anyway, whatever I tried before didn’t work.

Just tried again

    Threads.@sync Threads.@threads for i in eachindex(outfiles)
           XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
    end

and (now) it works! (but still with the improper termination).

Quick note: you don’t need @sync when using Threads.@threads. The latter takes care of synchronization for you. @sync is only needed when manually spawning tasks with Threads.@spawn or @async.
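
A minimal sketch of that, with a made-up `squares` array in place of the file writing: no @sync anywhere, and the results are complete as soon as the loop returns.

```julia
squares = zeros(Int, 50)    # hypothetical output buffer

Threads.@threads for i in eachindex(squares)
    squares[i] = i^2        # each iteration writes to its own slot, so no locking is needed
end

# The loop above only returns once every iteration has run.
println(all(squares[i] == i^2 for i in eachindex(squares)))
```

Wrapping such a loop in @sync is harmless, just redundant.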

Correct, Threads.@threads divides the iteration range into chunks similar to what ChunkSplitters.jl does, although the latter in combination with @spawn provides a lot more flexibility. (Actually, Julia 1.11 will add a non-chunking mode Threads.@threads :greedy, but the default mode still chunks).

So, I tried

    Threads.@sync for idcs in chunks(eachindex(outfiles); n=Threads.nthreads())
       Threads.@spawn begin
          for i in idcs
              println(Threads.threadid(), "  ", first(basename(outfiles[i]), 12))
              XLSX.writexlsx(outfiles[i], filled_templates[i], overwrite=true)
          end
       end
    end

and got

1  GCS-00000001
2  GCS-00000011
3  GCS-00000020
4  GCS-00000029
1  GCS-00000002
1  GCS-00000003
2  GCS-00000012
3  GCS-00000021
4  GCS-00000030
1  GCS-00000004
1  GCS-00000005
4  GCS-00000031
3  GCS-00000022
2  GCS-00000013
1  GCS-00000006
3  GCS-00000023
4  GCS-00000032
2  GCS-00000014
1  GCS-00000007
1  GCS-00000008
4  GCS-00000033
2  GCS-00000015
3  GCS-00000024
1  GCS-00000009
4  GCS-00000034
1  GCS-00000010
2  GCS-00000016
3  GCS-00000025
4  GCS-00000035
3  GCS-00000026
2  GCS-00000017
4  GCS-00000036
2  GCS-00000018
3  GCS-00000027
4  GCS-00000037
2  GCS-00000019
3  GCS-00000028

But still got an improper termination code!

What if you remove the XLSX line? The point is to see if the termination code still appears when you remove the actual work and only do trivial things in the loop.
