I think you want a batch size (basesize) of 1.
Julia's Threads.@threads normally uses a batch size of about cld(length(x), Threads.nthreads()), i.e. it splits the work evenly across threads up front.
I recommend using ThreadsX.jl, which lets you set the batch size: in particular, ThreadsX.map and ThreadsX.foreach accept a basesize keyword, which I would set to basesize=1. On a computer with 36 threads:
julia> tids = ThreadsX.map(_->Threads.threadid(), 1:10Threads.nthreads(), basesize=1);
julia> tids[1:10]
10-element Vector{Int64}:
1
22
19
27
30
11
13
4
18
5
julia> tids[1:36] |> unique |> length
18
julia> tids[37:72] |> unique |> length
19
julia> tids = ThreadsX.map(_->(sleep(1e-1);Threads.threadid()), 1:10Threads.nthreads(), basesize=1);
julia> tids[1:36] |> unique |> length
24
julia> tids[37:72] |> unique |> length
25
Compare this to Threads.@threads:
julia> Threads.@threads :static for i in 1:10Threads.nthreads()
tids[i] = Threads.threadid()
end
julia> tids'
1×360 adjoint(::Vector{Int64}) with eltype Int64:
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 … 35 36 36 36 36 36 36 36 36 36 36
julia> tids == sort(tids)
true
ThreadsX with a basesize of 1 isn't cycling through threads in order, but it is running the tasks one at a time and assigning new work to threads as they finish, which I think is exactly what you want. ThreadsX.foreach is for when you don't want a return value, so you can replace for loops with it directly. Often, though, you're filling a vector in a loop, in which case ThreadsX.map is more convenient, since it takes care of allocating and filling the vector for you.
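To make the map/foreach distinction concrete, here is a small sketch (assuming ThreadsX.jl is installed; the workload i^2 is just a placeholder):

```julia
using ThreadsX

# map allocates and fills the result vector for you
squares = ThreadsX.map(i -> i^2, 1:100; basesize=1)

# foreach returns nothing; use it for in-place work on a
# preallocated array (writing to distinct indices is safe)
out = zeros(Int, 100)
ThreadsX.foreach(1:100; basesize=1) do i
    out[i] = i^2
end
```

Both loops do the same work; the map version is shorter whenever the result is a fresh vector.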
There are a lot of other nice convenient functions in there like ThreadsX.mapreduce or ThreadsX.findfirst that are worth taking a look at.
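For example (again assuming ThreadsX.jl; the predicate and reducer here are just illustrations):

```julia
using ThreadsX

# parallel sum of squares; reduction order differs from the
# serial version, which matters for floating point but not Int
s = ThreadsX.mapreduce(i -> i^2, +, 1:1000; basesize=1)

# first index whose element satisfies the predicate
idx = ThreadsX.findfirst(>(10), [3, 7, 12, 5, 20])
```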
Depending on your particular task, you can try different basesize values and benchmark to see how they perform. Being able to control that, plus the variety of convenience functions, makes ThreadsX quite nice to use for threading.
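A quick way to compare basesize values is to time the same map at a few settings; a rough sketch using Base's @elapsed (the workload here is hypothetical, deliberately uneven so scheduling matters):

```julia
using ThreadsX

# hypothetical uneven workload: cost grows with i
work(i) = sum(sin, 1:100i)

results = Dict{Int,Vector{Float64}}()
for bs in (1, 10, 100)
    t = @elapsed (results[bs] = ThreadsX.map(work, 1:200; basesize=bs))
    println("basesize=$bs: $(round(t; digits = 4)) s")
end
```

For serious measurements BenchmarkTools.jl's @btime is more reliable than a single @elapsed run, but this is enough to spot large differences.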