I think the least invasive way to update code that previously relied on threadid() is to use ChunkSplitters.jl: just replace the threaded loop over the elements with a threaded loop over chunks of the data:
julia> using ChunkSplitters, Base.Threads

julia> my_arr = rand(10_000);

julia> nchunks = 10;

julia> my_sum = zeros(nchunks);  # one accumulator per chunk

julia> @threads for (ichunk, inds) in enumerate(index_chunks(my_arr; n=nchunks))
           my_sum[ichunk] += sum(@view(my_arr[inds]))
       end

julia> sum(my_sum)
5033.886812176603
# the loop above replaces the old threadid()-based pattern:
julia> my_sum = zeros(10);

julia> @threads for i in eachindex(my_arr)
           my_sum[threadid()] += my_arr[i]
       end

julia> sum(my_sum)
5033.886812176624
That said, OhMyThreads.jl is a higher-level alternative and is probably the better option in most cases, once you have made the small initial effort of restructuring the parallel code.
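To give an idea of what that looks like, here is a minimal sketch of the same sum written with OhMyThreads.jl. It assumes a recent OhMyThreads release that exports `tmapreduce` and the `@tasks` macro; the variable names are just the ones from the example above, and nothing here is specific to your code.

```julia
using OhMyThreads  # assumed: a release exporting tmapreduce, @tasks and @set

my_arr = rand(10_000)

# Functional form: apply `identity` to each element and combine with `+`.
# The chunking and the per-task partial sums are handled internally.
total = tmapreduce(identity, +, my_arr)

# Loop form: `@set reducer = +` tells @tasks how to combine the value
# of the last expression of each iteration.
total_loop = @tasks for i in eachindex(my_arr)
    @set reducer = +
    my_arr[i]
end
```

Both forms avoid indexing anything by thread or chunk number, which is why they usually need a bit more restructuring than the ChunkSplitters.jl version.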
PS: In your case, you would do:
using ChunkSplitters: chunks
using Base.Threads: @threads

buffers = create_thread_buffers(wf, nth)

# Parallel loop over chunks of the index range, using @threads
@threads for (tid, iGP_range) in enumerate(chunks(1:length(mx); n=nth))
    # Get the buffers for this chunk (tid is now the chunk index, not threadid())
    GP = buffers.thread_buffers[tid]
    comp_buffers = buffers.thread_comp_buffers[tid]
    for iGP in iGP_range
        # current calculations using iGP
    end
end
(Note that with this approach nth does not need to be equal to nthreads(). That can be useful to limit the number of threads used, if nth < nthreads(), or to increase the number of tasks, which sometimes improves workload balance, if nth >> nthreads().)
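As a small illustration of that last point, here is a sketch where the number of chunks is a free parameter. The function `sum_squares` and its workload are hypothetical, chosen only so the example is self-contained; it assumes ChunkSplitters.jl 3.x, where `index_chunks` is available.

```julia
using ChunkSplitters: index_chunks
using Base.Threads: @threads, nthreads

# Hypothetical workload: sum of squares, with one accumulator per chunk.
function sum_squares(x; nth = nthreads())
    partial = zeros(nth)  # sized by the number of chunks, not by nthreads()
    @threads for (ichunk, inds) in enumerate(index_chunks(x; n = nth))
        for i in inds
            partial[ichunk] += x[i]^2
        end
    end
    return sum(partial)
end

x = collect(1.0:100.0)
few  = sum_squares(x; nth = 2)   # fewer chunks than threads, if nthreads() > 2
many = sum_squares(x; nth = 64)  # many more chunks than threads
```

Since the accumulators are indexed by chunk, the result does not depend on how many threads Julia was started with, only the distribution of the work does.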