I am trying to write a good thread-parallel implementation of a problem that can be described quickly like this:
- The user has to do n tasks, each of which requires a workspace.
- They can provide any number of workspaces they want—say they provide m of them.
- Finally, they might be running it with any number of threads, say l, and it is possible that l > m.
What’s the correct way to get as much benefit from multi-threading as possible while staying aware of the limit of m workspaces?
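For concreteness, the pattern I keep circling back to is a blocking pool: put the m workspaces in a `Channel`, and have each task `take!` one and `put!` it back when done, so at most m tasks hold a workspace at any moment. Here's a minimal sketch of that idea (the name `pool_map` is made up, and I'm not at all sure this is the idiomatic way, hence the question):

```julia
using LinearAlgebra

# Hypothetical sketch: a Channel used as a blocking pool of m workspaces.
# Tasks beyond the first m block on take! until a workspace is returned.
function pool_map(f, n, workspaces)
    pool = Channel{eltype(workspaces)}(length(workspaces))
    foreach(w -> put!(pool, w), workspaces)
    out = zeros(n)
    @sync for i in 1:n
        Threads.@spawn begin
            w = take!(pool)      # blocks if all workspaces are in use
            try
                out[i] = f(i, w)
            finally
                put!(pool, w)    # always return the workspace to the pool
            end
        end
    end
    out
end
```

This caps the number of in-flight workspace users at m regardless of `Threads.nthreads()`, but I don't know if spawning n tasks that mostly block is considered acceptable.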
Here’s a somewhat motivating example: say you wanted the log-determinant of some number of giant matrices. You don’t have enough RAM to create all of them at once, but perhaps you have enough RAM to create three or four of them. Other parts of your program benefit from using many more threads than that, though, so you’re running with more than three or four threads. Here’s an MWE that works when l ≤ m:
```julia
using LinearAlgebra, StableRNGs

BLAS.set_num_threads(1)

# Atomic doesn't seem to be designed for this type.
const workspaces = [zeros(100, 100) for _ in 1:3]

function dostuff(n, spaces)
    logdets = zeros(n)
    Threads.@threads for i in 1:n
        # Choose some _available_ workspace. If length(spaces) is at least
        # Threads.nthreads() this works, but if not then obviously this fails,
        # which makes me think this isn't the right way to do this.
        w = spaces[Threads.threadid()]
        s = size(w, 1)
        # Fill the buffer with something:
        x = randn(StableRNG(i), s)
        for k in 1:s
            xk = x[k]
            @simd for j in 1:s
                xj = x[j]
                @inbounds w[j, k] = exp(-abs(xj - xk))
            end
        end
        # Put the computed value in your output array:
        w_fact = cholesky!(w)
        logdets[i] = logdet(w_fact)
    end
    logdets
end

dostuff(30, workspaces)
```
But the fact that this breaks for l > m makes me think this is not the correct way to write something like this. I’m aware of the `Threads.Atomic` structure, but I gather that a collection of matrices is not really that object’s intended use case.
Can somebody more knowledgeable than me provide any guidance? Thanks for reading!