Correct way to fetch results from distributed processes while also printing progress?

This is my setup:
I’m running a distributed computation where Futures return their own local results, and potential further (recursive) Futures that they have created. The initial Futures are stored in an array on the main thread, and new Futures will be added to this array after being fetched to the main thread as described below.

I would like for the main thread (or potentially multiple threads on the same machine) to pop! each Future from the array, fetch the data (this blocks), do some simple processing of the fetched data, and finally add the newly returned Futures into the array of futures and continue processing. I would also like the main thread to print some status on a timer, such as the number of Futures currently in the array, and the number of results processed so far. My attempts are as follows:

  • Spawning a new async Task for each Future - this is a somewhat successful approach, but the overhead/scaling is bad since I can end up with several 100k Tasks at once. It also seems to result in a bug where some Futures end up never getting fetched but simply disappearing.
  • Spawn only a few async Tasks that loop and process a single Future at a time. This is similar to what runtests.jl from the Julia repo does, but I simply cannot make it work - it seems like only one fetch() is run before everything is blocked and nothing happens. Also, for some reason, the fetch()es eventually end up never succeeding even though the distributed call has long since returned. It looks like the main thread just ends up idling, nothing ever happens.
  • Not doing anything asynchronously, just process each Future in a while length(futures) > 0 loop. This approach is the most successful, since it actually works consistently. However, mostly the main thread is completely idle waiting for whatever Future is currently blocking, so I would like an async/threaded solution.

I would post some code, but my distributed computation is very complicated, and I’m not sure what a good benchmark for that is. Perhaps I should try to reproduce some of my observations in a simple script.