Thank you for this write-up.
Can you comment on the following cases:
- When the running time of each iteration is very unpredictable, how can one select a good value for chunks_per_thread?
- What can be done with nested loops that can be parallelized?
There is a very elegant solution to these problems employed by cilk_for of OpenCilk, with work stealing and a loop grainsize hint. Can we utilize such techniques with Julia as well?