Yeah, with @threads there always seems to be some reason to manually split the workload over the threads.
- In this case it’s because you want each thread to reduce into its own local accumulator before combining the threads’ results. A similar issue comes up in Parallel is very slow - #17 by Elrod.
- Another case: Parallelizing for loop in the computation of a gradient - #18 by saschatimme (perhaps not as frequently encountered as the first, but it still comes up).
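To make the first bullet concrete, here’s a minimal sketch of the manual pattern being described: split the indices into one chunk per thread, let each thread reduce its chunk into a private accumulator, and combine the partial results at the end. The name `threaded_sum` is just for illustration, not an existing API.

```julia
using Base.Threads

# Hypothetical helper illustrating manual per-thread reduction with @threads.
function threaded_sum(xs)
    # One contiguous chunk of indices per thread.
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nthreads())))
    partials = zeros(eltype(xs), length(chunks))
    @threads for c in 1:length(chunks)
        acc = zero(eltype(xs))      # thread-local accumulator
        for i in chunks[c]
            acc += xs[i]
        end
        partials[c] = acc           # each iteration writes only its own slot
    end
    return sum(partials)            # serial combine of the per-chunk results
end
```

All of this chunking and combining is boilerplate the user has to write by hand, which is exactly the annoyance above.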
Maybe @threads in its current form is just not the right abstraction? At the very least, a mapreduce-style version of @threads, similar to @parallel’s reduction form, would be pretty useful I think.
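For the sake of discussion, a mapreduce-style threaded helper could look something like the sketch below (written against `Threads.@spawn`, so it assumes a recent Julia; `tmapreduce` is a made-up name, not something in Base). It hides the chunking/combining boilerplate behind the same `(f, op)` interface as `mapreduce`:

```julia
using Base.Threads

# Hypothetical mapreduce-style counterpart to @threads: apply f to each
# element and reduce with op, parallelizing over one chunk per thread.
function tmapreduce(f, op, xs)
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nthreads())))
    # Reduce each chunk on its own task...
    tasks = map(chunks) do chunk
        Threads.@spawn mapreduce(i -> f(xs[i]), op, chunk)
    end
    # ...then combine the per-task partial results.
    return mapreduce(fetch, op, tasks)
end

tmapreduce(abs2, +, 1:10)  # same result as mapreduce(abs2, +, 1:10)
```

Note `op` is assumed associative here, same as for `@parallel (op)` reductions.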