I forgot to mention this, but "Example for the Depth first multithread implementation performance gain as a motivation" reminded me that the early termination feature depends on Julia's scheduler being depth-first. The computed result is deterministic and independent of the scheduler. However, depth-first scheduling makes it possible to terminate as early as possible when the reduction is written in a divide-and-conquer style. That makes the implementation very straightforward, if not trivial. A big thanks to the Julia dev team!
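
To make the idea concrete, here is a minimal sketch of a divide-and-conquer search with early termination. It is not the actual implementation being discussed: `dac_findfirst`, the `cancelled` callback, and `basesize` are hypothetical names made up for illustration, and the sketch assumes `xs` is a one-dimensional collection with integer indices. The left half runs in the current task and the right half is spawned, so if the scheduler works through the leftmost pieces first (the depth-first behaviour mentioned above), a hit on the left can cancel the right before much work is wasted. The result is the same either way, because a left hit always wins over a right hit.

```julia
using Base.Threads: @spawn, Atomic

# Hypothetical helper: index of the first element of `xs` satisfying `pred`,
# or `nothing` if there is none. `cancelled()` returns true once some
# ancestor has already found a hit to the left of this range.
function dac_findfirst(pred, xs, lo, hi, cancelled, basesize)
    if hi - lo < basesize
        # Base case: sequential scan that gives up as soon as it is cancelled.
        for i in lo:hi
            cancelled() && return nothing  # result is discarded by the ancestor
            pred(xs[i]) && return i
        end
        return nothing
    end
    mid = (lo + hi) >>> 1
    # Flag that cancels only the right subtree of this split.
    right_cancel = Atomic{Bool}(false)
    right = @spawn dac_findfirst(pred, xs, mid + 1, hi,
                                 () -> cancelled() || right_cancel[], basesize)
    # Search the left half in the current task.
    left = dac_findfirst(pred, xs, lo, mid, cancelled, basesize)
    if left !== nothing
        right_cancel[] = true  # a left hit makes the right half irrelevant
        wait(right)
        return left
    end
    return fetch(right)
end

# Entry point: nothing is cancelled at the root.
dac_findfirst(pred, xs; basesize = 1024) =
    dac_findfirst(pred, xs, firstindex(xs), lastindex(xs), () -> false, basesize)
```

For example, `dac_findfirst(x -> x > 10, 1:100_000)` returns `11` no matter how the tasks are scheduled; only the amount of work skipped in the right halves depends on the scheduling order.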