The issue appears to be due to the behavior of threads in 0.6. I’ve had a similar problem (see Multithreading performance regressions in 0.6?). You can solve this by wrapping the threaded loop in its own function and calling that. Doing that, I get `dx!` to be about twice as fast as `dx`.
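Here is a minimal sketch of what I mean by wrapping the threaded loop in its own function. The names `central_diff_kernel!` and `central_diff` and the finite-difference body are hypothetical stand-ins, not the code from the question. As far as I understand it, the slowdown comes from the closure that `Threads.@threads` generates capturing variables from the enclosing function, so keeping the loop in a small function that only sees its arguments avoids that.

```julia
# Hypothetical kernel: the threaded loop is the only thing in this function,
# so the closure created by Threads.@threads captures just the arguments.
function central_diff_kernel!(out, x, h)
    Threads.@threads for i in 2:length(x)-1
        out[i] = (x[i+1] - x[i-1]) / (2h)
    end
    return out
end

# Hypothetical outer function: allocation and setup happen here,
# and the threaded work is delegated to the kernel above.
function central_diff(x, h)
    out = zeros(eltype(x), length(x))  # endpoints are left at zero
    central_diff_kernel!(out, x, h)
    return out
end
```

You would call it as `central_diff(x, h)`. The number of threads is set with the `JULIA_NUM_THREADS` environment variable before starting Julia.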
Also, you should run each function once before timing it; otherwise the measurement includes the compilation time, which is only incurred on the first call.
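For the timing, something along these lines should work. `sum_squares` is only a placeholder workload I made up for illustration; substitute the functions you are comparing.

```julia
# Hypothetical workload standing in for the function being benchmarked.
sum_squares(v) = sum(abs2, v)

v = rand(10^6)

sum_squares(v)        # first call: triggers compilation, result is discarded
@time sum_squares(v)  # second call: measures run time without compilation
```

If you have the BenchmarkTools package installed, its `@btime` macro takes care of the warm-up and repeated sampling for you.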