@jlperla thanks for the suggestions.
It looks like the authors took vectorized code from MATLAB & NumPy, translated it to Julia more or less as-is, & then found that Julia underperforms.
Using the suggestions above, I've found that if you simply replace the awkward/complicated vectorized parts of the code w/ simpler, more intuitive loops, Julia outperforms all the other languages except TF & PyTorch (which I don't have installed right now, so I couldn't compare against them).
If I also add some of the other suggestions given here, the code becomes harder to read & is only ~20% faster.
It's a miracle that in Julia the simpler, more intuitive, & less bug-prone code is also faster than NumPy/MATLAB.
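
To make the vectorized-vs-loop point concrete, here is a minimal sketch of the kind of rewrite I mean. It is a made-up toy problem, not the authors' benchmark code, and the names `rowmax_vectorized` and `rowmax_loop` are hypothetical. It only shows the two styles side by side and makes no claim about timings.

```julia
# Hypothetical toy example, NOT taken from the benchmark code.
# Task: for each a[i], compute the maximum over j of a[i] + b[j].

# Vectorized style, as it might look after a literal translation from MATLAB/NumPy:
# it builds a full n×m temporary matrix before reducing it.
function rowmax_vectorized(a::Vector{Float64}, b::Vector{Float64})
    M = a .+ b'                       # n×m temporary via broadcasting
    return vec(maximum(M, dims=2))    # maximum over each row, as a Vector
end

# Loop style: same result, written the way you would describe it in words,
# and without allocating the temporary matrix.
function rowmax_loop(a::Vector{Float64}, b::Vector{Float64})
    out = similar(a)
    for i in eachindex(a)
        best = -Inf
        for j in eachindex(b)
            best = max(best, a[i] + b[j])
        end
        out[i] = best
    end
    return out
end

a = rand(200)
b = rand(300)
@assert rowmax_vectorized(a, b) ≈ rowmax_loop(a, b)   # both styles agree
```

If you want to compare the two on your own machine, something like `@btime` from BenchmarkTools.jl is the usual way to do it; how big the gap is will depend on the problem size & the real code.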