I just published simpleNN.jl, a simple neural network example script based on the original post: macOS Python with numpy faster than Julia in training neural network.
The original answer from @ChrisRackauckas contains a lot of optimization tips, and these are implemented in the first version of the script. Versions 2-4 show some additional sequential changes and improvements. The main point: when I use matrix multiplication (v3), a training epoch becomes about 10x faster.
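
To illustrate the kind of change meant by "use matrix multiplication", here is a minimal sketch, not the actual code from simpleNN.jl. It assumes the slower variant processes one sample at a time with a matrix-vector product, while the faster one pushes the whole batch through a single matrix-matrix product. The layer sizes and the names `forward_loop`, `forward_batched`, `W`, `b`, and `X` are hypothetical and chosen only for this example.

```julia
# Hypothetical layer sizes and data, for illustration only.
n_in, n_out, n_samples = 4, 3, 5
W = reshape(collect(1.0:(n_out * n_in)), n_out, n_in) ./ 10         # weights, n_out × n_in
b = fill(0.5, n_out)                                                # bias, length n_out
X = reshape(collect(1.0:(n_in * n_samples)), n_in, n_samples) ./ 10 # one sample per column

sigmoid(z) = 1 / (1 + exp(-z))

# Per-sample version: one matrix-vector product per column of X.
function forward_loop(W, b, X)
    out = similar(X, size(W, 1), size(X, 2))
    for j in axes(X, 2)
        out[:, j] = sigmoid.(W * view(X, :, j) .+ b)
    end
    return out
end

# Batched version: a single matrix-matrix product covers all samples,
# and the bias vector is broadcast across the columns.
forward_batched(W, b, X) = sigmoid.(W * X .+ b)

A_loop = forward_loop(W, b, X)
A_batched = forward_batched(W, b, X)

# Both versions compute the same activations.
println(A_loop ≈ A_batched)
```

Both functions return the same `n_out × n_samples` activation matrix. The batched form hands the whole computation to one BLAS call instead of many small ones, which is usually where the speedup comes from. The actual gain depends on the layer sizes and batch size, so it is worth timing both on your own data (for example with BenchmarkTools.jl).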