Not sure about the original question, but in the Julia code you're not actually initializing the input arrays.
Uninitialized memory holds arbitrary bit patterns, so your test inputs may contain subnormal numbers, which can adversely affect performance. See e.g. the thread "50x speed difference in gemv for different values in vector" (post #3 by StefanKarpinski).
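Here's a minimal sketch of the difference, not your actual benchmark: the size `n` and the array names are made up for illustration. It compares an `undef` array with explicitly initialized ones and counts subnormal entries using `issubnormal` from Base.

```julia
# Hypothetical size, chosen only for illustration.
n = 256

# Uninitialized: the contents are whatever bits happened to be in memory,
# which may include subnormals, NaNs, or Infs.
A_undef = Array{Float64}(undef, n, n)

# Explicitly initialized alternatives.
A_rand = rand(Float64, n, n)    # uniform values in [0, 1)
A_zero = zeros(Float64, n, n)   # all zeros

# Count how many entries of an array are subnormal.
count_subnormal(X) = count(issubnormal, X)

println("subnormals in undef array: ", count_subnormal(A_undef))  # varies from run to run
println("subnormals in rand array:  ", count_subnormal(A_rand))
println("subnormals in zeros array: ", count_subnormal(A_zero))
```

The count for the `undef` array can be anything, including zero, depending on what was in that memory before. The point is that filling the inputs with `rand` or `zeros` before timing takes that variable out of the benchmark.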