To elaborate on @mstewart’s answer above: even for the simpler problem of multiplying two m \times m matrices, for which a naive implementation takes about 10 lines of code (3 nested loops), the simplest implementations (in any language!) are typically orders of magnitude slower than highly optimized algorithms, even though both do the “same” \sim 2m^3 floating-point operations (the theoretical lower-complexity matrix-multiplication algorithms are virtually never used in practice). Optimized matrix-multiply “BLAS” libraries often devote \gtrsim 10,000 lines of code to this problem! It’s not a question of the speed of “for loops”: optimized code completely re-organizes the algorithm into “block” operations that have better memory locality, for example.
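To make the gap concrete, here is a rough sketch (not from the original thread) comparing a naive triple-loop multiply with the optimized BLAS routine that Julia’s mul! calls; the exact timings depend on your machine and matrix size, but the ratio is typically large:

```julia
using LinearAlgebra

# Naive triple loop: the "same" ~2m^3 flops as BLAS, but with poor
# memory locality (no blocking, no cache/SIMD tuning).
function naive_matmul!(C, A, B)
    m, n = size(C)
    K = size(A, 2)
    @inbounds for j in 1:n, i in 1:m
        s = zero(eltype(C))
        for k in 1:K
            s += A[i, k] * B[k, j]
        end
        C[i, j] = s
    end
    return C
end

m = 1000
A, B, C = rand(m, m), rand(m, m), zeros(m, m)

naive_matmul!(C, A, B); mul!(C, A, B)   # warm up (compile) both first

@time naive_matmul!(C, A, B)   # simple nested loops
@time mul!(C, A, B)            # optimized BLAS (OpenBLAS by default)
```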
See also the discussion thread: Julia matrix-multiplication performance
It is very hard to beat (or even come close to) the performance of highly optimized libraries for basic operations on generic dense matrices, even if you understand these performance issues! This is true in any language, even in compiled languages like C. (Where you can do better is if your matrices are very special and you can exploit that structure for improved algorithms.)
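As a hypothetical illustration of that last point (not part of the original answer): if you know a matrix is, say, diagonal, wrapping it in LinearAlgebra’s Diagonal type lets Julia dispatch to an O(m^2) algorithm instead of the generic O(m^3) dense multiply:

```julia
using LinearAlgebra

m = 1000
A = rand(m, m)
d = rand(m)

D_dense = Matrix(Diagonal(d))   # structure thrown away: stored as a dense matrix
D       = Diagonal(d)           # structure preserved in the type

D_dense * A   # generic O(m^3) dense matrix multiply
D * A         # specialized O(m^2) row scaling; same result, far less work
```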
Note, by the way, that none of these type declarations are required for performance. As long as you initialize your variables in a type-stable way, e.g. d = zero(eltype(b)); e = zero(eltype(A)), you can omit all of the type declarations and the code will be just as fast … and more generic, since it will work on matrices of any numeric type. (The idea that type declarations are needed for speed is a common misconception.) See “Argument-Type Declarations” in the Julia manual.
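For instance, a declaration-free function in this style (a generic sketch, not the code from the question) stays fast because the accumulator’s type is inferred from the matrix’s element type:

```julia
# No concrete type declarations anywhere: the accumulator is initialized
# with zero(eltype(A)), so its type matches the element type of A and the
# compiler can specialize the loop for whatever numeric type is passed in.
function sum_of_squares(A)
    s = zero(eltype(A))   # type-stable initialization
    for x in A
        s += x^2
    end
    return s
end

sum_of_squares(rand(100, 100))              # Float64
sum_of_squares(rand(ComplexF32, 100, 100))  # Complex{Float32}
sum_of_squares(big.(rand(4, 4)))            # BigFloat
```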