I benchmarked the following naive matrix multiplication algorithm in both Julia 0.6 and 1.0. Surprisingly, the Julia 1.0 version runs 4 times slower. Doing `@code_llvm` shows very neat output in 0.6 vs. 1.0. Any idea what is going wrong? Why doesn't `@simd` work well in 1.0?

The times are as follows:

```
0.351667 seconds (4 allocations: 15.259 MiB, 0.39% gc time) # Julia 0.6
1.454174 seconds (2 allocations: 7.629 MiB) # Julia 1.0
```

and the code:

```
using Compat

function matgen(n)
    tmp = 1/n/n
    [tmp * (i-j) * (i+j-2) for i = 1:n, j = 1:n]
end

function mul(a, b)
    m, n = size(a)
    q, p = size(b)
    # transpose a for cache-friendliness
    aT = transpose(a)
    out = Array{Float64}(undef, m, p)
    for i = 1:m
        for j = 1:p
            z = 0.0
            @simd for k = 1:n
                z += aT[k,i] * b[k,j]
            end
            out[i,j] = z
        end
    end
    out
end

function main(n)
    n = n÷2 * 2
    a = matgen(n)
    b = matgen(n)
    @time c = mul(a, b)
    v = n÷2 + 1
    println(c[v, v])
end

main(200)
main(1000)
```

Thanks to @mohamed82008, this has been solved. `transpose` is lazy in 1.0, so I needed to copy; otherwise the matrix `aT` shared storage with `a` and the cache-friendliness was destroyed.

For the record, here is the new timing (27% faster than 0.6):

`0.276188 seconds (4 allocations: 15.259 MiB)`