# Naive matrix multiplication is super slow in Julia 1.0?

I benchmarked the following naive matrix multiplication algorithm in both Julia 0.6 and 1.0. Surprisingly, the Julia 1.0 version runs 4 times slower. `@code_llvm` also shows much cleaner output in 0.6 than in 1.0. Any idea what is going wrong? Why doesn't `@simd` work well in 1.0?

The times are as follows:

```
0.351667 seconds (4 allocations: 15.259 MiB, 0.39% gc time) # Julia 0.6
1.454174 seconds (2 allocations: 7.629 MiB)                 # Julia 1.0
```

and the code:

```julia
using Compat

function matgen(n)
    tmp = 1/n/n
    [tmp * (i-j) * (i+j-2) for i = 1:n, j = 1:n]
end

function mul(a, b)
    m,n = size(a)
    q,p = size(b)

    # transpose a for cache-friendliness
    aT  = transpose(a)
    out = Array{Float64}(undef, m, p)
    for i = 1:m
        for j = 1:p
            z = 0.0
            @simd for k = 1:n
                z += aT[k,i]*b[k,j]
            end
            out[i,j] = z
        end
    end
    out
end

function main(n)
    n = n÷2 * 2
    a = matgen(n)
    b = matgen(n)
    @time c = mul(a, b)
    v = n÷2 + 1
    println(c[v, v])
end

main(200)
main(1000)
```

Thanks to @mohamed82008, this has been solved. `transpose` is lazy in 1.0, so I needed to copy; otherwise the matrix `aT` shared memory with `a` and the cache friendliness was destroyed.

For the record, here is the new timing (27% faster than 0.6):
`0.276188 seconds (4 allocations: 15.259 MiB)`

Try `@inbounds` and `copy(transpose(a))`. `transpose` is lazy in v1.0, so you are iterating the parent matrix row-wise unless you explicitly copy. And `@inbounds` is important to trigger `@simd`.
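Applying both suggestions to the original `mul`, the fix might look like this (a sketch; `copy` materializes the lazy transpose into a plain column-major `Matrix`):

```julia
function mul_fixed(a, b)
    m, n = size(a)
    q, p = size(b)
    # copy(transpose(a)) allocates a real Matrix, so aT[k, i] with k
    # varying walks down a column in memory (cache-friendly in
    # column-major Julia)
    aT  = copy(transpose(a))
    out = Array{Float64}(undef, m, p)
    for i = 1:m, j = 1:p
        z = 0.0
        # @inbounds removes bounds checks so @simd can vectorize the loop
        @inbounds @simd for k = 1:n
            z += aT[k, i] * b[k, j]
        end
        out[i, j] = z
    end
    out
end
```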

6 Likes

Oops, yes you’re right, it’s the `transpose` being lazy. I totally missed this. Thank you.

I don’t see the solution checkbox to accept that reply

Can you explain this further, please, or tell me where to find details about why we need to use `copy`? I don’t find it very intuitive.

Do you mean that every time `z += aT[k,i]*b[k,j]` is executed, the matrix is transposed unless we use `copy()`?

No, `aT[k, i]` just calls `a[i, k]` under the hood (`Transpose` and `Adjoint` overload `getindex`). In this case, actually materializing the transpose by allocating a new matrix makes the layout in memory better suited to this algorithm (better cache locality). That is not the case for all algorithms, though.
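To see this concretely, here is a small sketch of the index forwarding versus materializing the copy:

```julia
a = [1.0 2.0; 3.0 4.0]

aT = transpose(a)            # lazy wrapper: no data is moved
@assert aT[2, 1] == a[1, 2]  # getindex just swaps the indices

aTc = copy(transpose(a))     # materialize: a new column-major Matrix
@assert aTc isa Matrix{Float64}
@assert aTc == aT            # same values, different memory layout
```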

Could I get a little bit more explanation to what `lazy` is (i.e. what it means for transpose to be lazy?)

If we reversed the order of the `i` and `k` loops, would we no longer need the transpose?

The bigger benefit is that SIMD requires you to either load a block of contiguous elements, or broadcast a single element across a vector.
That means, if a SIMD vector has 4 elements, you either need to load 4 contiguous elements, or have something to do with a vector that contains 4 copies of the same element.
For an optimized matrix multiplication algorithm, one of the tricks is to do the latter. Another trick is to minimize how much time you’re actually spending loading and storing elements from the matrices. This is where the naive inner loop is especially bad: for each SIMD vector you load, you use it for only a single operation before replacing it with another!
In an AVX2 kernel, you can get 12 FMA instructions (48 multiplications and additions) per 2 contiguous loads and 6 broadcasts. With AVX-512, you can get 28 FMA instructions (224 multiplications and additions) per 2 contiguous loads and 14 broadcasts.

One major benefit of this is that you move through memory much more slowly than you would without transposing, and hence get much better cache locality. That gives the memory time to catch up to the CPU.

Anyway, in this example, actually transposing the matrix lets you load two contiguous sets at a time to get SIMD.
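As an aside, reordering the loops as asked about above is indeed another way to get contiguous loads without copying. This sketch (not from the thread) broadcasts `b[k, j]` across the inner loop while `a[:, k]` and `out[:, j]` are read and written with unit stride, exactly the two SIMD access patterns described above:

```julia
# jki loop order: the innermost loop runs over the first index i,
# which is contiguous in column-major Julia. b[k, j] is a single
# scalar broadcast across the whole inner loop.
function mul_jki(a, b)
    m, n = size(a)
    _, p = size(b)
    out = zeros(m, p)
    for j = 1:p
        for k = 1:n
            bkj = b[k, j]
            @inbounds @simd for i = 1:m
                out[i, j] += a[i, k] * bkj
            end
        end
    end
    out
end
```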

> Could I get a little more explanation of what `lazy` means (i.e. what it means for transpose to be lazy)?

Lazy means that it doesn’t actually compute anything until you need it. The transpose/adjoint isn’t creating a new matrix; it’s just pointing to your old matrix and translating `getindex(A, i, j)` into `getindex(A, j, i)`. Some functions, like `mul!` and `*`, also dispatch differently based on whether the matrices are transposed or not. That’s because
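A small illustration of this laziness: the wrapper allocates no new matrix, it just points at the parent, so changes to the parent show through.

```julia
using LinearAlgebra

a  = [1.0 2.0; 3.0 4.0]
aT = transpose(a)

@assert aT isa Transpose   # a wrapper type, not a Matrix
@assert parent(aT) === a   # it points at the original data
a[1, 2] = 99.0
@assert aT[2, 1] == 99.0   # mutations of the parent are visible
```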

```julia
julia> using LinearAlgebra

help?> BLAS.gemm!
  gemm!(tA, tB, alpha, A, B, beta, C)

  Update C as alpha*A*B + beta*C or the other three variants according to tA and tB. Return the updated C.
```

instead of actually transposing, those functions can just redirect to `gemm!` and pass the correct values for `tA` and `tB`.
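For instance, the non-mutating `BLAS.gemm` takes the same `tA`/`tB` flags, so `transpose(a) * b` can be computed without ever materializing the transpose (a quick sketch):

```julia
using LinearAlgebra

a = rand(4, 4)
b = rand(4, 4)

# Ask BLAS to treat `a` as transposed via the tA flag,
# rather than building transpose(a) in memory first.
c = BLAS.gemm('T', 'N', 1.0, a, b)
@assert c ≈ transpose(a) * b
```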

2 Likes

I would think that when modeling the naive case, you might be better off not including tricks like `@simd` and attempting to make the matrix cache-friendly. That is not really naive, and as others have pointed out, it is also not really expert.

In general I wouldn’t expect any code you write for matrix multiplication to be competitive with Julia, which should be smart enough to call the best BLAS routine for the case, whether `a*b` or `a'*b`. For that matter, so does Matlab.

Perhaps you don’t really care about matrix multiplication. If your test was really about how well Julia can produce C code, then that’s a tricky one, because Julia wants to be higher level than C. I guess you really want a flag like `@nolazy`, `@notricks`, `@thanksbutjustgivemetheccode`?

As I pointed out here, Julia can do a really good job “producing C code” for matrix multiplication!

You’ve gone too far; nowhere did I say that I’m inventing my own replacement for Julia’s matrix multiplication. What I do is run simple benchmarks across many languages that implement the same algorithm, and I try to keep the implementation the same across languages. There is no real use of the code above. `@simd` was left in just for comparison between 0.6 and 1.0, since 0.6 requires it explicitly.

BTW, Julia 1.0 really shines in my benchmarks. In the `mul` benchmark above, it runs 1.8x faster than C and matches the best timings of `ifort`. Thanks a lot, Julia devs and community, you’ve done a great job.

1 Like

I see, this isn’t meant to be matrix multiplication per se, but rather an operation testing for loops, arrays, and multiplication. Fair enough, but then I would want to stick to truly naive implementations.

I suppose this is an issue for all benchmarks, but there is still some question of what counts as naive. Does taking the transpose count as naive? It intends to take advantage of extra assumptions about caches (which were not correct in the original listing). Do you still transpose for all languages, including those with row-major ordering? If the loop order changes for each language-specific implementation, is that still naive, and still the “same implementation across languages”? And if the language also offers a preferred implementation like `a*b`, isn’t it just as fair to use that as to reorder for cache?

I don’t mean to cut you or your benchmark down, but just to point out issues with any benchmark, and questions about what the rules should be.