Strange performance slowdown when using SMatrix

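function instruct2!(state, U, loc)
    # Destructuring iterates U in column-major order, so U = [a b; c d].
    a, c, b, d = U
    step = 1 << (loc - 1)   # distance between the two paired indices
    step_2 = 1 << loc       # size of one block of paired indices
    for j in 0:step_2:size(state, 1)-step
        for i in j+1:j+step
            u1rows!(state, i, i+step, a, b, c, d)
        end
    end
    return state
end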

@inline @inbounds function u1rows!(state::AbstractVector, i::Int, j::Int, a, b, c, d)
    w = state[i]
    v = state[j]
    state[i] = a*w+b*v
    state[j] = c*w+d*v
    state
end

I’m using an SMatrix instead of a Matrix for a small (2x2) matrix. The only operation that touches the matrix is iteration/indexing, namely the destructuring a, b, c, d = U (where U is the matrix); the rest of the code only uses a, b, c, d. Yet the performance is not the same for the two types, and the difference between SMatrix and Matrix grows with the size of state.
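For reference, destructuring a matrix iterates its elements in column-major order, which is why the function above writes a, c, b, d = U. A minimal sketch with a plain Matrix (the values are made up purely for illustration):

```julia
# Hypothetical 2x2 matrix, used only to illustrate the unpacking order.
U = [1 2;
     3 4]

# Iteration walks down the columns: U[1,1], U[2,1], U[1,2], U[2,2].
a, c, b, d = U

@show a b c d
```

With this ordering, a and b end up holding the first row and c and d the second row.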

I tested this on the following Julia version:

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.2.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

julia> @benchmark foreach(k->instruct2!($st, $U, 1), 1:100)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     183.067 ms (0.00% GC)
  median time:      191.323 ms (0.00% GC)
  mean time:        192.796 ms (0.00% GC)
  maximum time:     209.240 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

julia> @benchmark foreach(k->instruct2!($st, $(Matrix(U)), 1), 1:100)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     178.031 ms (0.00% GC)
  median time:      181.558 ms (0.00% GC)
  mean time:        184.131 ms (0.00% GC)
  maximum time:     219.924 ms (0.00% GC)
  --------------
  samples:          28
  evals/sample:     1

This is unexpected, since the main cost should have nothing to do with which matrix type is used…

Please provide enough code so that the benchmarks can be run, preferably with just copy and paste.

Also, there is no need to use a foreach loop for benchmarking; BenchmarkTools does that for you. The time difference seems very small as well.
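As a rough sketch of what benchmarking a single call directly could look like: the kernel below is copied from the question, while the setup (a random 2x2 ComplexF64 SMatrix and a state of length 1<<20) is an assumption about how U and st are defined. The setup=... and evals=1 options are a choice made here so that every sample starts from a fresh copy of the state, since the function mutates it in place.

```julia
using StaticArrays, BenchmarkTools

# Kernel copied from the question.
function instruct2!(state, U, loc)
    a, c, b, d = U
    step = 1 << (loc - 1)
    step_2 = 1 << loc
    for j in 0:step_2:size(state, 1)-step
        for i in j+1:j+step
            u1rows!(state, i, i+step, a, b, c, d)
        end
    end
    return state
end

@inline @inbounds function u1rows!(state::AbstractVector, i::Int, j::Int, a, b, c, d)
    w = state[i]
    v = state[j]
    state[i] = a*w+b*v
    state[j] = c*w+d*v
    state
end

# Assumed setup: random 2x2 static matrix and a random state vector.
U = @SMatrix rand(ComplexF64, 2, 2)
st = rand(ComplexF64, 1 << 20)

# One call per evaluation; BenchmarkTools handles the repetition.
# Each sample works on a fresh copy, so repeated in-place updates do not pile up.
display(@benchmark instruct2!(s, $U, 1) setup=(s = copy($st)) evals=1)
display(@benchmark instruct2!(s, $(Matrix(U)), 1) setup=(s = copy($st)) evals=1)
```

Copying the state in setup is not included in the measured time, so the reported numbers cover only the call itself.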


Sorry, I left out the setup lines:

using StaticArrays, BenchmarkTools

U = @SMatrix rand(ComplexF64, 2, 2)
st = rand(ComplexF64, 1<<20)

The overhead is indeed small, but it does not seem to be constant on my laptop: it scales with the total run time (as the size of state increases). The instruct2! function is actually called inside another for loop, so the difference becomes quite noticeable there, on the order of a few ms.

In fact, if I measure only the cost of a, b, c, d = U, SMatrix is much faster, which makes this look strange to me.
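One way to time the destructuring in isolation is sketched below. The helper unpack_sum is hypothetical (it is not part of the code above); it sums the four entries only so that the unpacking has an observable result and cannot be dropped entirely by the compiler.

```julia
using StaticArrays, BenchmarkTools

# Hypothetical helper: unpack the four entries and combine them.
function unpack_sum(U)
    a, c, b, d = U
    return a + b + c + d
end

# Assumed setup: the same random 2x2 matrix as a static and a regular array.
Us = @SMatrix rand(ComplexF64, 2, 2)
Um = Matrix(Us)

@btime unpack_sum($Us)
@btime unpack_sum($Um)
```

Both calls compute the same value, so any timing difference comes from how the two array types are iterated.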