Initializing a number matrix

I was wondering if there was a faster way of initializing a number matrix without needing to traverse the matrix twice as in the following code (where the first zeros command needs to go over the whole matrix to zero it out).

function f1(N)
    n = zeros(Float64,N,N)
    @inbounds for i=1:N; n[i,i] = (i-1); end
    return n
end

giving:

julia> f1(5)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0

Of course, checking if i==j to decide whether to assign i-1 or 0 is slower, but I figured that there should be a way with some clever arithmetic. I’ve been thinking about it for a little while now, but couldn’t find a nice way.

Perhaps your actual usage is more complicated than this, but Diagonal(0.0:4.0) does what you want for your example.

Edit for good measure: Seems like your way is still faster though.

julia> @btime Diagonal($0.0:4.0)
  165.341 ns (0 allocations: 0 bytes)
5×5 Diagonal{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}:
 0.0   ⋅    ⋅    ⋅    ⋅
  ⋅   1.0   ⋅    ⋅    ⋅
  ⋅    ⋅   2.0   ⋅    ⋅
  ⋅    ⋅    ⋅   3.0   ⋅
  ⋅    ⋅    ⋅    ⋅   4.0

julia> @btime f1(5)
  44.006 ns (1 allocation: 336 bytes)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0

I do need the full dense matrix.

This makes 1 pass and is still slower.

function f2(N)
   n = Array{Float64,2}(undef,N,N)
   @inbounds for i in eachindex(IndexCartesian(),n)
      n[i] = i[1] == i[2] ? i[1]-1 : 0.0
   end
   return n
end
julia> @btime f2(5)
  99.835 ns (1 allocation: 336 bytes)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0

To initiate a Matrix{<:Number} one does not need to traverse any element. In Julia 0.7-DEV you can use,

obj = Matrix{T}(undef, m, n)

where T<:Number, m<:Integer, n<:Integer (T is the type, m is the number of rows, and n is the number of columns).
If you need to actually define a Matrix will all entries zero(T) where T<:Number you can use

zeros(T, m, n)

If you do not need to preallocate and just need a Diagonal{<:Number},

Diagonal(::AbstractVector{<:Number})

will do it. Just note that Diagonal will not allow you to set non-diagonal entries. Use it only if the eventual Matrix is guaranteed to be Diagonal as otherwise you will need to Matrix(Diagonal(T)) where T<:AbstractVector{<:Number}.

In you example, I would recommend doing

function magic(n::Integer)
    output = Matrix{Float64}(undef, n, n)
    for (r,c) ∈ enumerate(1:n)
        @inbounds output[r,c] = r == c ? float(c - 1) : float(0)
    end
    return output
end
@btime magic(5)
  46.038 ns (1 allocation: 336 bytes)

Yes, like I mentioned, checking if i==j would be slower. I do believe that there is a way to do this without conditionals though.

Another option, just because I’m procrastinating. This time, init to all zeros, but only loop over the diagonal elements.

function f4(N)
    n = zeros(Float64,N,N)
    for i in (CartesianIndex(i,i) for i=1:N)
      @inbounds n[i] = i[1]-1
    end
    return n
end

For the record, even on a 100x100 matrix, f4 is only 0.4 microseconds faster than your original f1 and uses the same amount of memory. I don’t think you will see any amazing performance increases in such a simple task.

That’s because f4 is the same as f1. It does seem like allocating the memory is the most expensive operation here though.

I suppose the essence of my question is:
Given i and j, what is a function that is equivalent to g(i,j) = i==j ? i : 0, but without the conditional?.

That might not be possible, but I was curious.

Ha sorry, didn’t notice that you were only looping over the diagonal already. I don’t think you can do better. For example, amend the function so that you pass in the already zeroed out matrix. How long does it take to set the diagonal only? It’s going to be nonallocating and my guess is that it is <10ns for that size matrix. The bulk of the function is actually allocating the matrix in the first place.

function f5(N)
    n = Matrix{Float64}(undef,N,N)
    for i in (CartesianIndex(i,i) for i=1:N)
      @inbounds n[i] = i[1]-1
    end
    return n
end

Takes 29ns compared to the 44ns for f4. So the difference tells you how long it takes to pass over every element in the matrix.

You could try using ifelse instead of a branch. For this sort of micro-optimization, though, I don’t think you’ll have too much of a win over two iterations through the array. What size arrays are you using? If they’re huge, you’ll possibly be able to do it faster with some calloc hackery.

I’ve seen a similar thing done in octave before
g(i,j) = i * (i==j)

I’m doing another pass to actually initialize the array to what I want, so there are 3 passes:

  1. zeros, 2. diagonal, 3. full matrix

If there was an extremely cheap way to set the diagonal elements directly in the third pass, I could save (estimated by simply removing 1 & 2 above) around 20% of the runtime, for a 25x25 matrix. ifelse is still too expensive, which is why I was thinking about some clever modular arithmetic.

Did you benchmark this? I am not sure it works this way, you may lose more on branch mispredictions than you gain on memory access. For a 25x25 matrix, everything should be in the L1 cache.

Yes, I did.

The three out-of-order memory access are a decent fraction of the total (compared to a single pass over the matrix) since the actual calculations in step 3 are somewhat simple.

The whole point of the original question was to find a way of doing this without a branch or conditional on i==j. Though, again, there might not be such a way.

Here is a simplified benchmark which reproduces what I am seeing:

const H_u = [rand(25,25) for i=1:5];

function f1(ξ=0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = zeros(typeof(ξ),25,25)
    @inbounds for i=1:25; M[i,i] = (i-1); end
    @inbounds for i = 1:length(M)
        M[i] += ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

function f2(ξ=0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(25,25)
    @inbounds for i = 1:length(M)
        M[i] = ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

using BenchmarkTools

f1() == f2()+Diagonal(0:24)

@benchmark f1()
@benchmark f2()

This is unrelated to your original question, but you might find the @evalpoly macro to be beneficial here.

Try

function f3(ξ, H_u)
    M = f2(ξ, H_u)
    M[1:26:625] += 0:24
    M
end

where

function f1(ξ, H_u)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = zeros(typeof(ξ),25,25)
    @inbounds for i=1:25; M[i,i] = (i-1); end
    @inbounds for i = 1:length(M)
        M[i] += ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

function f2(ξ, H_u)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(25,25)
    @inbounds for i = 1:length(M)
        M[i] = ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

eg

julia> @btime f1($ξ, $H_u);
  2.556 μs (1 allocation: 5.06 KiB)

julia> @btime f2($ξ, $H_u);
  2.052 μs (1 allocation: 5.06 KiB)

julia> @btime f3($ξ, $H_u);
  2.253 μs (5 allocations: 5.78 KiB)

so it is a bit better.

Also, I don’t know if you use the same H_u multiple times, but

S_u = hcat(vec.(H_u)...)

function f4(ξ, S_u)
    M = reshape(S_u * (ξ .^ (1:5)), 25, 25)
    M[1:26:625] += 0:24
    M
end

then

julia> f4(ξ, S_u) ≈ f1(ξ, H_u)
true

julia> @btime f4($ξ, $S_u);
  1.565 μs (8 allocations: 6.00 KiB)
1 Like

That reshape trick is nice! I’m assuming it is faster because OpenBLAS is multi-threaded.

I did know about the macro. It seems to work with matrices as constants, but does not support fusion:

using BenchmarkTools
const A,B,C,D,E = (rand(4,4),rand(4,4),rand(4,4),rand(4,4),rand(4,4));
g1(ξ) = @evalpoly(ξ,A,B,C,D,E)
g2(ξ) = A .+ ξ.*B .+ ξ^2 .* C .+ ξ^3 .* D .+ ξ^4 .* E
@benchmark g1(0.1)
@benchmark g2(0.1)
BenchmarkTools.Trial:
  memory estimate:  1.63 KiB
  allocs estimate:  8
  --------------
  minimum time:     364.313 ns (0.00% GC)
  median time:      387.858 ns (0.00% GC)
  mean time:        427.272 ns (4.85% GC)
  maximum time:     3.043 μs (65.09% GC)
  --------------
  samples:          10000
  evals/sample:     211

BenchmarkTools.Trial:
  memory estimate:  240 bytes
  allocs estimate:  2
  --------------
  minimum time:     192.030 ns (0.00% GC)
  median time:      199.607 ns (0.00% GC)
  mean time:        214.374 ns (3.03% GC)
  maximum time:     1.572 μs (83.73% GC)
  --------------
  samples:          10000
  evals/sample:     656

And this is also something I’ve noticed, should g2 only give 1 allocation?

Note that on 0.7, both operations are much slower, not sure if this is a known problem:

julia> @benchmark g1(0.1)
┌ Warning: `indmin` is deprecated, use `argmin` instead.
│   caller = minimum at trials.jl:112 [inlined]
└ @ Core trials.jl:112
┌ Warning: `indmax` is deprecated, use `argmax` instead.
│   caller = maximum at trials.jl:117 [inlined]
└ @ Core trials.jl:117
BenchmarkTools.Trial:
  memory estimate:  1.63 KiB
  allocs estimate:  8
  --------------
  minimum time:     464.990 ns (0.00% GC)
  median time:      479.753 ns (0.00% GC)
  mean time:        538.288 ns (6.74% GC)
  maximum time:     169.435 μs (99.66% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> @benchmark g2(0.1)
BenchmarkTools.Trial:
  memory estimate:  688 bytes
  allocs estimate:  18
  --------------
  minimum time:     998.583 ns (0.00% GC)
  median time:      1.072 μs (0.00% GC)
  mean time:        1.493 μs (24.84% GC)
  maximum time:     2.842 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     12

Deprecations slow things down, so you’d have to turn them off for fair comparison. I get similar results for g1 (compared to 0.6.2), but do get some performance regression on g2.