# Initializing a number matrix

#1

I was wondering if there is a faster way of initializing a number matrix without traversing it twice, as in the following code (where the initial `zeros` call has to go over the whole matrix to zero it out).

```julia
function f1(N)
    n = zeros(Float64, N, N)
    @inbounds for i = 1:N; n[i,i] = (i-1); end
    return n
end
```

giving:

```julia
julia> f1(5)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0
```

Of course, checking if `i==j` to decide whether to assign `i-1` or `0` is slower, but I figured that there should be a way with some clever arithmetic. I’ve been thinking about it for a little while now, but couldn’t find a nice way.

#2

Perhaps your actual usage is more complicated than this, but `Diagonal(0.0:4.0)` does what you want for your example.

Edit for good measure: Seems like your way is still faster though.

```julia
julia> @btime Diagonal($(0.0:4.0))
  165.341 ns (0 allocations: 0 bytes)
5×5 Diagonal{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}:
 0.0   ⋅    ⋅    ⋅    ⋅
  ⋅   1.0   ⋅    ⋅    ⋅
  ⋅    ⋅   2.0   ⋅    ⋅
  ⋅    ⋅    ⋅   3.0   ⋅
  ⋅    ⋅    ⋅    ⋅   4.0

julia> @btime f1(5)
  44.006 ns (1 allocation: 336 bytes)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0
```

#3

I do need the full dense matrix.

#4

This makes 1 pass and is still slower.

```julia
function f2(N)
    n = Array{Float64,2}(undef, N, N)
    @inbounds for i in eachindex(IndexCartesian(), n)
        n[i] = i[1] == i[2] ? i[1]-1 : 0.0
    end
    return n
end
```
```julia
julia> @btime f2(5)
  99.835 ns (1 allocation: 336 bytes)
5×5 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  3.0  0.0
 0.0  0.0  0.0  0.0  4.0
```

#5

To initialize a `Matrix{<:Number}` one does not need to traverse any element. On Julia 0.7-DEV you can use

```julia
obj = Matrix{T}(undef, m, n)
```

where `T<:Number` is the element type, and `m::Integer` and `n::Integer` are the numbers of rows and columns.
If you actually need a `Matrix` with all entries equal to `zero(T)` for some `T<:Number`, you can use

```julia
zeros(T, m, n)
```

If you do not need to preallocate and just need a `Diagonal{<:Number}`,

```julia
Diagonal(::AbstractVector{<:Number})
```

will do it. Just note that `Diagonal` will not allow you to set off-diagonal entries. Use it only if the eventual matrix is guaranteed to be diagonal; otherwise you will need to convert it to a dense matrix with `Matrix(Diagonal(v))` for some `v::AbstractVector{<:Number}`.
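For instance, the conversion could look like this (a minimal sketch; on Julia 0.7+ `Diagonal` lives in the `LinearAlgebra` stdlib):

```julia
using LinearAlgebra

D = Diagonal(0.0:4.0)  # lazy wrapper: stores only the diagonal
M = Matrix(D)          # densify: materializes the off-diagonal zeros
M[1, 2] = 7.0          # a dense Matrix allows setting any entry
```

Attempting `D[1, 2] = 7.0` on the `Diagonal` wrapper itself would throw an error.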

In your example, I would recommend doing

```julia
function magic(n::Integer)
    output = Matrix{Float64}(undef, n, n)
    for (r, c) ∈ enumerate(1:n)
        @inbounds output[r, c] = r == c ? float(c - 1) : float(0)
    end
    return output
end

@btime magic(5)
# 46.038 ns (1 allocation: 336 bytes)
```

#6

Yes, like I mentioned, checking if `i==j` would be slower. I do believe that there is a way to do this without conditionals though.

#7

Another option, just because I’m procrastinating. This time, init to all zeros, but only loop over the diagonal elements.

```julia
function f4(N)
    n = zeros(Float64, N, N)
    for i in (CartesianIndex(i,i) for i = 1:N)
        @inbounds n[i] = i[1]-1
    end
    return n
end
```

#8

For the record, even on a 100x100 matrix, `f4` is only 0.4 microseconds faster than your original `f1` and uses the same amount of memory. I don’t think you will see any amazing performance increases in such a simple task.

#9

That’s because `f4` is the same as `f1`. It does seem like allocating the memory is the most expensive operation here though.

I suppose the essence of my question is:
Given `i` and `j`, what is a function that is equivalent to `g(i,j) = i==j ? i : 0`, but without the conditional?

That might not be possible, but I was curious.

#10

Ha sorry, didn’t notice that you were only looping over the diagonal already. I don’t think you can do better. For example, amend the function so that you pass in the already zeroed out matrix. How long does it take to set the diagonal only? It’s going to be nonallocating and my guess is that it is <10ns for that size matrix. The bulk of the function is actually allocating the matrix in the first place.

```julia
function f5(N)
    n = Matrix{Float64}(undef, N, N)
    for i in (CartesianIndex(i,i) for i = 1:N)
        @inbounds n[i] = i[1]-1
    end
    return n
end
```

Takes 29ns compared to the 44ns for `f4`. So the difference tells you how long it takes to pass over every element in the matrix.
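An equivalent way to write the diagonal-only loop, using `diagind` from the `LinearAlgebra` stdlib instead of hand-built `CartesianIndex`es (a sketch for comparison, not what was benchmarked above):

```julia
using LinearAlgebra

function f6(N)
    n = zeros(Float64, N, N)
    # diagind(n) returns the linear indices of the diagonal: 1, N+2, 2N+3, ...
    for (k, idx) in enumerate(diagind(n))
        @inbounds n[idx] = k - 1
    end
    return n
end
```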

#11

You could try using `ifelse` instead of a branch. For this sort of micro-optimization, though, I don’t think you’ll have too much of a win over two iterations through the array. What size arrays are you using? If they’re huge, you’ll possibly be able to do it faster with some `calloc` hackery.
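A sketch of the `ifelse` idea: unlike `?:`, `ifelse` evaluates both arms and then selects, so there is no branch to mispredict, though whether it actually wins here would need benchmarking:

```julia
function f_ifelse(N)
    n = Matrix{Float64}(undef, N, N)
    @inbounds for j in 1:N, i in 1:N
        # both arms are evaluated; the selection itself is branch-free
        n[i, j] = ifelse(i == j, Float64(i - 1), 0.0)
    end
    return n
end
```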

#12

I’ve seen a similar thing done in Octave before:
`g(i,j) = i * (i==j)`
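The same trick works in Julia, since a `Bool` promotes to `0` or `1` in arithmetic; applied to the `i-1` diagonal from the original example it could look like:

```julia
# branch-free: (i == j) is a Bool that multiplies as 0 or 1
h(i, j) = (i - 1) * (i == j)

A = [Float64(h(i, j)) for i in 1:5, j in 1:5]
```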

#13

I’m doing another pass to actually initialize the array to what I want, so there are 3 passes:

1. zeros, 2. diagonal, 3. full matrix

If there was an extremely cheap way to set the diagonal elements directly in the third pass, I could save (estimated by simply removing 1 & 2 above) around 20% of the runtime, for a 25x25 matrix. `ifelse` is still too expensive, which is why I was thinking about some clever modular arithmetic.

#14

Did you benchmark this? I am not sure it works this way; you may lose more on branch mispredictions than you gain on memory access. For a 25×25 matrix, everything should be in the L1 cache.

#15

Yes, I did.

The three out-of-order memory accesses are a decent fraction of the total (compared to a single pass over the matrix) since the actual calculations in step 3 are somewhat simple.

The whole point of the original question was to find a way of doing this without a branch or conditional on `i==j`. Though, again, there might not be such a way.

Here is a simplified benchmark which reproduces what I am seeing:

```julia
const H_u = [rand(25,25) for i = 1:5];

function f1(ξ = 0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = zeros(typeof(ξ), 25, 25)
    @inbounds for i = 1:25; M[i,i] = (i-1); end
    @inbounds for i = 1:length(M)
        M[i] += ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

function f2(ξ = 0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(undef, 25, 25)  # `Matrix{Float64}(25,25)` on 0.6
    @inbounds for i = 1:length(M)
        M[i] = ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

using BenchmarkTools
using LinearAlgebra  # for `Diagonal` on 0.7+

f1() == f2() + Diagonal(0:24)

@benchmark f1()
@benchmark f2()
```

#16

This is unrelated to your original question, but you might find the `@evalpoly` macro to be beneficial here.
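For reference, `@evalpoly(x, c0, c1, c2, …)` evaluates `c0 + c1*x + c2*x^2 + …` using Horner's rule; a scalar example:

```julia
# Horner evaluation of 1 + 2x + 3x² at x = 2.0: 1 + 2*2 + 3*4 = 17
y = @evalpoly(2.0, 1, 2, 3)
```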

#17

Try

```julia
function f3(ξ, H_u)
    M = f2(ξ, H_u)
    M[1:26:625] += 0:24
    M
end
```

where

```julia
function f1(ξ, H_u)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = zeros(typeof(ξ), 25, 25)
    @inbounds for i = 1:25; M[i,i] = (i-1); end
    @inbounds for i = 1:length(M)
        M[i] += ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end

function f2(ξ, H_u)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(undef, 25, 25)
    @inbounds for i = 1:length(M)
        M[i] = ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end
```

eg

```julia
julia> @btime f1($ξ, $H_u);
  2.556 μs (1 allocation: 5.06 KiB)

julia> @btime f2($ξ, $H_u);
  2.052 μs (1 allocation: 5.06 KiB)

julia> @btime f3($ξ, $H_u);
  2.253 μs (5 allocations: 5.78 KiB)
```

so it is a bit better.

Also, I don’t know if you use the same `H_u` multiple times, but

```julia
S_u = hcat(vec.(H_u)...)

function f4(ξ, S_u)
    M = reshape(S_u * (ξ .^ (1:5)), 25, 25)
    M[1:26:625] += 0:24
    M
end
```

then

```julia
julia> f4(ξ, S_u) ≈ f1(ξ, H_u)
true

julia> @btime f4($ξ, $S_u);
  1.565 μs (8 allocations: 6.00 KiB)
```

#18

That reshape trick is nice! I’m assuming it is faster because OpenBLAS is multi-threaded.
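If the same `H_u` really is reused many times, one further direction would be to preallocate the output and use an in-place `mul!` (Julia 0.7+). This is only a sketch: `S_u`, `buf`, and `poly_sum!` are hypothetical names, and `ξ .^ (1:5)` still allocates a small coefficient vector:

```julia
using LinearAlgebra

S_u = hcat(vec.([rand(25, 25) for _ in 1:5])...)  # 625×5, stand-in for the thread's H_u
buf = Vector{Float64}(undef, 625)                  # reusable output buffer

function poly_sum!(buf, ξ, S_u)
    mul!(buf, S_u, ξ .^ (1:5))   # in-place mat-vec: no output allocation
    M = reshape(buf, 25, 25)     # reshape shares memory with buf
    M[1:26:625] .+= 0:24         # add the 0:24 diagonal in place
    return M
end
```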

#19

I did not know about the macro. It seems to work with matrices as coefficients, but does not support fusion:

```julia
using BenchmarkTools
const A, B, C, D, E = (rand(4,4), rand(4,4), rand(4,4), rand(4,4), rand(4,4));
g1(ξ) = @evalpoly(ξ, A, B, C, D, E)
g2(ξ) = A .+ ξ.*B .+ ξ^2 .* C .+ ξ^3 .* D .+ ξ^4 .* E
@benchmark g1(0.1)
@benchmark g2(0.1)
```
```julia
BenchmarkTools.Trial:
  memory estimate:  1.63 KiB
  allocs estimate:  8
  --------------
  minimum time:     364.313 ns (0.00% GC)
  median time:      387.858 ns (0.00% GC)
  mean time:        427.272 ns (4.85% GC)
  maximum time:     3.043 μs (65.09% GC)
  --------------
  samples:          10000
  evals/sample:     211

BenchmarkTools.Trial:
  memory estimate:  240 bytes
  allocs estimate:  2
  --------------
  minimum time:     192.030 ns (0.00% GC)
  median time:      199.607 ns (0.00% GC)
  mean time:        214.374 ns (3.03% GC)
  maximum time:     1.572 μs (83.73% GC)
  --------------
  samples:          10000
  evals/sample:     656
```

And this is also something I’ve noticed: shouldn’t `g2` give only 1 allocation?

Note that on 0.7 both operations are much slower; I’m not sure whether this is a known problem:

```julia
julia> @benchmark g1(0.1)
┌ Warning: `indmin` is deprecated, use `argmin` instead.
│   caller = minimum at trials.jl:112 [inlined]
└ @ Core trials.jl:112
┌ Warning: `indmax` is deprecated, use `argmax` instead.
│   caller = maximum at trials.jl:117 [inlined]
└ @ Core trials.jl:117
BenchmarkTools.Trial:
  memory estimate:  1.63 KiB
  allocs estimate:  8
  --------------
  minimum time:     464.990 ns (0.00% GC)
  median time:      479.753 ns (0.00% GC)
  mean time:        538.288 ns (6.74% GC)
  maximum time:     169.435 μs (99.66% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> @benchmark g2(0.1)
BenchmarkTools.Trial:
  memory estimate:  688 bytes
  allocs estimate:  18
  --------------
  minimum time:     998.583 ns (0.00% GC)
  median time:      1.072 μs (0.00% GC)
  mean time:        1.493 μs (24.84% GC)
  maximum time:     2.842 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     12
```

#20

Deprecations slow things down, so you’d have to turn them off for a fair comparison. I get similar results for `g1` (compared to 0.6.2), but do see some performance regression on `g2`.