I was wondering if there was a faster way of initializing a numeric matrix without needing to traverse it twice, as in the following code (where the initial zeros call has to pass over the whole matrix to zero it out):
function f1(N)
    n = zeros(Float64, N, N)
    @inbounds for i = 1:N; n[i,i] = i - 1; end
    return n
end
Of course, checking if i==j to decide whether to assign i-1 or 0 is slower, but I figured that there should be a way with some clever arithmetic. I’ve been thinking about it for a little while now, but couldn’t find a nice way.
To initialize a Matrix{<:Number} you do not need to traverse any of its elements. In Julia 0.7-DEV you can use
obj = Matrix{T}(undef, m, n)
where T<:Number and m and n are Integers (T is the element type, m the number of rows, and n the number of columns).
If you actually need a Matrix with all entries equal to zero(T), where T<:Number, you can use
zeros(T, m, n)
If you do not need to preallocate and just need a Diagonal{<:Number},
Diagonal(::AbstractVector{<:Number})
will do it. Just note that Diagonal will not allow you to set non-diagonal entries. Use it only if the eventual Matrix is guaranteed to be diagonal; otherwise you will need to densify it with Matrix(Diagonal(v)), where v isa AbstractVector{<:Number}.
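As a concrete illustration of the Diagonal route (a small sketch; the size n is arbitrary here):

```julia
using LinearAlgebra  # Diagonal lives here on 0.7+; it is in Base on 0.6

n = 5
D = Diagonal(0.0:n-1)   # lazy diagonal matrix, stores only the n diagonal entries
M = Matrix(D)           # densify to a full Matrix{Float64} when you need one
# D[1, 2] = 1.0 would throw: a Diagonal cannot store off-diagonal values
```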
In your example, I would recommend doing
function magic(n::Integer)
    output = Matrix{Float64}(undef, n, n)
    for c ∈ 1:n, r ∈ 1:n
        @inbounds output[r, c] = r == c ? float(c - 1) : float(0)
    end
    return output
end
@btime magic(5)
46.038 ns (1 allocation: 336 bytes)
For the record, even on a 100x100 matrix, f4 is only 0.4 microseconds faster than your original f1 and uses the same amount of memory. I don’t think you will see any amazing performance increases in such a simple task.
Ha sorry, I didn’t notice that you were already looping only over the diagonal. I don’t think you can do better. As an experiment, amend the function so that you pass in an already zeroed-out matrix: how long does it take to set just the diagonal? That operation is non-allocating, and my guess is that it takes <10 ns for a matrix of that size. The bulk of the function is actually allocating the matrix in the first place.
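A minimal sketch of that experiment (the function name is mine): write the diagonal into a matrix that is already allocated and zeroed, and benchmark only that step, e.g. with the setup keyword of BenchmarkTools so the allocation stays out of the timing:

```julia
# Write (i-1) onto the diagonal of an existing matrix; allocates nothing.
function setdiag!(M)
    @inbounds for i = 1:size(M, 1)
        M[i, i] = i - 1
    end
    return M
end

# using BenchmarkTools
# @btime setdiag!(M) setup = (M = zeros(5, 5))  # times only the diagonal writes
```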
function f5(N)
    n = Matrix{Float64}(undef, N, N)
    for i in (CartesianIndex(i, i) for i = 1:N)
        @inbounds n[i] = i[1] - 1
    end
    return n
end
Takes 29ns compared to the 44ns for f4. So the difference tells you how long it takes to pass over every element in the matrix.
You could try using ifelse instead of a branch. For this sort of micro-optimization, though, I don’t think you’ll have too much of a win over two iterations through the array. What size arrays are you using? If they’re huge, you’ll possibly be able to do it faster with some calloc hackery.
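In case it helps, an ifelse version of the fill loop would look like this (a sketch; unlike the ?: ternary, ifelse evaluates both of its value arguments, so the compiler can emit a branch-free select):

```julia
function f_ifelse(N)
    M = Matrix{Float64}(undef, N, N)
    @inbounds for j = 1:N, i = 1:N
        # Both alternatives are evaluated; the choice compiles to a select, not a jump.
        M[i, j] = ifelse(i == j, Float64(i - 1), 0.0)
    end
    return M
end
```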
I’m doing another pass to actually initialize the array to what I want, so there are 3 passes:
1. zeros, 2. diagonal, 3. full matrix
If there were an extremely cheap way to set the diagonal elements directly in the third pass, I could save around 20% of the runtime for a 25x25 matrix (estimated by simply removing passes 1 and 2 above). ifelse is still too expensive, which is why I was thinking about some clever modular arithmetic.
Did you benchmark this? I am not sure it works this way, you may lose more on branch mispredictions than you gain on memory access. For a 25x25 matrix, everything should be in the L1 cache.
The three out-of-order memory accesses are a decent fraction of the total (compared to a single pass over the matrix), since the actual calculations in step 3 are fairly simple.
The whole point of the original question was to find a way of doing this without a branch or conditional on i==j. Though, again, there might not be such a way.
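One branch-free candidate along those lines (a sketch I have not benchmarked, and the function name is mine): multiply by the comparison itself. In Julia a Bool promotes to 0 or 1 in arithmetic, so the diagonal term vanishes off the diagonal without any conditional:

```julia
function f_branchless(N)
    M = Matrix{Float64}(undef, N, N)
    @inbounds for j = 1:N, i = 1:N
        # (i == j) promotes to 0 or 1, so this writes i-1 on the
        # diagonal and 0 everywhere else, with no branch.
        M[i, j] = (i == j) * (i - 1)
    end
    return M
end
```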
Here is a simplified benchmark which reproduces what I am seeing:
const H_u = [rand(25,25) for i=1:5];
function f1(ξ = 0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = zeros(typeof(ξ), 25, 25)
    @inbounds for i = 1:25; M[i,i] = i - 1; end
    @inbounds for i = 1:length(M)
        M[i] += ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end
function f2(ξ = 0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(25, 25)
    @inbounds for i = 1:length(M)
        M[i] = ξ*H_u[1][i] + ξ²*H_u[2][i] + ξ³*H_u[3][i] + ξ⁴*H_u[4][i] + ξ⁵*H_u[5][i]
    end
    return M
end
using BenchmarkTools
f1() == f2()+Diagonal(0:24)
@benchmark f1()
@benchmark f2()
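For completeness, here is a single-pass variant that folds the diagonal term into the main loop using Bool arithmetic instead of a branch (a sketch only: f3 is my name, and I have not verified that it actually beats f1 or f2):

```julia
const H_u = [rand(25, 25) for i = 1:5]  # same setup as the benchmark above

function f3(ξ = 0.22)
    ξ² = ξ*ξ; ξ³ = ξ²*ξ; ξ⁴ = ξ³*ξ; ξ⁵ = ξ⁴*ξ
    M = Matrix{Float64}(undef, 25, 25)
    @inbounds for j = 1:25, i = 1:25
        k = i + 25*(j - 1)  # linear index into the 25×25 matrices
        # (i == j)*(i - 1) contributes the diagonal term without a branch.
        M[i, j] = (i == j)*(i - 1) +
                  ξ*H_u[1][k] + ξ²*H_u[2][k] + ξ³*H_u[3][k] +
                  ξ⁴*H_u[4][k] + ξ⁵*H_u[5][k]
    end
    return M
end
```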
Deprecations slow things down, so you’d have to turn them off for a fair comparison. I get similar results for f1 (compared to 0.6.2), but do see some performance regression for f2.