I expected this to be a trivial optimization, but I am struggling to write vectorized code that doesn’t allocate memory. Consider the following code:
```julia
function gram_schmidt1!{T <: Real}(U::Matrix{T})
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        @inbounds for j = 1:k-1
            uk .-= (U[:,j] ⋅ uk) .* U[:,j]
        end
        uk ./= norm(uk)
    end
end
```
Then I get
```julia
using BenchmarkTools
N = 100
A = randn(N, N)
@btime gram_schmidt1!($A)
  6.859 ms (19900 allocations: 8.62 MiB)
```
which results in a surprisingly high number of allocations.
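(For context: in Julia, indexing with a range like `U[:,j]` copies the column into a fresh `Vector` on every loop iteration, and the broadcast on the right-hand side allocates a temporary as well. A minimal illustration of the copy-vs-view distinction:)

```julia
A = [1.0 2.0;
     3.0 4.0]

col_copy = A[:, 1]        # range indexing copies: a brand-new Vector
col_view = view(A, :, 1)  # a view aliases A's memory instead

col_copy[1] = 0.0         # mutates only the copy, not A
col_view[2] = 0.0         # writes through to A[2, 1]
```

After this, `A[1, 1]` is still `1.0` (the copy absorbed the first write) while `A[2, 1]` is `0.0` (the view passed the second write through).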
If I instead use a de-vectorized version,
```julia
function gram_schmidt2!{T <: Real}(U::Matrix{T})
    n, m = size(U)
    @assert n ≥ 1
    @inbounds for k = 1:m
        for j = 1:k-1
            dotjk = U[1,j] * U[1,k]
            @simd for i = 2:n
                dotjk += U[i,j] * U[i,k]
            end
            @simd for i = 1:n
                U[i,k] -= dotjk * U[i,j]
            end
        end
        uknorm = abs2(U[1,k])
        @simd for i = 2:n
            uknorm += abs2(U[i,k])
        end
        uknorm = √(uknorm)
        @simd for i = 1:n
            U[i,k] /= uknorm
        end
    end
end
```
I played around with benchmarking and memory allocation analysis, and even the dot product seems to allocate memory. Am I doing something fundamentally wrong here?
The U[:,j] should be a view instead, but other than that your original code is ok. Doing that on 0.7 results in much more acceptable allocations (~200kB). I haven’t looked at the remaining allocations, but I suspect it’s just that the compiler is currently not capable of stack allocating wrappers of objects containing heap references (such as views) - at least across an inlining boundary. If you’re up for it, you may want to add this benchmark (the view version) to https://github.com/JuliaCI/BaseBenchmarks.jl to make sure it gets tracked as the compiler improves.
OK, great! I was worried that I am doing something wrong. I’m not yet ready to move my research code to 0.7/1.0, and so might stick with the de-vectorized version for now.
Just to clarify, with making the U[:,j] a view, do you mean
```julia
function gram_schmidt1!{T <: Real}(U::Matrix{T})
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        @inbounds for j = 1:k-1
            uj = view(U, :, j)
            uk .-= (uj ⋅ uk) .* uj
        end
        uk ./= norm(uk)
    end
end
```
This one actually significantly reduces memory allocations, even on 0.6.3:
```julia
using BenchmarkTools
N = 100
A = randn(N, N)
@btime gram_schmidt1!($A)
  6.069 ms (5050 allocations: 236.72 KiB)
```
Nonetheless, it isn’t anywhere close to the performance of the de-vectorized version.
Either way, I’d be happy to add it to BaseBenchmarks. Which benchmark group would be the adequate one?
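(One way to narrow the gap without fully de-vectorizing, sketched here as an untested variant not from the thread: replace the broadcast update with `dot` and `axpy!` from the standard library, which operate in place on views and avoid the broadcast temporaries. `gram_schmidt3!` is a hypothetical name:)

```julia
using LinearAlgebra  # on 0.6 these functions live in Base.LinAlg

function gram_schmidt3!(U::Matrix{T}) where {T <: Real}
    m = size(U, 2)
    @inbounds for k = 1:m
        uk = view(U, :, k)
        for j = 1:k-1
            uj = view(U, :, j)
            # uk .-= (uj ⋅ uk) .* uj, done in place without a temporary
            axpy!(-dot(uj, uk), uj, uk)
        end
        uk ./= norm(uk)
    end
    return U
end
```

Because `uk` is updated in place before each subsequent projection, this is the (numerically stabler) modified Gram–Schmidt variant, just like the de-vectorized loop above.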
There exist “unsafe views” defined by packages that are guaranteed not to allocate but are unsafe in the sense that they don’t keep the parent array alive.
```julia
julia> using LinearAlgebra, UnsafeArrays, BenchmarkTools

julia> Base.mightalias(A::UnsafeArray, B::UnsafeArray) = false

julia> function gram_schmidt1!(U::Matrix{T}) where {T <: Real}
           m = size(U, 2)
           @inbounds for k = 1:m
               uk = uview(U, :, k)
               @inbounds for j = 1:k-1
                   uj = uview(U, :, j)
                   uk .-= (uj ⋅ uk) .* uj
               end
               uk ./= norm(uk)
           end
       end
gram_schmidt1! (generic function with 1 method)

julia> N = 100;

julia> A = randn(N, N);

julia> @btime gram_schmidt1!($A);
  294.633 μs (0 allocations: 0 bytes)
```
Looks good. Since in my case the views are guaranteed to be safe, I might try this instead.
Given that a view is only safe if it keeps the parent object alive, does this mean that we can’t expect to get (safe) views that don’t allocate memory?
It is worth pointing out that you can protect the arrays. UnsafeArrays provides a convenient macro; from their Readme:
```julia
using UnsafeArrays

@uviews A begin
    colnorms!(dest, A)
end
```
@uviews protects the original array A from GC, so the above is safe as long as the original array is not reallocated (via resize!, etc.) while the scope of @uviews is active.
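(This GC protection is the same idea as Base's `GC.@preserve` (available since Julia 0.7): a pointer-backed wrapper does not root its parent array, so the parent must be kept alive explicitly while the wrapper is in use. A Base-only sketch, with a hypothetical `col_sum_unsafe` helper:)

```julia
# Base-only illustration of the hazard @uviews guards against: an
# unsafe_wrap'ed array holds a raw pointer and does NOT keep the parent
# alive, so the parent must be explicitly preserved for the duration.
function col_sum_unsafe(A::Matrix{Float64}, j::Int)
    GC.@preserve A begin
        n = size(A, 1)
        p = pointer(A, (j - 1) * n + 1)  # pointer to the start of column j
        col = unsafe_wrap(Array, p, n)   # wraps the memory; no GC root to A
        return sum(col)
    end
end

col_sum_unsafe([1.0 2.0; 3.0 4.0], 2)  # sums column 2: 2.0 + 4.0 == 6.0
```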
The macro also replaces regular views with uviews:
```julia
julia> function colnorms!(dest::AbstractVector, A::AbstractMatrix)
           for i in axes(A, 2)
               dest[i] = norm(view(A, :, i))
           end
           dest
       end
```
```julia
julia> A = rand(50, 100000);

julia> dest = similar(A, size(A, 2));

julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
  memory estimate:  4.58 MiB
  allocs estimate:  100000
  --------------
  minimum time:     4.436 ms (0.00% GC)
  median time:      4.527 ms (0.00% GC)
  mean time:        4.848 ms (6.03% GC)
  maximum time:     7.441 ms (20.70% GC)
  --------------
  samples:          1030
  evals/sample:     1
```
```julia
julia> function colnormsu!(dest, A)
           @uviews A begin
               colnorms!(dest, A)
           end
       end
colnormsu! (generic function with 1 method)

julia> @benchmark colnormsu!($dest, $A)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.037 ms (0.00% GC)
  median time:      4.050 ms (0.00% GC)
  mean time:        4.056 ms (0.00% GC)
  maximum time:     4.326 ms (0.00% GC)
  --------------
  samples:          1230
  evals/sample:     1
```
If @uviews protects the original array, is there any reason to ever not use @uviews (as opposed to @views)?
Sorry for joining in so late. Yes, I recommend always using @uviews. The uview() function exists only because I came up with the idea for the macro later on during the development of UnsafeArrays.jl. @uviews will protect the original array from GC, both with Julia v0.6 and v0.7.
The view might survive the scope (e.g. you put it in a struct) and the parent array can get reallocated (resize!).
Yes, please keep that in mind - even @uviews can’t protect you against that.
Since @uviews uses the same name for the unsafe view as for the original array, it makes it (intentionally) hard to access (and resize, etc.) the original array within the scope of the macro.
But it’s true, the original array may be accessible in some other way. UnsafeArrays.jl should only be used if you can keep access to the array in question fully under control.
And @kristoffer.carlsson is right - never, ever let the unsafe array escape the scope of @uviews!