Half Vectorization

Is there an algorithm that can do half vectorization, the vech (here) or generates elimination of duplication matrices (here)? I’m trying to recover only the unique elements of a covariance matrix in vector form.

Sure, the algorithm is: write a loop. (Or rather, two nested loops are probably the easiest way to implement vech.)

2 Likes

If compact code is more important than performance you can do A[tril(trues(A))]. Otherwise the loop solution can easily be implemented by a comprehension:
[A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]

Actually, A[tril(trues(A))] is significantly faster than the comprehension on my machine.

Explicit loops are even better, about 50x faster than this comprehension on my machine:

vech0(A) = A[tril(trues(A))]
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]

function vech(A::AbstractMatrix{T}) where T
    m = LinAlg.checksquare(A)
    v = Vector{T}((m*(m+1))>>1)
    k = 0
    for j = 1:m, i = j:m
        @inbounds v[k += 1] = A[i,j]
    end
    return v
end

B = rand(1000,1000); B = B + B';
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);

gives

  893.148 μs (16 allocations: 4.06 MiB)
  21.599 ms (24 allocations: 5.00 MiB)
  385.942 μs (2 allocations: 3.82 MiB)

on my machine with Julia 0.6.

I knew loops would be faster than comprehensions, but I’m honestly surprised that the difference is so large. The comprehension version does multiple allocations (comprehensions with a filter, like i >= j, have to grow the array as they go along), loops over the entire array, and is not type-stable according to @code_warntype. Still, it seems like filtered comprehensions could faster…

3 Likes

In Julia 0.7 on my machine, things are somewhat better:

  vech0:   1.200 ms (9 allocations: 4.06 MiB)
  vech1:   5.514 ms (23 allocations: 5.00 MiB)
  vech:    377.758 μs (2 allocations: 3.82 MiB)

Even so, the factor of 14 difference is larger than I would have initially thought.

2 Likes

For completeness,

i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)
j = n - i * (i - 1)  ÷ 2

gives an enumeration of the lower triangle indices.

Right, that’s what you get for guessing at performance with insufficient insight. It should have been obvious that it would require pretty advanced compiler optimizations to correctly predict the size of the filtered comprehension and avoid growing and other overhead.

Thank you all! These work great!

This code will not run with Julia 1.5.
vech0 will run with the modification: trues(A)trues(size(A))
vech will not run. (I did make the change LinAlg.checksquareLinearAlgebra.checksquare)

ERROR: LoadError: MethodError: no method matching Array{Float64,1}(::Int64)
Closest candidates are:
  Array{Float64,1}() where T at boot.jl:425
  Array{Float64,1}(::UndefInitializer, ::Int64) where T at boot.jl:406
  Array{Float64,1}(::UndefInitializer, ::Int64...) where {T, N} at boot.jl:412

What else must be modified?

vech0(A) = A[tril(trues(size(A)))]
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]

function vech(A::AbstractMatrix{T}) where T
    m = LinearAlgebra.checksquare(A)
    v = Vector{T}((m*(m+1))>>1)
    k = 0
    for j = 1:m, i = j:m
        @inbounds v[k += 1] = A[i,j]
    end
    return v
end

B = rand(1000,1000); B = B + B';
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);

it’s erroring about this line, I think it should be

Vector{T}(undef, (m*(m+1))>>1)

? to initialize

2 Likes

Thank you! That solved it. Below are the times in Julia 1.5 on my laptop, in case someone is interested.

537.001 μs (8 allocations: 4.06 MiB)
  4.250 ms (20 allocations: 5.00 MiB)
  304.750 μs (2 allocations: 3.82 MiB)