Half Vectorization



Is there an algorithm that can do half vectorization, i.e. the vech operation (here), or generate elimination and duplication matrices (here)? I’m trying to recover only the unique elements of a covariance matrix in vector form.

Vector of upper triangle

Sure, the algorithm is: write a loop. (Or rather, two nested loops are probably the easiest way to implement vech.)
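As a minimal sketch of the nested-loop idea (assuming a square input and column-major order over the lower triangle; `vech_loop` is a hypothetical name, and a preallocated version appears further down the thread):

```julia
# Naive vech: collect the lower triangle column by column.
# For column j, only rows i >= j lie on or below the diagonal.
function vech_loop(A)
    m = size(A, 1)
    v = eltype(A)[]
    for j in 1:m        # outer loop over columns
        for i in j:m    # inner loop over rows on/below the diagonal
            push!(v, A[i, j])
        end
    end
    return v
end
```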


If compact code is more important than performance you can do A[tril(trues(A))]. Otherwise the loop solution can easily be implemented by a comprehension:
[A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]


Actually, A[tril(trues(A))] is significantly faster than the comprehension on my machine.

Explicit loops are even better, about 50x faster than this comprehension on my machine:

vech0(A) = A[tril(trues(A))]
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]

function vech(A::AbstractMatrix{T}) where T
    m = LinAlg.checksquare(A)
    v = Vector{T}((m*(m+1))>>1)
    k = 0
    for j = 1:m, i = j:m
        @inbounds v[k += 1] = A[i,j]
    end
    return v
end

B = rand(1000,1000); B = B + B';
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);


  893.148 μs (16 allocations: 4.06 MiB)
  21.599 ms (24 allocations: 5.00 MiB)
  385.942 μs (2 allocations: 3.82 MiB)

on my machine with Julia 0.6.

I knew loops would be faster than comprehensions, but I’m honestly surprised that the difference is so large. The comprehension version does multiple allocations (comprehensions with a filter, like i >= j, have to grow the array as they go along), loops over the entire array, and is not type-stable according to @code_warntype. Still, it seems like filtered comprehensions could be faster…
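Roughly speaking, a filtered comprehension behaves like a push!-based loop over the full m×n index space, which accounts for the extra allocations (a sketch of the behavior, not the exact lowering; `vech_push` is a hypothetical name):

```julia
# Approximately what [A[i, j] for i = 1:m, j = 1:n if i >= j] does:
# the result size is unknown in advance, so the array grows as it goes.
function vech_push(A)
    m, n = size(A)
    v = Vector{eltype(A)}()       # empty; length can't be predicted up front
    for j in 1:n, i in 1:m        # scans the entire index space
        if i >= j
            push!(v, A[i, j])     # may reallocate as the vector grows
        end
    end
    return v
end
```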


In Julia 0.7 on my machine, things are somewhat better:

  vech0:   1.200 ms (9 allocations: 4.06 MiB)
  vech1:   5.514 ms (23 allocations: 5.00 MiB)
  vech:    377.758 μs (2 allocations: 3.82 MiB)

Even so, the factor of 14 difference is larger than I would have initially thought.


For completeness,

i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)
j = n - i * (i - 1)  ÷ 2

gives an enumeration of the lower triangle indices.
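Wrapped as a function, the formula maps a linear index n to the n-th lower-triangle entry, enumerating row by row (`tril_index` is a hypothetical name for illustration):

```julia
# Map a linear index n to the n-th lower-triangle entry (i, j),
# enumerating row by row: (1,1), (2,1), (2,2), (3,1), ...
function tril_index(n)
    i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)   # row: largest i with i*(i-1)/2 < n
    j = n - i * (i - 1) ÷ 2                   # position within row i
    return (i, j)
end

# [tril_index(n) for n in 1:6] == [(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)]
```

Note that this row-by-row order differs from the column-major order used by the vech implementations above, though both visit exactly the m*(m+1)/2 triangular entries.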


Right, that’s what you get for guessing at performance with insufficient insight. It should have been obvious that it would take fairly advanced compiler optimizations to predict the size of the filtered comprehension correctly and avoid the growing and other overhead.


Thank you all! These work great!