Is there an algorithm that can do half-vectorization, i.e. vech (here), or generate elimination/duplication matrices (here)? I'm trying to recover only the unique elements of a covariance matrix in vector form.

# Half Vectorization

**stevengj**#2

Sure, the algorithm is: write a loop. (Or rather, two nested loops are probably the easiest way to implement vech.)

If compact code is more important than performance you can do `A[tril(trues(A))]`. Otherwise the loop solution can easily be implemented by a comprehension:

`[A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]`
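For anyone trying this on a current Julia (≥ 1.0), where `trues(A)` no longer accepts an array argument, here is a minimal sketch of both one-liners on a small symmetric matrix (assuming `LinearAlgebra` for `tril`). Both visit the lower triangle in column-major order, so they agree:

```julia
using LinearAlgebra

A = [1 2 3;
     2 4 5;
     3 5 6]                      # small symmetric matrix

# On Julia ≥ 1.0, pass the dimensions to trues instead of the array itself.
v_mask = A[tril(trues(size(A)))]

# The filtered comprehension iterates with i varying fastest (column-major),
# so it produces the same vech order as the logical-mask version.
v_comp = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]

v_mask == v_comp                 # true; both give A₁₁, A₂₁, A₃₁, A₂₂, A₃₂, A₃₃
```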

**stevengj**#4

Actually, `A[tril(trues(A))]` is significantly *faster* than the comprehension on my machine.

Explicit loops are even better, about 50x faster than this comprehension on my machine:

```
vech0(A) = A[tril(trues(A))]    # logical-mask one-liner
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]  # filtered comprehension
function vech(A::AbstractMatrix{T}) where T
    m = LinAlg.checksquare(A)            # throws unless A is square (Julia 0.6 API)
    v = Vector{T}((m*(m+1))>>1)          # uninitialized, length m(m+1)/2
    k = 0
    for j = 1:m, i = j:m                 # column-major over the lower triangle
        @inbounds v[k += 1] = A[i,j]
    end
    return v
end
B = rand(1000,1000); B = B + B';         # symmetric test matrix
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);
```

gives

```
893.148 μs (16 allocations: 4.06 MiB)
21.599 ms (24 allocations: 5.00 MiB)
385.942 μs (2 allocations: 3.82 MiB)
```

on my machine with Julia 0.6.
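For readers on a current Julia (≥ 1.0), the explicit-loop version needs two small updates: `checksquare` now lives in `LinearAlgebra`, and uninitialized vectors are constructed with `undef`. A sketch of the same function under those assumptions:

```julia
using LinearAlgebra

function vech(A::AbstractMatrix{T}) where T
    m = LinearAlgebra.checksquare(A)          # throws unless A is square
    v = Vector{T}(undef, (m * (m + 1)) >> 1)  # uninitialized, length m(m+1)/2
    k = 0
    for j = 1:m, i = j:m                      # column-major over the lower triangle
        @inbounds v[k += 1] = A[i, j]
    end
    return v
end

vech([1 2; 2 3])                              # → [1, 2, 3]
```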

I knew loops would be faster than comprehensions, but I'm honestly surprised that the difference is so large. The comprehension version does multiple allocations (comprehensions with a filter, like `i >= j`, have to grow the array as they go along), loops over the entire array, and is not type-stable according to `@code_warntype`. Still, it seems like filtered comprehensions could be faster…
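One can see why the filtered comprehension has to grow its result: adding a filter makes the iterator's length unknowable in advance, which Julia exposes via the `Base.IteratorSize` trait. A quick check (assuming any Julia ≥ 1.0):

```julia
A = rand(3, 3)

g_plain    = (A[i, j] for i = 1:3, j = 1:3)            # shape known up front
g_filtered = (A[i, j] for i = 1:3, j = 1:3 if i >= j)  # length unknowable

Base.IteratorSize(typeof(g_plain))     # HasShape{2}(): collect can preallocate
Base.IteratorSize(typeof(g_filtered))  # SizeUnknown(): collect must grow the array
```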

**stevengj**#5

In Julia 0.7 on my machine, things are somewhat better:

```
vech0: 1.200 ms (9 allocations: 4.06 MiB)
vech1: 5.514 ms (23 allocations: 5.00 MiB)
vech: 377.758 μs (2 allocations: 3.82 MiB)
```

Even so, the factor of 14 difference is larger than I would have initially thought.

For completeness,

```
i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)
j = n - i * (i - 1) ÷ 2
```

recovers the lower-triangle indices `(i, j)` from the 1-based linear index `n`, enumerating the triangle row by row.
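As a sanity check (any recent Julia), wrapping the formula in a hypothetical helper shows it covers every lower-triangle pair exactly once, row by row:

```julia
# Hypothetical helper wrapping the formula above; n is the 1-based linear index.
function tril_index(n)
    i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)
    j = n - i * (i - 1) ÷ 2
    return i, j
end

m = 4
pairs = [tril_index(n) for n = 1:(m * (m + 1)) ÷ 2]
# Enumerates (1,1), (2,1), (2,2), (3,1), … — each (i, j) with i ≥ j exactly once.
```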

Right, that’s what you get for guessing at performance without sufficient insight. It should have been obvious that correctly predicting the size of a filtered comprehension, so as to avoid the growing and other overhead, would require fairly advanced compiler optimizations.