Is there an algorithm that can do half vectorization, the vech (here) or generates elimination of duplication matrices (here)? I’m trying to recover only the unique elements of a covariance matrix in vector form.

Sure, the algorithm is: write a loop. (Or rather, two nested loops are probably the easiest way to implement vech.)

If compact code is more important than performance you can do `A[tril(trues(A))]`

. Otherwise the loop solution can easily be implemented by a comprehension:

`[A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]`

Actually, `A[tril(trues(A))]`

is significantly *faster* than the comprehension on my machine.

Explicit loops are even better, about 50x faster than this comprehension on my machine:

```
vech0(A) = A[tril(trues(A))]
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]
function vech(A::AbstractMatrix{T}) where T
m = LinAlg.checksquare(A)
v = Vector{T}((m*(m+1))>>1)
k = 0
for j = 1:m, i = j:m
@inbounds v[k += 1] = A[i,j]
end
return v
end
B = rand(1000,1000); B = B + B';
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);
```

gives

```
893.148 μs (16 allocations: 4.06 MiB)
21.599 ms (24 allocations: 5.00 MiB)
385.942 μs (2 allocations: 3.82 MiB)
```

on my machine with Julia 0.6.

I knew loops would be faster than comprehensions, but I’m honestly surprised that the difference is so large. The comprehension version does multiple allocations (comprehensions with a filter, like `i >= j`

, have to grow the array as they go along), loops over the entire array, and is not type-stable according to `@code_warntype`

. Still, it seems like filtered comprehensions could faster…

In Julia 0.7 on my machine, things are somewhat better:

```
vech0: 1.200 ms (9 allocations: 4.06 MiB)
vech1: 5.514 ms (23 allocations: 5.00 MiB)
vech: 377.758 μs (2 allocations: 3.82 MiB)
```

Even so, the factor of 14 difference is larger than I would have initially thought.

For completeness,

```
i = ceil(Int, sqrt(2 * n + 0.25) - 0.5)
j = n - i * (i - 1) ÷ 2
```

gives an enumeration of the lower triangle indices.

Right, that’s what you get for guessing at performance with insufficient insight. It should have been obvious that it would require pretty advanced compiler optimizations to correctly predict the size of the filtered comprehension and avoid growing and other overhead.

Thank you all! These work great!

This code will not run with Julia 1.5.

`vech0`

will run with the modification: `trues(A)`

→ `trues(size(A))`

`vech`

will not run. (I did make the change `LinAlg.checksquare`

→ `LinearAlgebra.checksquare)`

```
ERROR: LoadError: MethodError: no method matching Array{Float64,1}(::Int64)
Closest candidates are:
Array{Float64,1}() where T at boot.jl:425
Array{Float64,1}(::UndefInitializer, ::Int64) where T at boot.jl:406
Array{Float64,1}(::UndefInitializer, ::Int64...) where {T, N} at boot.jl:412
```

What else must be modified?

```
vech0(A) = A[tril(trues(size(A)))]
vech1(A) = [A[i, j] for i = 1:size(A, 1), j = 1:size(A, 2) if i >= j]
function vech(A::AbstractMatrix{T}) where T
m = LinearAlgebra.checksquare(A)
v = Vector{T}((m*(m+1))>>1)
k = 0
for j = 1:m, i = j:m
@inbounds v[k += 1] = A[i,j]
end
return v
end
B = rand(1000,1000); B = B + B';
using BenchmarkTools, Compat
@btime vech0($B); @btime vech1($B); @btime vech($B);
```

it’s erroring about this line, I think it should be

```
Vector{T}(undef, (m*(m+1))>>1)
```

? to initialize

Thank you! That solved it. Below are the times in Julia 1.5 on my laptop, in case someone is interested.

```
537.001 μs (8 allocations: 4.06 MiB)
4.250 ms (20 allocations: 5.00 MiB)
304.750 μs (2 allocations: 3.82 MiB)
```