I posted two other variants at the gist above along with benchmark results. The variants are:

Symmetric()
calls tril()
or triu()
and A_mul_B!()
is straightforward;

Symmetric()
does not modify A
, and A_mul_B!()
skip irrelevant indices;

Symmetric()
does not modify A
but builds an index of relevant indices (corresponding to either the upper or lower triangle), and A_mul_B!()
relies on that index.
Variant 1 has the disadvantage of modifying A
. However, it seems to me that one of the reasons for using Symmetric
is to save storage. This allows users to only assemble one triangle (e.g., in FEM or optimization). In addition, the sparse symmetric factorization packages I know only access one triangle. One could imagine requiring the input A
to be triangular when calling Symmetric()
.
Variant 2 has the advantage of not modifying A
and not requiring extra storage. Strangely, multiplying with the lower triangle by scanning row indices backward is slower than multiplying with the upper triangle (and scanning forward). Perhaps this can be improved?! The performance of the product with the upper triangle is basically the same as that of Variant 1.
Variant 3 costs an extra array of length n
(the number of columns of A
) that gives the initial/final index for each column. The performance is basically the same as that of Variant 1.
Comments?