we propose a new flexible storage format for sparse matrices and compare its performance with the compressed row storage format using abstract benchmarks based on common characteristics of finite element problems

Maybe Julia should adopt this new CRAC format from last year:

In contrast, this paper focuses on the application of synchronization primitives in order to derive data-race free data structures and algorithms and compares their respective performance. Specifically, the usage of atomics is emphasized
[…]
In this paper we introduce a new flexible sparse matrix storage format designated compressed row aligned columns (CRAC). The flexibility of the format stems from its storage of blocks of various sizes, making it independent of the problem type. Consequently, this format integrates well in finite element programs, where the problem to be solved is defined at runtime (dynamically).

Remark 1.1 The focus of this paper is not the improvement in performance of a specific finite element problem or the definition of a correspondingly matching sparse matrix format, but rather the introduction and a general investigation of the performance of parallel assembly algorithms with respect to the parallelization strategy and the sparse matrix format. Consequently, the computation of each element stiffness matrix is not the main focus of this article and dummy elements are employed.

using lock-free methods does not guarantee higher performance.
Additionally, the CRAC format now performs better than the CSR format for p > 3.

Unlike the above, what follows is just my speculation (moved from another thread, since it was maybe a bit off-topic there):

@elrod [are you only working on dense?]
This (likely) “sparsity 133/8 = 16.625” got me thinking: how would you design hardware (or software; maybe we can do better for Julia) for sparse matrices?

I’m guessing matrices are split up into 4x4 = 16-sized blocks/submatrices (would 1x16 be better for one matrix and 16x1 for the other you are multiplying with, or maybe 2x8, etc.?).

If your matrix is very sparse, at most one value will be nonzero in each of those submatrices. You could encode the emptiness of the submatrices in 1 bit: 0 for empty, 1 for non-empty (one nonzero, or hopefully not more). With such a dense bit array encoding the structure, you can bit-multiply to see where you actually need to calculate the actual result later, in a separate step. [This would also work for larger blocks; I’m not wedded to 16 per block.]
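To make that bit-multiply idea concrete, here is a minimal Python/NumPy sketch (my own illustration, not from the quoted paper; 4x4 blocks assumed, and the function names are made up):

```python
import numpy as np

BLOCK = 4  # assumed 4x4 = 16-element blocks, as in the text

def occupancy_mask(A, block=BLOCK):
    """One bit per block: True if the block contains any nonzero."""
    nbr, nbc = A.shape[0] // block, A.shape[1] // block
    mask = np.zeros((nbr, nbc), dtype=bool)
    for i in range(nbr):
        for j in range(nbc):
            mask[i, j] = np.any(A[i*block:(i+1)*block, j*block:(j+1)*block])
    return mask

def needed_output_blocks(mask_a, mask_b):
    """The 'bit-multiply': the block product (i,k)x(k,j) only matters when
    both operand blocks are non-empty, so a matmul of the occupancy masks
    shows which output blocks can receive any contribution at all."""
    return (mask_a.astype(np.uint32) @ mask_b.astype(np.uint32)) > 0

# very sparse 8x8 demo: one nonzero in each operand
A = np.zeros((8, 8)); A[0, 5] = 1.0
B = np.zeros((8, 8)); B[5, 2] = 2.0
out = needed_output_blocks(occupancy_mask(A), occupancy_mask(B))
# only output block (0, 0) needs the expensive second step
```

The expensive floating-point work then runs only over the blocks flagged in `out`, in a separate pass.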

You also need to keep track of your actual non-zero values: for e.g. Float8 in a 4x4 box, you need 8 bits, plus 2 bits x 2 bits for its position = 4 bits. If the box is empty you could store 0.0 in the Float8 part, and use or ignore the position. And maybe just use that (such arrays) and skip the bit-array…
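A sketch of that packing arithmetic (8-bit payload + 2-bit row + 2-bit column = 12 bits per box; I use a plain byte as a stand-in for the hypothetical Float8, and the names are mine):

```python
def pack_box(payload, row, col):
    """Pack one 4x4 box: 8-bit value payload (stand-in for Float8),
    then 2 bits of row and 2 bits of column -> 12 bits total."""
    assert 0 <= payload < 256 and 0 <= row < 4 and 0 <= col < 4
    return (payload << 4) | (row << 2) | col

def unpack_box(bits):
    return bits >> 4, (bits >> 2) & 0b11, bits & 0b11

EMPTY = pack_box(0, 0, 0)  # empty box: zero payload, position ignored

packed = pack_box(0xAB, 3, 1)
assert packed < 2**12                  # fits in 12 bits = 1.5 bytes
assert unpack_box(packed) == (0xAB, 3, 1)
```

Two such boxes pack into exactly 3 bytes, which is where the 1.5-bytes-per-box figure comes from.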

This gives you 1.5 bytes per 4x4 subarray that would otherwise take 16 bytes (assuming at most one non-zero, the common case), so only ~10.7x compression, and you’ve still not handled the unlikely case of more than one non-zero in some box. I was thinking of some overlay, one for the other 1+ non-zeros (or strictly 0 to max 15 extra) with a different structure (same as Julia already has?), or maybe a list of up to max 15 overlay arrays with the same structure as previously described, with the completely empty boxes eliminated from storage and computation.
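A sketch of that overlay idea, as I read it (names and representation are mine: each box index maps to its list of nonzeros, layer 0 holds the first nonzero of every non-empty box, and layers 1..15 are the overlays for boxes with extras):

```python
def to_overlays(boxes):
    """boxes: dict mapping box index -> list of (row, col, value) nonzeros.
    Layer 0 covers the common at-most-one-nonzero case; further layers
    exist only as long as some box still has more nonzeros left."""
    depth = max((len(v) for v in boxes.values()), default=0)
    assert depth <= 16, "a 4x4 box holds at most 16 values"
    return [{b: nz[k] for b, nz in boxes.items() if len(nz) > k}
            for k in range(depth)]

# box 0 has one nonzero, box 3 has two -> one base layer plus one overlay
layers = to_overlays({0: [(0, 0, 1.0)], 3: [(1, 2, 5.0), (3, 3, 7.0)]})
assert len(layers) == 2
assert layers[1] == {3: (3, 3, 7.0)}
```

Since each layer only stores the boxes that still have nonzeros at its depth, the fully drained boxes drop out of both storage and computation, as suggested above.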