we propose a new flexible storage format for sparse matrices and compare its performance with the compressed row storage format using abstract benchmarks based on common characteristics of finite element problems

Maybe Julia should adopt this new CRAC format from last year:

In contrast, this paper focuses on the application of synchronization primitives in order to derive data-race free data structures and algorithms and compares their respective performance. Specifically, the usage of atomics is emphasized
[…]
In this paper we introduce a new flexible sparse matrix storage format designated compressed row aligned columns (CRAC). The flexibility of the format stems from its storage of blocks of various sizes, making it independent of the problem type. Consequently, this format integrates well in finite element programs, where the problem to be solved is defined at runtime (dynamically).

Remark 1.1 The focus of this paper is not the improvement in performance of a specific finite element problem or the definition of a correspondingly matching sparse matrix format, but rather the introduction and a general investigation of the performance of parallel assembly algorithms with respect to the parallelization strategy and the sparse matrix format. Consequently, the computation of each element stiffness matrix is not the main focus of this article and dummy elements are employed.

using lock-free methods does not guarantee higher performance.
Additionally, the CRAC format now performs better than the CSR format for p > 3.

Unlike the above, what follows is just my speculation (moved from another thread, since it was maybe a bit off-topic there):

@elrod [are you only working on dense?]
This (likely) “sparsity 133/8 = 16.625” got me thinking: how would you design hardware (or software; maybe we can do better for Julia) for sparse matrices?

I’m guessing matrices are split up into 4x4 = 16-sized blocks/submatrices (would 1x16 be better for one matrix and 16x1 for the other you are multiplying with, or maybe 2x8, etc.?).

If your matrix is very sparse, at most one value will be nonzero in each of those submatrices. You could encode the emptiness of the submatrices in 1 bit: 0 for empty, 1 for non-empty (one nonzero, or hopefully not more). With such a dense bit array encoding the structure, you can bit-multiply to see where you actually need to calculate the actual result later, in a separate step. [This would also work for larger blocks; I’m not wedded to 16 per block.]
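To make that bit-multiply idea concrete, here is a minimal Python/NumPy sketch (my own illustration, not from the quoted paper; 4x4 blocks assumed, and the function names are made up):

```python
import numpy as np

BLOCK = 4  # assumed 4x4 = 16-element blocks, as in the text

def occupancy_mask(A, block=BLOCK):
    """One bit per block: True if the block contains any nonzero."""
    nbr, nbc = A.shape[0] // block, A.shape[1] // block
    mask = np.zeros((nbr, nbc), dtype=bool)
    for i in range(nbr):
        for j in range(nbc):
            mask[i, j] = np.any(A[i*block:(i+1)*block, j*block:(j+1)*block])
    return mask

def needed_output_blocks(mask_a, mask_b):
    """The 'bit-multiply': the block product (i,k)x(k,j) only matters when
    both operand blocks are non-empty, so a matmul of the occupancy masks
    shows which output blocks can receive any contribution at all."""
    return (mask_a.astype(np.uint32) @ mask_b.astype(np.uint32)) > 0

# very sparse 8x8 demo: one nonzero in each operand
A = np.zeros((8, 8)); A[0, 5] = 1.0
B = np.zeros((8, 8)); B[5, 2] = 2.0
out = needed_output_blocks(occupancy_mask(A), occupancy_mask(B))
# only output block (0, 0) needs the expensive second step
```

The expensive floating-point work then runs only over the blocks flagged in `out`, in a separate pass.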

You also need to keep track of your actual non-zero values: for e.g. Float8 in a 4x4 box, you need 8 bits, plus 2 bits x 2 bits for its position = 4 bits. If the box is empty you could store 0.0 in the Float8 part, and use or ignore the position. And maybe just use that (such arrays) and skip the bit-array…
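A sketch of that packing arithmetic (8-bit payload + 2-bit row + 2-bit column = 12 bits per box; I use a plain byte as a stand-in for the hypothetical Float8, and the names are mine):

```python
def pack_box(payload, row, col):
    """Pack one 4x4 box: 8-bit value payload (stand-in for Float8),
    then 2 bits of row and 2 bits of column -> 12 bits total."""
    assert 0 <= payload < 256 and 0 <= row < 4 and 0 <= col < 4
    return (payload << 4) | (row << 2) | col

def unpack_box(bits):
    return bits >> 4, (bits >> 2) & 0b11, bits & 0b11

EMPTY = pack_box(0, 0, 0)  # empty box: zero payload, position ignored

packed = pack_box(0xAB, 3, 1)
assert packed < 2**12                  # fits in 12 bits = 1.5 bytes
assert unpack_box(packed) == (0xAB, 3, 1)
```

Two such boxes pack into exactly 3 bytes, which is where the 1.5-bytes-per-box figure comes from.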

This gives you 1.5 bytes per 4x4 subarray that would otherwise take 16 bytes (assuming at most one non-zero, the common case), so only ~10.7x compression, and you’ve still not handled the unlikely case of more than one non-zero in some box. I was thinking of some overlay, one for the other 1+ non-zeros (or strictly 0 to max 15 extra) with a different structure (same as Julia already has?), or maybe a list of up to max 15 overlay arrays with the same structure as previously described, with the completely empty boxes eliminated from storage and computation.
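A sketch of that overlay idea, as I read it (names and representation are mine: each box index maps to its list of nonzeros, layer 0 holds the first nonzero of every non-empty box, and layers 1..15 are the overlays for boxes with extras):

```python
def to_overlays(boxes):
    """boxes: dict mapping box index -> list of (row, col, value) nonzeros.
    Layer 0 covers the common at-most-one-nonzero case; further layers
    exist only as long as some box still has more nonzeros left."""
    depth = max((len(v) for v in boxes.values()), default=0)
    assert depth <= 16, "a 4x4 box holds at most 16 values"
    return [{b: nz[k] for b, nz in boxes.items() if len(nz) > k}
            for k in range(depth)]

# box 0 has one nonzero, box 3 has two -> one base layer plus one overlay
layers = to_overlays({0: [(0, 0, 1.0)], 3: [(1, 2, 5.0), (3, 3, 7.0)]})
assert len(layers) == 2
assert layers[1] == {3: (3, 3, 7.0)}
```

Since each layer only stores the boxes that still have nonzeros at its depth, the fully drained boxes drop out of both storage and computation, as suggested above.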