Unlike the above, what follows is just my speculation (moved from another thread, since it was maybe a bit off-topic there):
@elrod [are you only working on dense?]
This (likely) “sparsity 133/8 = 16.625” got me thinking: how would you design hardware (or software; maybe we can do better for Julia) for sparse matrices?
I’m guessing matrices would be split up into 4x4 = 16-element blocks/submatrices (would 1x16 for one matrix and 16x1 for the other one you are multiplying be better, or maybe 2x8, etc.?).
If your matrix is very sparse, at most one value will be nonzero in most of those submatrices. You could encode the emptiness of each submatrix in 1 bit: 0 for empty, 1 for non-empty (one nonzero, or hopefully not more). With such a dense bit-array encoding the structure, you can bit-multiply to see where you actually need to compute the real results later, in a separate step. [This would also work for larger blocks; I’m not wedded to 16 per block.]
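A minimal sketch of that structure step, assuming the per-block “non-empty” bitmaps are stored as `BitMatrix` (all names here are made up for illustration, not an existing API):

```julia
# blockmaskA[i, k] is true if the 4x4 block (i, k) of A has any nonzero;
# likewise blockmaskB[k, j] for B. Returns the (i, j, k) block products
# that actually need computing in the later, real multiplication step.
function needed_block_products(blockmaskA::BitMatrix, blockmaskB::BitMatrix)
    I, K = size(blockmaskA)
    K2, J = size(blockmaskB)
    @assert K == K2
    work = Tuple{Int,Int,Int}[]
    for j in 1:J, k in 1:K
        blockmaskB[k, j] || continue          # empty block in B: nothing to do
        for i in 1:I
            blockmaskA[i, k] && push!(work, (i, j, k))
        end
    end
    return work
end
```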
You also need to keep track of the actual non-zero values: for e.g. a Float8 in a 4x4 box, you need 8 bits, plus a 2-bit × 2-bit (row, column) position = 4 bits. If the box is empty you could store 0.0 in the Float8 part and use or ignore the position. And maybe just use that (such arrays) and skip the bit-array…
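Roughly what I mean per block, as a sketch (Julia has no Float8, so I keep the 8-bit value as raw `UInt8` bits; a truly packed layout would be 12 bits, this struct rounds up to 2 bytes for simplicity):

```julia
# One record per 4x4 block, assuming at most one nonzero in the block.
struct Block4x4
    value::UInt8   # bit pattern of the (at most one) nonzero; 0x00 if the block is empty
    pos::UInt8     # low 4 bits used: 2-bit row | 2-bit column inside the 4x4 block
end

block_row(b::Block4x4) = (Int(b.pos >> 2) & 0x3) + 1   # 1-based row within the block
block_col(b::Block4x4) = Int(b.pos & 0x3) + 1          # 1-based column within the block
isblockempty(b::Block4x4) = b.value == 0x00
```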
This gives you 1.5 bytes per 4x4 subarray that would otherwise take 16 bytes (assuming at most one non-zero, the common case), so only about 10.7x compression, and you’ve still not handled the unlikely case where a box has more than one non-zero. I was thinking of some overlay for the remaining 1+ non-zeros (or strictly 0 to at most 15 extra) with a different structure (the same one Julia already has?), or maybe a list of up to 15 overlay arrays with the same structure as described above, with the completely zero overlay arrays eliminated from storage and computation.
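A quick sanity check of that arithmetic, plus a rough container for the overlay idea (reusing the hypothetical `Block4x4` sketch above; the names are made up):

```julia
dense_bits  = 16 * 8        # full 4x4 block of 8-bit floats: 128 bits = 16 bytes
packed_bits = 8 + 2 + 2     # one value + 2-bit row + 2-bit column: 12 bits = 1.5 bytes
dense_bits / packed_bits    # ≈ 10.67x compression

# Base layer holds the first nonzero of each block; blocks with 2+ nonzeros
# spill into further layers of the same layout (up to 15 of them).
struct OverlaidBlockSparse
    base::Matrix{Block4x4}
    overlays::Vector{Matrix{Block4x4}}
end
```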
Good design?