Sparse matrix handling in Julia and the “Sparse is Enough in Scaling Transformers” paper

Where does Julia stand on sparse matrices, for neural networks or otherwise? Neural networks have generally relied on dense matrix multiplication, until now:

The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and 20x improvement for a model with 17B parameters, as shown in Table 1.
[…]
3 Sparse is Enough
We study how to sparsify every part of the Transformer model—otherwise the non-sparse parts
dominate decoding time and become a bottleneck.
[…]
While integrating sparse attention layers into a Scaling Transformer, we notice that the architecture
of the Transformer decoder block is suboptimal and can be redesigned […] We therefore remove the encoder-decoder attention

This seems like a game-changer, and it could tilt neural-network work in Julia’s favor.
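
For context, Julia already ships sparse support in its standard library: SparseArrays stores matrices in compressed sparse column (CSC) format, and a sparse weight matrix can be multiplied against dense activations directly. A minimal sketch (the layer shape and 5% density are just illustrative, not taken from the paper):

```julia
using SparseArrays

# Hypothetical sparse feed-forward weight: 4096x1024 with ~5% nonzeros.
W = sprand(Float32, 4096, 1024, 0.05)   # SparseMatrixCSC{Float32, Int}
b = zeros(Float32, 4096)
x = rand(Float32, 1024)                  # one activation vector

# Sparse-matrix * dense-vector multiply touches only the stored nonzeros.
y = max.(W * x .+ b, 0f0)                # ReLU on top, as in a feed-forward layer

# Sanity check against the dense equivalent.
Wd = Matrix(W)
@assert isapprox(y, max.(Wd * x .+ b, 0f0); rtol = 1e-4)
```

Whether this beats a dense multiply in wall-clock time depends on the density and on the BLAS being used, so it is worth benchmarking at realistic sizes.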

It seems this new 2021 paper could help:

In recent years, several efficient SpGEMM algorithms have been proposed; however, most of them are based on the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. […] We then propose a pattern-based SpGEMM library that provides a unified programming interface in the CSR format, analyses the pattern of two input matrices, and automatically determines the best format, algorithm, and parameter for arbitrary matrix pairs. For this purpose, we build an algorithm set that integrates three newly designed algorithms with existing popular libraries, and design a hybrid deep learning model called MatNet to quickly identify patterns of input matrices and accurately predict the best solution by using sparse features and density representations. The evaluation shows that this library consistently outperforms the state-of-the-art library.

On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSpace, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
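
Those numbers are for SpGEMM, i.e. sparse × sparse matrix products. Julia’s standard library only offers the CSC format (not the CSR format the first quote focuses on), but multiplying two SparseMatrixCSC values already dispatches to a sparse-sparse kernel. A rough sketch, with arbitrary sizes and densities:

```julia
using SparseArrays

# Two random sparse matrices; the output density of an SpGEMM depends on
# how the sparsity patterns overlap, not just on the input densities.
A = sprand(Float64, 10_000, 10_000, 1e-3)
B = sprand(Float64, 10_000, 10_000, 1e-3)

C = A * B                        # sparse * sparse -> SparseMatrixCSC
@show nnz(A) nnz(B) nnz(C)

# Crude timing; use BenchmarkTools.@btime for real measurements.
@time A * B
```

A dense `Matrix(A) * Matrix(B)` at these sizes would need roughly 800 MB per operand, which is part of why specialized SpGEMM libraries like the ones quoted above matter.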

See here: https://github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb

This colab contains all relevant code for the paper “Sparse is Enough in Scaling Transformers”. We depend on the Trax library
