Sparse matrix handling (in Julia) and Sparse is Enough in Scaling Transformers paper

Palli · December 2, 2021, 2:53pm

Where does Julia stand regarding sparse matrices, for neural networks or otherwise? Neural networks have generally relied on dense matrix multiplication, until now:

The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and 20x improvement for a model with 17B parameters, as shown in Table 1.
[…]
3 Sparse is Enough
We study how to sparsify every part of the Transformer model—otherwise the non-sparse parts
dominate decoding time and become a bottleneck.
[…]
While integrating sparse attention layers into a Scaling Transformer, we notice that the architecture
of the Transformer decoder block is suboptimal and can be redesigned […] We therefore remove the encoder-decoder attention

This seems like a game-changer, and could tilt neural networks in Julia’s favor.

It seems this new 2021 paper could help:

In recent years, several efficient SpGEMM algorithms have been proposed, however, most of them are based on the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. […] We then propose a pattern-based SpGEMM library, that provides a unified programming interface in the CSR format, analyses the pattern of two input matrices, and automatically determines the best format, algorithm, and parameter for arbitrary matrix pairs. For this purpose, we build an algorithm set that integrates three new designed algorithms with existing popular libraries, and design a hybrid deep learning model called MatNet to quickly identify patterns of input matrices and accurately predict the best solution by using sparse features and density representations. The evaluation shows that this library consistently outperforms the state-of-the-art library.
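For anyone unfamiliar with the CSR format the abstract refers to, here is a minimal pure-Python sketch of row-wise SpGEMM on CSR inputs (the classic Gustavson-style accumulation). This is not the paper's library or any of its new algorithms; the function names `dense_to_csr` and `csr_spgemm` are made up for illustration. Note that Julia's `SparseArrays` stores matrices column-wise (`SparseMatrixCSC`), so the same idea would run over columns there.

```python
def dense_to_csr(rows):
    """Convert a list-of-lists dense matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))  # end of this row's nonzeros
    return indptr, indices, data


def csr_spgemm(a, b):
    """Multiply two CSR matrices row by row (hypothetical helper, illustration only)."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # sparse accumulator for row i of C: column -> value
        for k in range(a_ptr[i], a_ptr[i + 1]):
            col_a, v_a = a_idx[k], a_val[k]
            # Nonzero A[i, col_a] adds v_a * B[col_a, :] to row i of C.
            for m in range(b_ptr[col_a], b_ptr[col_a + 1]):
                j = b_idx[m]
                acc[j] = acc.get(j, 0) + v_a * b_val[m]
        for j in sorted(acc):  # keep column indices sorted within the row
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


A = dense_to_csr([[1, 0, 2], [0, 0, 3], [4, 0, 0]])
B = dense_to_csr([[0, 5, 0], [6, 0, 0], [0, 0, 7]])
C = csr_spgemm(A, B)
print(C)
```

The work done is proportional to the number of nonzero products rather than to the full matrix dimensions, which is where the gain over a dense multiply comes from when the inputs are sparse enough.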

On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSpace, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.

See here: https://github.com/google/trax/blob/master/trax/examples/Terraformer_from_scratch.ipynb

This colab contains all relevant code for the paper “Sparse is Enough in Scaling Transformers”. We depend on the Trax library
