Many breakthroughs: Complex-valued transformer neural networks, or even "quaternion backpropagation", or none at all? Predictive coding

[I usually post ML topics under “offtopic”, but I think I may get better answers and discussion here. I’m not asking for a solution to a very specific problem I have. I start with interesting practical developments, then move to the more theoretical.]

It seems Transformers/attention are on the way out: entirely with SiMBA, or mostly/partially with Jamba, an NLP LLM from AI21 Labs (the company hiring Julia programmers). Jamba is “production-grade” and the best model to fit on one GPU. Are Universal Transformers (see below) the mainstream transformer/GPT model, as opposed to the less powerful, non-Turing-complete Transformers from the classic “Attention is all you need” paper (which is wrong)?

It supports a 256K context length, and can fit up to 140K tokens on a single 80GB GPU.

Backpropagation is also on the way out, with predictive coding taking over; see below. MoE is on the way out too and will be replaced by modular ideas that are similar, though. I understand [Q]LoRA is also on the way out. I’m not sure about FSDP-QLoRA, which is new to me, but I suspect it’s on the way out as well. It is, however, brand-new: Answer.AI - You can now train a 70b language model at home

The 7B-based Jamba model (12B active parameters, 52B total available parameters) we are releasing was designed to fit in a single 80GB GPU,

The first, and still only, model tagged with the Julia label (it is a fine-tuned variant): ajibawa-2023/Code-Jamba-v0.1 · Hugging Face

It is finetuned on Jamba-v0.1 . It is very very good in Code generation in various languages such as Python, Java, JavaScript, GO, C++, Rust, Ruby, Sql, MySql, R, Julia, Haskell, etc…

Also: Severian/Jamba-Hercules · Hugging Face “Name was changed from Open-Hermes to Hercules. During multiple trainings and testings with lots of different datasets, I found that Jamba has BY FAR reacted the best to this dataset.”

Should Julia (or Lux.jl and/or CUDA.jl) support FP4, NF4, 4-bit AF4 (AbnormalFloats), and/or int2? (I believe none of those; we’ll see 1-bit, e.g. based on my own idea.) Such formats are likely only useful for neural networks, and two FP4 values can be packed into a UInt8, so maybe no separate type is needed in Base, only in some library, or the Python one is needed:
Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

For FP4 there is no fixed format and as such one can try combinations of different mantissa/exponent combinations. In general, 3 exponent bits do a bit better in most cases. But sometimes 2 exponent bits and a mantissa bit yield better performance.
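
To make the “no fixed format” point concrete, here is a minimal sketch of decoding a 4-bit float with a configurable exponent/mantissa split, plus packing two 4-bit codes into one byte. The bit layout (sign, then exponent, then mantissa), the bias values, subnormals at exponent 0, and the absence of inf/NaN are all my assumptions, not something the quoted post specifies; `decode_fp4`, `pack_pair`, and `unpack_pair` are hypothetical names.

```python
def decode_fp4(code, exp_bits=2, man_bits=1, bias=1):
    """Decode a 4-bit code into a float.

    Assumed layout: 1 sign bit, `exp_bits` exponent bits, `man_bits` mantissa
    bits, subnormals at exponent 0, and no inf/NaN encodings.
    """
    assert 0 <= code < 16 and 1 + exp_bits + man_bits == 4
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = code & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:  # subnormal: no implicit leading 1
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


def pack_pair(low, high):
    """Store two 4-bit codes in one byte (low nibble first)."""
    assert 0 <= low < 16 and 0 <= high < 16
    return (high << 4) | low


def unpack_pair(byte):
    """Recover the two 4-bit codes from one byte."""
    return byte & 0x0F, byte >> 4


# Compare the positive values of two possible splits.
print("2 exponent bits, 1 mantissa bit:", [decode_fp4(c) for c in range(8)])
print("3 exponent bits, 0 mantissa bits:",
      [decode_fp4(c, exp_bits=3, man_bits=0, bias=3) for c in range(8)])
```

The 3-exponent-bit variant covers a wider range with coarser steps, while the 2+1 variant is denser near zero, which is one way to read why neither wins everywhere.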

4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights
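
The idea behind a data type tuned for normally distributed weights can be sketched with quantiles of the standard normal. This is a simplified construction of my own for intuition only: it is not the exact table that bitsandbytes ships (for one thing, this version has no exact zero), and `normal_quantile_codebook`, `quantize_block`, and `dequantize_block` are hypothetical names.

```python
from statistics import NormalDist


def normal_quantile_codebook(levels=16):
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p = 0 and p = 1.
    quantiles = [normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def nearest_index(codebook, value):
    """Index of the code closest to `value`."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def quantize_block(weights, codebook):
    """Absmax-scale one block, then map each weight to its nearest code."""
    scale = max(abs(w) for w in weights) or 1.0
    return [nearest_index(codebook, w / scale) for w in weights], scale


def dequantize_block(indices, scale, codebook):
    """Map code indices back to approximate weights."""
    return [codebook[k] * scale for k in indices]


codebook = normal_quantile_codebook()
indices, scale = quantize_block([0.1, -0.4, 0.25, 0.8], codebook)
print(indices, dequantize_block(indices, scale, codebook))
```

Each index fits in 4 bits, so only the indices plus one scale per block need to be stored.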

This library is newer (and thus competing with bitsandbytes?); the blog post is from last month: Quanto: a pytorch quantization toolkit

For lower bitwidth quantization types, such as int2 or int4, the projection is affine […] Activations are dynamically quantized using static scales (defaults to the range [-1, 1]). The model needs to be calibrated to evaluate the best activation scales (using momentum).
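
A minimal sketch of what an affine projection to a low bit width looks like, assuming a per-tensor scale and a float offset in place of an integer zero-point. This is not Quanto’s API; `affine_quantize` and `affine_dequantize` are hypothetical helpers.

```python
def affine_quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] with an affine projection."""
    levels = (1 << bits) - 1
    lowest, highest = min(values), max(values)
    # Fall back to 1.0 when all values are equal, to avoid dividing by zero.
    scale = (highest - lowest) / levels or 1.0
    codes = [round((v - lowest) / scale) for v in values]
    return codes, scale, lowest


def affine_dequantize(codes, scale, lowest):
    """Invert the projection: code * scale + offset."""
    return [lowest + c * scale for c in codes]


weights = [-1.0, -0.2, 0.3, 1.0]
codes, scale, lowest = affine_quantize(weights, bits=2)
print(codes, affine_dequantize(codes, scale, lowest))
```

With 2 bits there are only four levels, so the worst-case rounding error is half a step, i.e. `scale / 2`.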

@stevengj This new “Einstein FFT (EinFFT)” (“the application of Einstein Matrix multiplication on complex number representations”) is intriguing, as is the Microsoft SiMBA model that introduced it, which is state-of-the-art (e.g. for “Multivariate Time Series Forecasting, including Electricity, Weather, Traffic”, see table 3):

SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series

Abstract. Transformers have widely adopted attention networks for sequence mixing and MLPs for channel mixing, playing a pivotal role in achieving breakthroughs across domains. However, recent literature highlights issues with attention networks, including low inductive bias and quadratic complexity concerning input sequence length. […]
We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling. Extensive performance studies across image and time-series benchmarks demonstrate that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers.
Notably, SiMBA establishes itself as the new state-of-the-art SSM on ImageNet and transfer learning benchmarks such as Stanford Car and Flower as well as task learning benchmarks as well as seven time series benchmark datasets.

SiMBA addresses this gap by incorporating Mamba for token mixing, replacing attention networks, and leveraging Einstein FFT (EinFFT) for channel mixing.
SiMBA introduces the Einstein blending method for channel mixing, offering a novel approach without the constraints of requiring perfect square dimensions for sequence length N and channel dimensions. Furthermore, SiMBA adopts the pyramid version of the transformer architecture, providing a significant performance boost compared to vanilla state space models. While many state space models reduce complexity to O(N log(N)), they often fall short of achieving the performance levels seen in state-of-the-art attention-based transformers.

3 Method
In this study, we introduce EinFFT, a novel approach for frequency-domain channel mixing utilizing Einstein Matrix multiplication. EinFFT is specifically designed for complex number representations of frequency components, enabling the effective capture of key patterns in Image patch data with a global view and energy compaction. It must be noted that EinFFT is also applicable for other sequence data modalities like time series or speech or even text data. We have validated EinFFT-based SiMBA for image and time series benchmark datasets.
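
To get a feel for “frequency-domain channel mixing” with complex weights, here is a toy sketch: transform along the token axis, mix channels with a complex matrix at every frequency (an einsum `nc,cd->nd` written out by hand), and transform back. This is only my reading of the description above, not the SiMBA implementation: the choice of axis, the single dense weight matrix, and the omission of any nonlinearity or block structure are assumptions, and a naive DFT stands in for an FFT.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(N^2) discrete Fourier transform with orthonormal scaling."""
    n = len(signal)
    sign = 1 if inverse else -1
    return [
        sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)) / n ** 0.5
        for k in range(n)
    ]


def frequency_channel_mix(tokens, weights):
    """tokens: N rows of C real channels; weights: C x C (possibly complex)."""
    n, c = len(tokens), len(tokens[0])
    # 1. Transform every channel along the token axis.
    spectrum = [dft([tokens[t][ch] for t in range(n)]) for ch in range(c)]
    # 2. Mix channels at each frequency: mixed[d][k] = sum_ch spectrum[ch][k] * W[ch][d].
    mixed = [
        [sum(spectrum[ch][k] * weights[ch][d] for ch in range(c)) for k in range(n)]
        for d in range(c)
    ]
    # 3. Back to the token domain, keeping the real part.
    restored = [dft(mixed[d], inverse=True) for d in range(c)]
    return [[restored[d][t].real for d in range(c)] for t in range(n)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swap_channels = [[0, 1], [1, 0]]
print(frequency_channel_mix(tokens, swap_channels))
```

Without a nonlinearity the frequency-domain step is just a linear map, so the interesting behaviour in a real model would come from whatever is applied between the two transforms.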

Jamba: A Hybrid Transformer-Mamba Language Model

Jamba is trained on an in-house dataset that contains text data from the Web, books, and code, with the last update in March 2024.

I note “code”, but it is unclear whether that includes Julia code… though this is the company hiring Julia programmers…

Table 2: Comparison of Jamba with other publicly available models. Jamba obtains similar or better performance with much better throughput. [Sometimes the best, though slightly lower on the code metrics, I noticed.]

In summary, Jamba demonstrates the ability of hybrid architectures to reach the performance of state-of-the-art Transformer based models of the same size class, while having the benefits of an SSM.

5.2 Long-Context Evaluations
We have successfully trained Jamba models with context lengths of up to 1M tokens. The released model handles context lengths of up to 256K tokens.

It uses “int8 quantization”. It’s known that you can go to 4-bit (the current mainstream) or even 2-bit, for 2x and 4x more parameters in the same memory, and that 2-bit (or lower!) is enough for at least some other architectures. FP4 is also better than int4. So it’s worth trying here too, and I would like to know why this wasn’t done from the start.
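
The 2x/4x arithmetic, as a quick back-of-the-envelope sketch. It counts weights only and ignores quantization scales, activations, and the KV cache; the 52B figure is the total parameter count quoted above, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 52e9  # "52B total available parameters" from the Jamba announcement

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):6.1f} GB")
```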

On long-context evaluations, Jamba outperforms Mixtral on most of the evaluated datasets. At the same time, Jamba is extremely efficient; for example, its throughput is 3x that of Mixtral-8x7B for long contexts. Moreover, our model fits in a single GPU (with 8bit weights) even with contexts of over 128K tokens, which is impossible with similar-size attention-only models such as Mixtral-8x7B.
Somewhat unusually for a new architecture, we release Jamba (12B active parameters, 52B total available parameters) under Apache 2.0 license: ai21labs/Jamba-v0.1 · Hugging Face.
We do so since we feel that the novel architecture of Jamba calls for further study, experimentation, and optimization by the community.
In summary, the different degrees of freedom in the Jamba architecture are:
• l: The number of layers.
• a : m: ratio of attention-to-Mamba layers.
• e: how often to use MoE instead of a single MLP.
• n: total number of experts per layer.
• K: number of top experts used at each token

Given this design space, Jamba provides flexibility in preferring certain properties over others. For example, increasing m and decreasing a, that is, increasing the ratio of Mamba layers at the expense of attention layers, reduces the required memory for storing the key-value cache. This reduces the overall memory footprint, which is especially important for processing long sequences. Increasing the ratio of Mamba layers also improves throughput, especially at long sequences. However, decreasing a might lower the model’s capabilities.
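
The effect of the a : m ratio on the key-value cache can be sketched with simple arithmetic. All concrete numbers below (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical placeholders I picked for illustration, not figures taken from the quoted text, and Mamba layers are assumed to add no per-token cache.

```python
def attention_layer_count(total_layers, a, m):
    """Number of attention layers for an a : m attention-to-Mamba ratio."""
    return total_layers * a // (a + m)


def kv_cache_bytes(attention_layers, kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Size of the key-value cache for one sequence."""
    # Factor 2: one key vector and one value vector per head, layer, and token.
    return 2 * attention_layers * kv_heads * head_dim * context_tokens * bytes_per_value


CONTEXT = 256 * 1024  # 256K tokens
GIB = 2 ** 30

for a, m in ((1, 0), (1, 1), (1, 7)):
    layers = attention_layer_count(32, a, m)
    size = kv_cache_bytes(layers, kv_heads=8, head_dim=128, context_tokens=CONTEXT)
    print(f"a:m = {a}:{m} -> {layers:2d} attention layers, {size / GIB:.0f} GiB KV cache")
```

Under these assumptions the cache shrinks in direct proportion to the number of attention layers, which is the memory benefit the paragraph above describes.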

Additionally, balancing n, K, and e affects the relationship between active parameters and total available parameters. A larger n increases the model capacity at the expense of memory footprint, while a larger K increases the active parameter usage and the compute requirement. In contrast, a larger e decreases the model capacity, while decreasing both compute (when K>1) and memory requirements, and allowing for less communication dependencies (decreasing memory transfers as well as inter-GPU communication during expert-parallel training and inference).
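
How n, K, and e trade off total against active parameters can be sketched as follows. The sizes are arbitrary units I made up, router parameters are ignored, and `moe_parameter_counts` is a hypothetical helper rather than anything from the paper.

```python
def moe_parameter_counts(layers, e, n, k, mlp_params, other_params_per_layer):
    """Return (total, active) parameter counts when every e-th layer uses MoE.

    n is the number of experts per MoE layer; k is the number used per token.
    """
    moe_layers = layers // e
    dense_layers = layers - moe_layers
    # Attention/Mamba weights and the plain MLPs are always used.
    shared = layers * other_params_per_layer + dense_layers * mlp_params
    total = shared + moe_layers * n * mlp_params
    active = shared + moe_layers * k * mlp_params
    return total, active


base = dict(layers=32, e=2, n=16, k=2, mlp_params=100, other_params_per_layer=50)
print("baseline   :", moe_parameter_counts(**base))
print("larger n   :", moe_parameter_counts(**{**base, "n": 32}))
print("larger K   :", moe_parameter_counts(**{**base, "k": 4}))
print("larger e   :", moe_parameter_counts(**{**base, "e": 4}))
```

A larger n only grows the total, a larger K only grows the active count, and a larger e shrinks both, matching the three statements in the quoted paragraph.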
The tokenizer is trained with BPE [15, 29, 39] and each digit is a separate token [6]. We also remove the dummy space used in Llama and Mistral tokenizers for more consistent and reversible tokenization.
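
“Each digit is a separate token” can be approximated with a pre-tokenization step like the one below. This is only a sketch of the splitting rule; how Jamba’s tokenizer actually implements it is not stated here, BPE merging is not shown, and `split_digits` is a hypothetical name.

```python
import re


def split_digits(text):
    """Split text so every digit stands alone; runs of non-digits stay intact."""
    return re.findall(r"\d|\D+", text)


pieces = split_digits("x = 2024")
print(pieces)
# Joining the pieces gives back the original text, so the split is reversible.
print("".join(pieces) == "x = 2024")
```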

I meant to ask: why are we not seeing more complex-valued (or hyper-complex-valued) neural networks, given their promise and fewer parameters (ANNs are already memory-bound)? See my own speculation below. But ANNs (and backpropagation) are also pretty far from biological neural networks, and going more complex might be a step in the wrong direction…
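
One way to read the “fewer parameters” argument: a single complex weight has two real parameters, yet it acts on a (real, imaginary) pair like a constrained 2x2 real matrix with four entries. A small sketch to check that equivalence (`complex_as_real_matrix` and `apply_matrix` are hypothetical helper names):

```python
def complex_as_real_matrix(w):
    """The 2x2 real matrix that acts on (re, im) like multiplication by w."""
    return [[w.real, -w.imag], [w.imag, w.real]]


def apply_matrix(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


w = 2 + 3j
z = 1 - 1j
product = w * z
as_matrix = apply_matrix(complex_as_real_matrix(w), [z.real, z.imag])
print(product, as_matrix)
```

The matrix form needs four stored numbers where the complex weight needs two, which is the kind of saving a complex-valued layer would build in by construction.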

Can the Brain Do Backpropagation? — Exact Implementation of Backpropagation in Predictive Coding Networks

Despite tremendous efforts, however, no previous model has bridged the gaps at a degree of demonstrating an equivalence to BP […] Here, we present for the first time a framework […] bridges the above crucial gaps. We propose a BL model that (1) produces exactly the same updates of the neural weights as BP, while (2) employing local plasticity, i.e., all neurons perform only local computations, done simultaneously. We then modify it to an alternative BL model that (3) also works fully autonomously. Overall, our work provides important evidence for the debate on the long-disputed question whether the brain can perform BP.

In 2021:

To our knowledge, we are the first to show that a biologically plausible algorithm is able to exactly replicate the accuracy of BP [backpropagation] on such complex architectures, bridging the existing gap […] , and setting an unprecedented performance for PCNs, which can now be considered as efficient alternatives to BP.

2022 survey paper:

Neuroscience-inspired learning algorithms, however, such as predictive coding, which utilize local learning, have the potential to overcome these limitations and advance beyond current deep learning technologies. While predictive coding originated in theoretical neuroscience as a model of information processing in the cortex, recent work has developed the idea into a general-purpose algorithm able to train neural networks using only local computations. In this survey, we review works that have contributed to this perspective and demonstrate the close theoretical connections between predictive coding and backpropagation
Specifically, we show the substantially greater flexibility of predictive coding networks against equivalent deep neural networks, which can function as classifiers, generators, and associative memories simultaneously
PC (K. Friston, 2005; Rao & Ballard, 1999; Srinivasan, Laughlin, & Dubs, 1982) has emerged as an influential theory in computational neuroscience, which has a significant mathematical foundation as variational inference, linking it closely with normative theories of the Bayesian brain (Knill & Pouget, 2004), and which provides a single mechanism that can explain many varied perceptual and neurophysiological effects (Auksztulewicz & Friston, 2016; Hohwy, Roepstorff, & Friston, 2008; Lotter, Kreiman, & Cox, 2016), while also postulating a biologically plausible neural dynamics and synaptic update rules (K. Friston, 2003; Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; Millidge, Tschantz, Seth, & Buckley, 2020b).

The fundamental idea of PC is to treat the cortex as performing simultaneous inference and learning on a hierarchical probabilistic generative model, which is trained in an unsupervised setting to predict incoming sensory signals (Clark, 2015; K. Friston, 2005; Rao & Ballard, 1999). In such an architecture, at each layer of the hierarchy, top-down predictions emanating from higher layers are matched with and cancel out incoming sensory data or prediction errors from lower layers. Unexplained aspects of the sensory data, in the form of prediction errors, are then transmitted upwards for higher layers of the hierarchy to explain
3 Predictive Coding and Backpropagation
Recently, multiple results have explored similarities and relationships between PC and BP, showing that PC can closely approximate or exactly perform BP under certain conditions on supervised learning tasks.
Second, it provides a novel (but equivalent) implementation of BP, which is able to learn via local computations. All these results are experimentally validated on multiple architectures, such as LSTMs (Hochreiter & Schmidhuber, 1997), transformers (Vaswani et al., 2017), and ResNets (He, Zhang, Ren, & Sun, 2016). A historical sketch of these results is given in Fig. 2(a).
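
To show what “learning via local computations” can look like, here is a toy predictive coding network: a scalar chain x0 -> x1 -> x2 with linear predictions. It is a generic sketch under my own assumptions (squared-error energy, fixed step sizes, output clamped to the target), not the exact-backpropagation variant from the quoted papers; `train_predictive_coding` is a hypothetical name.

```python
def train_predictive_coding(x0, target, w1=0.5, w2=0.5, epochs=200,
                            inference_steps=20, inference_rate=0.1,
                            learning_rate=0.1):
    """Train a chain x0 -> x1 -> x2 with predictions mu1 = w1*x0, mu2 = w2*x1."""
    for _ in range(epochs):
        x1 = w1 * x0   # initialise the hidden node with a forward pass
        x2 = target    # the output node is clamped to the label
        for _ in range(inference_steps):
            e1 = x1 - w1 * x0  # prediction error at the hidden node
            e2 = x2 - w2 * x1  # prediction error at the output node
            # Gradient descent on F = (e1**2 + e2**2) / 2 with respect to x1.
            x1 += inference_rate * (-e1 + w2 * e2)
        e1 = x1 - w1 * x0
        e2 = x2 - w2 * x1
        # Each weight update uses only its own error and its own input.
        w1 += learning_rate * e1 * x0
        w2 += learning_rate * e2 * x1
    return w1, w2


w1, w2 = train_predictive_coding(x0=1.0, target=2.0)
print("feed-forward prediction after training:", w2 * w1 * 1.0)
```

No error signal is passed backwards through a separate pass here: the hidden node settles first, and then every weight changes based on quantities available at its own connection.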

That survey paper is from 2022 (on ideas from 2005, or even 1982, predating backpropagation), so why do good ideas take so long to take over? It seems that science/ANN research, or at least practice, develops (like biology) through incremental improvements, i.e. small modifications to current practice. That is understandable: large LLMs are very costly to train, and you probably want to change one or a few options at a time, to know what works, or to get a paper out. [Training a large model from scratch is now down to just under $0.1 million, i.e. for one competitive with Llama2. So maybe people can afford to take more risks.]

More interesting stuff, I lost track of citations:

6 Learning on Arbitrary Graph Topologies
Learning on networks of any structure is not possible using BP, where information first flows in one direction via the feedforward pass, and the error in the reverse direction during the backwards pass. Hence, a cycle in the computational graph of an ANN trained with BP would cause an infinite loop. While the problem of training on some specific cyclic structures has been partially addressed using BP through time (Hochreiter & Schmidhuber, 1997; Rumelhart et al., 1986; Williams & Zipser, 1989) on sequential data, the restriction to hierarchical architectures may present a limitation to reaching brain-like intelligence, since the human brain has an extremely complex and entangled neural structure that is heterarchically organized with small-world connections (Avena-Koenigsberger, Misic, & Sporns, 2018)—a topology that is likely highly optimized by evolution. Hence, a recent direction of research aimed to extend learning to arbitrary graph topologies. A popular example is the assembly calculus (Papadimitriou, Vempala, Mitropolsky, Collins, & Maass, 2020), a Hebbian learning method that can perform different operations implicated in cognitive phenomena. However, Hebbian learning methods cannot perform well compared to error-driven ones such as BP (Movellan, 1991). PC, however, has both the desired properties that allow high-quality representation learning on arbitrary graph topologies: it is error-driven, and only learns via local computations. Moreover, it has been shown that it is possible to perform generation and classification tasks on extremely entangled networks, which closely resemble brain regions (Salvatori et al., 2022). This enables a more general learning framework, which converges to a global solution via energy minimization that can perform multiple tasks simultaneously, such as classification and generation, but also to develop novel architectures, optimized for a single specific task. 
Tested on generation, reconstruction, and denoising tasks, this model has been shown to have a performance superior or comparable to standard autoencoders.

The assembly calculus in the section above is very intriguing; see the paper below (though it is likely a distraction, with PC being better; see above):

Brain computation by assemblies of neurons

Assemblies and their operations constitute a computational model of the brain which we call the Assembly Calculus, occupying a level of detail intermediate between the level of spiking neurons and synapses and that of the whole brain. The resulting computational system can be shown, under assumptions, to be, in principle, capable of carrying out arbitrary computations. We hypothesize that something like it may underlie higher human cognitive functions such as reasoning, planning, and language.

After introducing the required fundamentals and quaternion mathematics, we showed that by using plain partial derivatives with respect to the quaternion components as in other approaches to quaternion backpropagation, the product and more critical, the chain rule, does not hold. By applying the GHR calculus, we end up with derivatives which do, to create our novel quaternion backpropagation algorithm. We further provided insights on the relation of automatic differentiation and quaternion backpropagation, and pointed out a scenario where automatic differentiation can be used to train QNN.
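
One ingredient behind these difficulties, as far as I understand it, is that quaternion multiplication does not commute, so rules carried over from real or complex calculus need care. The sketch below only demonstrates that non-commutativity with the Hamilton product; it does not implement the GHR calculus.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


I = (0, 1, 0, 0)
J = (0, 0, 1, 0)

print("i * j =", hamilton_product(I, J))
print("j * i =", hamilton_product(J, I))
```

The two products differ in sign, so the order of factors matters in every derivative that involves a product of quaternions.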

This seems interesting for the “episodic memory module” (this older paper is referenced in a newer one below):
Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

[…] We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging
Similar to dynamic memory networks (Kumar et al., 2016), there is an iterative attention process in UTs that allows the model to condition its attention over memory on the result of previous iterations.
The results are shown in Table 3. Universal Transformer achieves state-of-the-art results in both the language modeling and reading comprehension setup, outperforming both LSTMs and vanilla Transformers.

Our best fixed UT results used 6 steps. However, the average number of steps that the best UT with dynamic halting took on the test data over all positions and examples was 8.2±2.1. In order to see if the dynamic model did better simply because it took more steps, we trained two fixed UT models with 8 and 9 steps respectively (see last two rows). Interestingly, these two models achieve better results compared to the model with 6 steps, but do not outperform the UT with dynamic halting. This leads us to believe that dynamic halting may act as a useful regularizer for the model via incentivizing a smaller numbers of steps for some of the input symbols, while allowing more computation for others.
UT achieves perfect scores in all the memorization tasks and also outperforms both LSTMs and Transformers in all program evaluation tasks by a wide margin.
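
The per-position dynamic halting idea can be sketched as a loop in which every position keeps applying the same shared step until its accumulated halting score crosses a threshold. This is a heavily simplified illustration of my own: a real UT step attends over all positions and combines intermediate states with weights, neither of which is modelled here, and `step_fn` and `halt_fn` are hypothetical stand-ins for learned functions.

```python
def run_with_dynamic_halting(states, step_fn, halt_fn, threshold=0.99, max_steps=10):
    """Apply one shared step repeatedly; each position stops on its own schedule."""
    states = list(states)  # do not modify the caller's list
    cumulative = [0.0] * len(states)
    steps_taken = [0] * len(states)
    active = [True] * len(states)
    for _ in range(max_steps):
        if not any(active):
            break
        for i in range(len(states)):
            if not active[i]:
                continue
            states[i] = step_fn(states[i])  # same function at every step
            steps_taken[i] += 1
            cumulative[i] += halt_fn(states[i])
            if cumulative[i] >= threshold:
                active[i] = False  # this position is done
    return states, steps_taken


final_states, steps = run_with_dynamic_halting(
    [0.0, 5.0],
    step_fn=lambda s: s + 1.0,
    halt_fn=lambda s: 1.0 if s >= 6 else 0.25,
)
print(final_states, steps)
```

Positions that reach a confident state early stop refining, while others keep computing up to `max_steps`, which is the “more computation for some symbols” behaviour the quoted passage describes.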
