Recent AI developments: RoFormer (transformer with Rotary Position Embedding) and DL to rejuvenate symbolic AI: Neural Production Systems

A. Seems important:

ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING

We investigate various methods to encode positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding (RoPE). The proposed RoPE encodes absolute positional information with a rotation matrix […] Notably, RoPE comes with valuable properties such as the flexibility of being expanded to any sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance in tasks with long texts.
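To get an intuition for what RoPE actually does, here is a rough sketch in plain Julia (my own toy rendering of the rotation idea, not the paper's implementation):

```julia
# Toy sketch of the RoPE idea (not the paper's code): rotate each consecutive
# pair of feature dimensions of a query/key vector by an angle proportional
# to the token's position m.
function rope(x::AbstractVector, m::Integer; base=10_000)
    d = length(x)
    @assert iseven(d)
    y = similar(x, float(eltype(x)))
    for i in 1:d÷2
        θ = base^(-2(i - 1) / d)        # per-pair frequency
        c, s = cos(m * θ), sin(m * θ)
        x1, x2 = x[2i - 1], x[2i]
        y[2i - 1] = c * x1 - s * x2     # 2D rotation of the pair
        y[2i]     = s * x1 + c * x2
    end
    return y
end

# Because queries and keys are rotated by angles m*θ and n*θ, their dot
# product depends on the relative offset (m - n) rather than on the absolute
# positions, which is how RoPE injects relative position into attention scores.
q, k = randn(8), randn(8)
score = rope(q, 5)' * rope(k, 2)
```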

@ChrisRackauckas I thought of you when reading further:

In addition to these approaches, [13] has proposed to model the dependency of position encoding from the perspective of Neural ODEs [1], and [22] has proposed to model the position information in complex space.

I wasn't really expecting (neural) ODEs to have anything to do with natural language processing, let alone "superior performance" for NLP. I hope you're ok with me tagging you, and with me posting in this category at all (so far I've only posted some ML stuff to offtopic, when I don't have very specific questions). Is this new to you? You surely know applications of complex numbers, maybe even for neural networks; I've been waiting for them to hit the mainstream there. I suppose they are much used for differential equations, but I'm ignorant of your work on neural ODEs/PINNs: are complex numbers just standard there?

The paper cites:

ENCODING WORD ORDER IN COMPLEX EMBEDDINGS

We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
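As I read the abstract, the core construction is that each embedding coordinate of a word is a complex number whose amplitude carries the "meaning" and whose phase advances linearly with the word's position. A hedged toy sketch of that idea (random placeholder parameters, not the authors' code):

```julia
# Toy sketch of a complex word embedding: per dimension j, the embedding is
# r_j * exp(i*(ω_j*pos + θ_j)), so the same word at different positions keeps
# its amplitude and differs only in phase. Parameters are random stand-ins
# for what the paper learns.
struct ComplexWordEmbedding
    r::Vector{Float64}   # amplitudes (per dimension)
    ω::Vector{Float64}   # frequencies (per dimension)
    θ::Vector{Float64}   # initial phases (per dimension)
end

ComplexWordEmbedding(d::Int) =
    ComplexWordEmbedding(randn(d), randn(d), 2π .* rand(d))

# Embedding of one word token at integer position `pos`; cis(x) = exp(im*x).
embed(e::ComplexWordEmbedding, pos::Integer) = e.r .* cis.(e.ω .* pos .+ e.θ)

emb = ComplexWordEmbedding(4)
embed(emb, 1)   # same word at positions 1 and 7:
embed(emb, 7)   # same amplitudes, shifted phases
```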

I know Julia has transformers code, and those models seem to be taking over everything, but what about the variants? One problem was the quadratic complexity of attention, but that has been addressed in many variants, e.g. the above and Longformer. See also the Performer variant and more. Is all of that missing from Julia code?
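For context on why the quadratic cost is the sticking point and what the linear-attention variants change, here is a small self-contained sketch (toy dimensions; the φ feature map is the elu(x)+1 choice from Katharopoulos et al., while Performer uses random features instead, but the reassociation trick is the same):

```julia
# Softmax attention materializes an n×n score matrix (O(n²·d)); kernelized
# "linear attention" reassociates the products so that only d×d and n×d
# intermediates appear (O(n·d²)).
softmax_rows(S) = exp.(S) ./ sum(exp.(S); dims=2)

function attention_quadratic(Q, K, V)
    S = Q * K' ./ sqrt(size(Q, 2))      # n×n scores
    softmax_rows(S) * V
end

φ(x) = @. ifelse(x > 0, x + 1, exp(x))  # elu(x) + 1, strictly positive

function attention_linear(Q, K, V)
    Qf, Kf = φ(Q), φ(K)
    KV = Kf' * V                        # d×d, independent of sequence length
    Z  = Qf * sum(Kf; dims=1)'          # per-row normalizer (n×1)
    (Qf * KV) ./ Z
end

n, d = 6, 4
Q, K, V = randn(n, d), randn(n, d), randn(n, d)
attention_quadratic(Q, K, V)            # n×d
attention_linear(Q, K, V)               # n×d, different weighting, same shape
```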

B. This also seems like a big deal:

Neural Production Systems

[…] outperforming state-of-the-art methods using GNNs, and allows for the extrapolation from simple (few object) environments to more complex environments.

See more about it here:

No worries. Generally neural ODEs do worse in NLP. If there's no natural ODE, it's kind of pointless, except… we did recently show at ICML how to make neural ODEs into a recurrent network that automatically does hyperparameter optimization to choose the fewest layers, in a way that also improves training time.
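(To make the ResNet ↔ ODE correspondence concrete: the following is a toy, fixed-step Euler loop, not the ICML algorithm, just the picture that the number of solver steps plays the role of the number of layers, which an adaptive solver would then pick automatically.)

```julia
# A ResNet block h ← h + f(h) is one Euler step of dh/dt = f(h), so solver
# steps ≈ layers. Fixed-step Euler here only to show the correspondence;
# an adaptive solver would choose the step count (i.e. the "depth") itself.
W, b = 0.1 .* randn(4, 4), zeros(4)
f(h) = tanh.(W * h .+ b)                # the learned vector field / "layer"

function euler_depth(h0, nsteps; T=1.0)
    h, Δt = copy(h0), T / nsteps
    for _ in 1:nsteps
        h = h .+ Δt .* f(h)             # one "residual layer" per step
    end
    return h
end

h0 = randn(4)
euler_depth(h0, 4)      # shallow: 4 "layers"
euler_depth(h0, 100)    # deep: 100 "layers" approximating the same flow
```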

For a full discussion of why that algorithm is rather interesting to ML frameworks as a software question, see the blog post:

I can't say I know whether this will ever be "the thing" for NLP, but the blog post goes into why the algorithm is interesting from an AD perspective and how it hits the limitations of many software packages. I think this disconnect between quasi-static optimizers and the truly adaptive nature of ODE solvers is precisely why you haven't seen them showcased throughout a lot of ML: you hit a wall of what the frameworks will optimize, so without new frameworks the methods will seem very slow.


To Chris's point, modern transformer architectures were designed to get around RNNs (and other architectures) not being friendly enough to ML libraries and accelerators. That said, I don't think it's some kind of nail in the coffin for all other research. For example, invertibility has recently been explored in the context of efficient language models, and https://github.com/slimgroup/InvertibleNetworks.jl seems pretty relevant there. The same goes for some of the fancier attention schemes that seem to come out every week: the contortions papers go through to express them using vectorized operations show a) how constraining the building blocks are and b) how much performance is probably left on the table.
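To illustrate why invertibility is attractive there, a hand-rolled additive-coupling (RevNet-style) toy, not the InvertibleNetworks.jl API: the block's input can be reconstructed exactly from its output, so intermediate activations don't have to be stored for the backward pass.

```julia
# Additive coupling: split the features into two halves and let each half be
# updated by a function of the other. The inverse only needs subtractions,
# so activations can be recomputed instead of stored.
F(x) = tanh.(x)          # stand-ins for small learned subnetworks
G(x) = sin.(x)

forward(x1, x2) = (y1 = x1 .+ F(x2); y2 = x2 .+ G(y1); (y1, y2))
inverse(y1, y2) = (x2 = y2 .- G(y1); x1 = y1 .- F(x2); (x1, x2))

x1, x2 = randn(3), randn(3)
y1, y2 = forward(x1, x2)
x1r, x2r = inverse(y1, y2)
(x1r ≈ x1, x2r ≈ x2)     # (true, true): inputs recovered from outputs
```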

Edit: I should also caution that there’s a lot of hype around transformers, but the rigour isn’t always there. Every few months we get a paper evaluating the current crop of efficient attention mechanisms, and inevitably most of them don’t perform nearly as well under a thorough comparison. Likewise for specific architecture modifications: at what point does your “transformer” become a CNN with a patchwise MLP and a couple of FFTs?


Right, I know about the speedup, e.g. 15,000x for NASA, but I always considered that work and your work kind of specialized (you also seemed to confirm that privately). I remember you writing something like: I saw a killer advantage (of Julia) and I exploit it, but for SciML, not regular ML. Seeing neural ODEs show up for NLP (and maybe for transformers in general, not just NLP) makes me even more intrigued and wanting to understand it better.

Would you say that when I see "Neural ODE", it's code for: keep your work in mind, because that term should always be a key to a large speedup? I've been a bit puzzled about why Julia is playing catch-up for ML (not SciML), and about whether Julia should provide a performance edge, not just easier development (it's all clear to me now, from an old offtopic thread of mine). Basically, huge models have an edge elsewhere because of multi-GPU infrastructure. I'm not sure if Julia can use DeepSpeed (or its fork DeeperSpeed) and its most recent breakthrough:

[it doesn't help with only a few GPUs]

Experiments on up to 256 GPUs show that 1-bit Adam enables up to 3.3× higher throughput for BERT-Large pre-training and up to 2.9× higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
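As I understand the paper (sketching the idea for a single worker and ignoring the distributed all-reduce), the core trick is to communicate only a sign per coordinate plus one scale, while feeding the quantization error back into the next step:

```julia
# Error-feedback 1-bit compression as I understand 1-bit Adam's description:
# after a warmup phase the variance term is frozen, and the momentum update
# is compressed to signs plus a single scale; the residual is remembered and
# added back next time so nothing is permanently lost.
function onebit_compress!(err, m)
    v = m .+ err                        # re-inject accumulated error
    scale = sum(abs, v) / length(v)     # one shared magnitude
    q = scale .* sign.(v)               # 1-bit payload: signs (plus the scale)
    err .= v .- q                       # remember what was lost
    return q
end

m   = randn(8)                          # momentum buffer for one step
err = zeros(8)                          # persistent error-feedback buffer
q1  = onebit_compress!(err, m)          # what would actually be communicated
q2  = onebit_compress!(err, m)          # next step re-injects the residual
```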

That was then improved on by 1-bit LAMB, which I think is now the state of the art:

Training an NLP model is now down to hours from days, just from changing to LAMB, if I recall correctly.
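For reference, the LAMB idea itself is a layer-wise "trust ratio" on top of Adam-style moments, which is what keeps very large batch sizes stable. A simplified single-layer sketch (my reading of You et al., not DeepSpeed's implementation, and without the trust-ratio clipping variants):

```julia
using LinearAlgebra   # for norm

# One LAMB step for a single layer: Adam-style moments, decoupled weight
# decay, then rescale the step by ‖w‖ / ‖update‖ (the trust ratio).
function lamb_step!(w, g, m, v; η=1e-3, β1=0.9, β2=0.999, ϵ=1e-6, λ=0.01, t=1)
    @. m = β1 * m + (1 - β1) * g
    @. v = β2 * v + (1 - β2) * g^2
    m̂ = m ./ (1 - β1^t)                          # bias-corrected moments
    v̂ = v ./ (1 - β2^t)
    u = m̂ ./ (sqrt.(v̂) .+ ϵ) .+ λ .* w           # Adam direction + weight decay
    trust = norm(w) / max(norm(u), eps())        # layer-wise trust ratio
    @. w -= η * trust * u
    return w
end

w, g = randn(16), randn(16)
m, v = zeros(16), zeros(16)
lamb_step!(w, g, m, v)
```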

Big kernels like matmuls do not get a speed advantage from a faster language or compiler. If it's all hitting cuBLAS in the end, the speed is the same. So Julia's value proposition is reduced in that area of deep learning, except when things get weird. But I only like it weird :sweat_smile: .
