A. Seems important:
ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
We investigate various methods to encode positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding (RoPE). The proposed RoPE encodes absolute positional information with rotation matrix […] Notably, RoPE comes with valuable properties such as flexibility of being expand to any sequence lengths, decaying inter-token dependency with increasing relative distances, and capability of equipping the linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance in tasks with long texts.
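To make the "rotation matrix" idea concrete, here is a minimal sketch of how I understand it, not the paper's reference code. Each consecutive pair of vector components is treated as one complex number and multiplied by a unit complex number whose angle grows with the token position. The function names `rope` and `dot`, the example vectors, and the frequency schedule `base ** (-2i/d)` with `base=10000` are my assumptions for illustration.

```python
import cmath

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles (RoPE-style sketch).

    Assumption: pair i uses frequency theta_i = base ** (-2*i/d).
    """
    d = len(x)
    assert d % 2 == 0, "dimension must be even so components can be paired"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        # Treat (x[2i], x[2i+1]) as a complex number and rotate it by pos * theta.
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    """Plain real inner product, as used for attention scores."""
    return sum(u * v for u, v in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
s_near = dot(rope(q, 5), rope(k, 2))
s_far = dot(rope(q, 103), rope(k, 100))
```

The two scores should agree up to floating-point error: positions enter as absolute rotations, but the query-key inner product only sees the difference of the two positions. Rotations also leave vector norms unchanged.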
@ChrisRackauckas I thought of you when reading further:
In addition to these approaches,  has proposed to model the dependency of position encoding from the perspective with Neural ODE , and  has proposed to model the position information in complex space.
I wasn’t really expecting (Neural) ODEs to have anything to do with natural language processing, let alone “superior performance” for NLP. I hope you’re OK with me tagging you, and with me posting in this category (so far I’ve only posted some ML stuff to offtopic, when I did not have very specific questions). Is this new to you? You know applications of complex numbers, maybe even for neural networks, and I’ve been waiting for them to hit the mainstream there. I suppose they are much used for differential equations, but I’m ignorant of your work, neural ODEs, and PINNs. Are complex numbers just standard there?
The paper cites:
ENCODING WORD ORDER IN COMPLEX EMBEDDINGS
We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
I know Julia has transformers code, and transformers seem to be taking over everything, but what about the variants? One problem was the quadratic complexity of attention, but many variants solve that, e.g. the one above and Longformer. See also the Performer variant and more. Are all of those missing in Julia?
B. This also seems like a big deal:
Neural Production Systems
[…] outperforming state-of-the-art methods using GNNs, and allows for the extrapolation from simple (few object) environments to more complex environments.
See about it here: