What would people want to see here: longer contexts for input, or for output? I think you only get one number, and it applies to both.
You mean for Julia? Models could be getting better for e.g. Python but not for Julia, and benchmarks are only benchmarks; I also see mentions of “contamination”, which makes the numbers unreliable, i.e. inflated so the models look good on the benchmark itself (but not out-of-distribution).
I would like to see Monarch Mixer models (and/or RWKV); they are the new thing that I believe will most likely take over, also for code.
Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512). That’s only about two pages of text, but documents can be very long –
[…]
code repositories, etc can be tens of thousands of tokens long (or more). Here, we’re taking a first step towards developing long-context retrieval models.
[I added bold above and below]
We build on Monarch Mixer (M2), a recent model family developing attention- and MLP-free BERT models, which are enabling long-context BERT models. Today, we’re releasing a preview of a few models: long-context versions of M2-BERT up to 32K context length, as well as […]
These models achieve state-of-the-art performance in MTEB showing comparable or even better accuracies than closed models. Additionally, M2-BERT retrieval models significantly outperform other closed models in long context retrieval tasks. This means you can now generate embeddings for long documents without splitting them into many short chunks while containing more meaningful contexts in the embeddings. You can also access these powerful models at very competitive prices (up to 4x cheaper) as seen in the pricing graph below.
Check out code here, and models up on HuggingFace here:
These models are also available on Together AI’s new embedding service – check it out here!
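For reference, here is roughly how I would expect to call the hosted models. This is only a sketch: I’m assuming the embedding service exposes an OpenAI-compatible /v1/embeddings endpoint and that the model id matches the HuggingFace name, so check Together’s docs for the actual endpoint and model names.

```python
import os
import requests

# Assumed endpoint and model id (taken from the announcement, not verified by me).
resp = requests.post(
    "https://api.together.xyz/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "togethercomputer/m2-bert-80M-32k-retrieval",
        "input": "one long document, up to ~32K tokens, embedded without chunking",
    },
)
# Assuming the usual OpenAI-style response shape.
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))
```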
Monarch matrices are a sub-quadratic primitive (you can compute them in O(N^(3/2))) that are also hardware-efficient and expressive. The block-diagonal matrices map onto tensor cores, and the permutations generalize the Fast Fourier Transform. As a result, Monarch matrices can efficiently capture all sorts of structured linear transforms:
Monarch matrices can capture many structured linear transforms, including Toeplitz, Fourier, Hadamard, and more.
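To make the block-diagonal + permutation structure concrete, here is a toy numpy sketch of one order-2 Monarch parameterization, M = P·B2·P·B1, with B1 and B2 block-diagonal and P the reshape-transpose permutation. The factor ordering and the names are my assumptions; the actual M2 code may parameterize it differently, but the O(N^(3/2)) cost comes out the same way.

```python
import numpy as np

def block_diag(blocks):
    # Dense block-diagonal matrix built from m blocks of size m x m (reference only).
    m = blocks[0].shape[0]
    out = np.zeros((len(blocks) * m, len(blocks) * m))
    for i, B in enumerate(blocks):
        out[i*m:(i+1)*m, i*m:(i+1)*m] = B
    return out

def monarch_apply(B1, B2, x):
    # Structured matvec: two block-diagonal multiplies plus two reshape-transpose
    # permutations, about 2*m*m^2 = 2*N^(3/2) multiply-adds for N = m^2.
    m = B1.shape[1]
    t = np.concatenate([B1[i] @ x[i*m:(i+1)*m] for i in range(m)])
    t = t.reshape(m, m).T.reshape(-1)            # permutation P
    t = np.concatenate([B2[i] @ t[i*m:(i+1)*m] for i in range(m)])
    return t.reshape(m, m).T.reshape(-1)         # permutation P again

m = 8                                            # N = 64
N = m * m
rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(2, m, m, m))           # m blocks of size m x m each
x = rng.normal(size=N)

# Dense reference: M = P @ blockdiag(B2) @ P @ blockdiag(B1), which costs O(N^2) to apply.
perm = np.arange(N).reshape(m, m).T.reshape(-1)
P = np.eye(N)[perm]
M = P @ block_diag(list(B2)) @ P @ block_diag(list(B1))
assert np.allclose(M @ x, monarch_apply(B1, B2, x))
```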
In M2, we use Monarch matrices to replace both attention and MLPs in Transformers. We replace attention by using Monarch matrices to construct a gated long convolution layer, similar to work like H3, Hyena, GSS, and BiGS. Specifically, Monarch matrices can implement the FFT, which can be used to compute a long convolution efficiently:
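The long convolution here is the standard FFT trick, y = iFFT(FFT(x) · FFT(k)), with a gate multiplied on afterwards. A rough numpy sketch of just that mixing step (the names are mine; the real M2 layer also has learned projections and, per the quote above, computes the FFT itself via Monarch matrices):

```python
import numpy as np

def fft_long_conv(x, k):
    # Causal long convolution in O(N log N) via FFT instead of O(N^2) directly;
    # zero-pad to 2N so the circular convolution doesn't wrap around.
    N = x.shape[0]
    L = 2 * N
    return np.fft.irfft(np.fft.rfft(x, n=L) * np.fft.rfft(k, n=L), n=L)[:N]

def gated_long_conv(x, k, gate):
    # Gated long-convolution mixer in the spirit of H3/Hyena/M2 (projections omitted).
    return gate * fft_long_conv(x, k)

N = 1024
rng = np.random.default_rng(0)
x, k, gate = rng.normal(size=(3, N))

# Sanity check against a direct O(N^2) causal convolution.
direct = np.array([np.dot(x[:t+1][::-1], k[:t+1]) for t in range(N)])
assert np.allclose(fft_long_conv(x, k), direct)
```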
I couldn’t try out those models on HF, nor the endpoint (I suppose I could pay; I’m looking for a web interface to test with). I suppose they’re not trained on code, though I’m unsure; they could be, and these models are still small, since they haven’t been scaled up, and that is costly for a new type of model, even though the time complexity is better. Transformers are O(n^2); these are O(n^1.5). Linear transformers exist (even before RWKV, which claims “infinite” context length) and are known to be fast(er), that being their point, while quality suffers (I believe also for RWKV, but I’ve not kept up with their updates). I hope, and it seems, that that doesn’t happen here, and these just become more efficient.
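To put rough numbers on that gap (ignoring constants, which matter a lot in practice), n^2 vs n^1.5 differs by a factor of sqrt(n):

```python
# Asymptotic ratio n^2 / n^1.5 = sqrt(n); constants and memory traffic ignored.
for n in (512, 4096, 32768):
    print(n, round(n**2 / n**1.5, 1))   # -> 22.6, 64.0, 181.0
```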
Better for text (and images), and thus for code too:
and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1× higher throughput at sequence length 4K. […] Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE – showing for the first time that it may be possible to match Transformer quality without attention or MLPs.
[Table omitted; intriguingly, M2’s throughput increases with larger context windows in absolute terms, not just in relative advantage (up to 9x), while it drops for the others at least up to 8192, if not OOM there.]
That OOM in HF BERT-base is particularly important (and FlashAttention BERT-base eventually OOMs as well). That means that any retriever with a Transformers-based BERT backbone will have trouble with long-context – that’s everything from sentence-BERT to ColBERT to BGE and more!
[Their bold here. The time complexity is quadratic, but I think the space complexity is too, which is why they OOM; it will eventually happen for M2 as well, just not as quickly.]
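For contrast, the usual workaround with a 512-token backbone is to chunk the document, embed each chunk, and pool the vectors, which is exactly the step a 32K-context embedder lets you skip. A generic sketch (not any particular library’s API):

```python
import numpy as np

def embed_long_document(tokens, embed_chunk, max_len=512):
    # Chunk-and-pool workaround for short-context embedding backbones:
    # split into max_len-token windows, embed each, then mean-pool the vectors.
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    return np.mean([embed_chunk(c) for c in chunks], axis=0)

# Toy stand-in encoder; a real one would be a 512-token BERT-style model.
fake_encoder = lambda chunk: np.full(4, float(len(chunk)))
print(embed_long_document(list(range(1300)), fake_encoder))
```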
GitHub is only 7.59% (95.16 GiB) of the Pile (and Julia a fraction of that…): GitHub - EleutherAI/the-pile. But at least it was trained on code back then; I’m not sure what the fraction is for the best code models, because you need e.g. English too.
Over the past six years, we’ve seen Transformers take the world by storm. [E.g. ChatGPT]
Are Transformers the only way to get this amazing performance?
Now, the first reason we’ve been poking around at this is because it’s really interesting! […] – hence the line of work in our lab looking into replacing attention with a sub-quadratic operator (S4, H3, Hyena, HyenaDNA). And we’re encouraged by the groundswell of work into new architectures for long sequences, from RetNet to RWKV, and positional interpolation – just to name a few!
Its paper was updated in December with:
Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
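The “either a Transformer or an RNN” duality is easiest to see with plain linear attention (this is the generic mechanism, not RWKV’s exact WKV formulation with its decay and bonus terms): the same output can be computed in parallel with a T×T matrix, or step by step with a constant-size state.

```python
import numpy as np

def phi(x):
    # Simple positive feature map (elu(x) + 1), common in linear-attention papers.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    # "Transformer mode": materialize the causal T x T interaction matrix.
    A = phi(Q) @ phi(K).T
    A *= np.tril(np.ones_like(A))                      # causal mask
    return (A @ V) / (A.sum(axis=1, keepdims=True) + 1e-9)

def linear_attention_recurrent(Q, K, V):
    # "RNN mode": constant-size state per step, no T x T matrix.
    d = Q.shape[1]
    S = np.zeros((d, V.shape[1]))                      # running sum of phi(k) v^T
    z = np.zeros(d)                                    # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-9)
    return out

T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```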
https://wiki.rwkv.com/
From 9 hours ago (v6 in training):
- Ranks as the world’s greenest 7B model (per token)
- Trained on 1.1 Trillion Tokens across 100+ languages (70% English, 15% multi lang, 15% code)
- Outperforms all 7B class models in multi-lingual benchmarks
- Approaches Falcon (1.5T), LLaMA2 (2T), Mistral (>2T?) level of performance in English evals
Smaller and slightly older models (and a larger “14B model / 7B 2T model”) are also available (and an 8x7B MoE model is scheduled): RWKV/rwkv-4-world-1b5 · Hugging Face
I’m guessing the MTEB benchmark might be relevant, or will be once extended to code:
- Multilinguality MTEB contains multilingual classification, STS and bitext mining datasets. […] Further, MTEB does not contain any code datasets that could be used to benchmark code models (Neelakantan et al., 2022; Allal et al., 2023). It should be easy to extend MTEB with datasets, such as CodeSearchNet