AI bubble: time to panic? Perhaps not yet... maybe now

But are they really? You can already run very good models on your local GPU, and that’s with e.g. 4-bit quantization (already mainstream in open-source models), which is itself arguably outdated: floats are no longer needed for the weights, and 2-bit or lower is coming, radically simplifying hardware and lowering energy use for running/inference and for training. That’s practical for 3B+ models (at least for Transformers), so basically all mainstream models until recently (except maybe on mobile phones, and even they can run some models).
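
To make the quantization point concrete, here is a minimal sketch of symmetric round-to-nearest 4-bit weight quantization in Julia. It is a toy scheme just to show the storage arithmetic; real methods (GPTQ, AWQ, etc.) are more sophisticated, and the function names here are mine:

```julia
# Toy symmetric round-to-nearest 4-bit quantization of a weight matrix.
# The storage arithmetic is the point: 4-bit ints are 4x smaller than Float16
# weights, and 2-bit would halve that again.

function quantize_4bit(W::AbstractMatrix{<:Real})
    # one scale per output row ("per-channel"), mapping weights into -7..7
    scales = max.(maximum(abs.(W), dims=2) ./ 7, eps(Float32))
    Q = round.(Int8, clamp.(W ./ scales, -7, 7))   # 4-bit values, held in Int8 here
    return Q, scales
end

dequantize(Q, scales) = Q .* scales

W = randn(Float32, 4096, 4096)                 # one layer's weights
Q, s = quantize_4bit(W)
err = maximum(abs.(W .- dequantize(Q, s)))     # at most about half a quantization step

# Memory: 4096^2 weights ≈ 32 MiB if stored in Float16 (64 MiB in Float32),
# vs ≈ 8 MiB at 4 bits (real kernels pack two weights per byte) plus tiny scales.
```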

The Chinchilla scaling laws are outdated: they only take training cost into account, not inference. According to a later paper, if you also take running the model, i.e. inference, into account, then smaller models turn out to be the more cost-effective ones. [Also, smaller AND better models have already been made for Julia, see below.] That paper on cost-optimal LLMs, however, (probably) assumes Transformers (still mainstream, but on the way out). Since both papers, and Transformers themselves, are being superseded as we speak, it will be very interesting to see an updated paper; at the very least, running (and training) is getting much less expensive, i.e. tokens per second are going up because of better algorithms (not just better hardware like Groq’s).
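
A rough back-of-the-envelope of why counting inference changes the picture, using the standard ~6ND training-FLOPs and ~2ND inference-FLOPs-per-token approximations. The model sizes, token counts, and the assumption that the two models reach similar quality are purely illustrative, not taken from either paper:

```julia
# Compare total compute (training + lifetime inference) for two model sizes,
# assuming (as the inference-aware scaling papers argue) that a smaller model
# trained on more tokens can match the quality of a Chinchilla-optimal one.

train_flops(N, D)      = 6 * N * D       # ~6 FLOPs per parameter per training token
infer_flops(N, tokens) = 2 * N * tokens  # ~2 FLOPs per parameter per generated token

served = 2e12   # assumed lifetime inference demand: 2 trillion tokens

big   = train_flops(70e9, 1.4e12) + infer_flops(70e9, served)  # Chinchilla-style: 70B on 1.4T tokens
small = train_flops(13e9, 5.0e12) + infer_flops(13e9, served)  # "over-trained": 13B on 5T tokens

println("70B total: ", big,   " FLOPs")
println("13B total: ", small, " FLOPs")
println("ratio:     ", round(big / small, digits=2))   # ≈ 2x under these assumptions
```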

The cost to train GPT-4 was more than $100 million, as CEO Sam Altman stated (likely not just the compute, though compute seems to be over 70% of that cost). By now you can train a good model (not just fine-tune one) for under $0.1 million [likely way less, actually], according to the MIT-IBM Watson AI Lab (and others, e.g. from Princeton) behind the paper below (probably counting only the compute cost, and likely still too high, since better, more efficient models are already out). So this AI report does not seem to keep up with the latest, less costly developments:

3. Frontier models get way more expensive.
[…] while Google’s Gemini Ultra cost $191 million for compute […]
7. The data is in: AI makes workers more productive and leads to higher quality work. […]

Public Sentiment Dips Negative […]

JetMoE: Reaching Llama2 Performance with 0.1M Dollars https://arxiv.org/pdf/2404.07413.pdf

This report introduces JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, the JetMoE-8B demonstrates impressive performance, with JetMoE-8B outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought.
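
A quick sanity check on that headline number, assuming a typical H100 rental rate of $2 to $3 per GPU-hour (my assumption, not from the paper):

```julia
gpu_hours = 30_000                  # H100 GPU-hours, from the JetMoE report
for rate in (2.0, 2.5, 3.0)         # assumed $/GPU-hour rental prices
    println("at \$", rate, "/h: \$", round(Int, gpu_hours * rate), " of compute")
end
# => roughly $60k to $90k, i.e. consistent with "less than $0.1 million"
```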

And the cost is going to plummet further than the already roughly 1000x reduction; seemingly it already has.

Transformers are on the way out, being too costly, replaced fully or largely in some of the hybrid models I’m excited about, based on Mamba:

In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length.
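
The memory argument is easy to see in code: a Transformer’s KV cache grows with context length, while a Mamba-style state-space layer only carries a fixed-size recurrent state. Here is a minimal (non-selective, diagonal) linear SSM scan in Julia, just to show the shape of the computation; Mamba itself makes the parameters input-dependent and uses a hardware-aware parallel scan:

```julia
# Minimal diagonal linear state-space layer: h_t = a .* h_{t-1} + b .* x_t,
# y_t = sum(c .* h_t). The recurrent state `h` has a FIXED size no matter how
# many tokens have been processed, unlike a Transformer KV cache, which grows
# linearly with context length.

function ssm_scan(x::AbstractVector{<:Real}; d_state::Int = 16)
    a = fill(0.95, d_state)          # per-channel decay (toy values)
    b = randn(d_state)               # input projection
    c = randn(d_state)               # output projection
    h = zeros(d_state)               # the entire "memory" of the layer
    y = similar(x, Float64)
    for (t, xt) in enumerate(x)
        h .= a .* h .+ b .* xt       # constant-time, constant-memory update
        y[t] = sum(c .* h)
    end
    return y
end

y = ssm_scan(randn(256_000))         # a 256K-token context: the state is still 16 numbers
```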

Researchers from Microsoft introduced SiMBA, a new architecture featuring Einstein FFT (EinFFT) for channel modeling.
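
I haven’t studied the SiMBA paper in detail, so purely to illustrate the general idea of Fourier-domain channel mixing (this is not the paper’s actual EinFFT formulation; the weighting is a toy choice of mine, and FFTW.jl is assumed to be installed):

```julia
using FFTW   # provides fft / ifft

# Toy Fourier-domain channel mixer: transform the channel dimension, scale each
# frequency with a "learned" complex weight, transform back. Only the generic
# idea; EinFFT adds block-diagonal Einstein multiplications and nonlinearities.

function fourier_mix(X::AbstractMatrix{<:Real}, w::AbstractVector{<:Complex})
    F = fft(X, 1)            # FFT along the channel dimension (dim 1)
    F .*= w                  # per-frequency gate
    return real(ifft(F, 1))  # take the real part (toy simplification)
end

channels, seqlen = 64, 128
X = randn(channels, seqlen)
w = randn(ComplexF64, channels)   # "learned" weights, random here
Y = fourier_mix(X, w)             # same shape as X
```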

I don’t have cost figures for these (Jamba, SiMBA), or for this one (which I just found, only days old):

One of the most impressive achievements of **Zamba-7B** is its remarkable training efficiency. The model was developed by a team of just seven researchers over a period of 30 days, using 128 H100 GPUs. The team trained the model on approximately 1 trillion tokens extracted from open web datasets. The training process involved two phases, beginning with lower-quality web data and then transitioning to higher-quality datasets. This strategy not only enhances the model’s performance but also reduces overall computational demands.

In comparative benchmarks, Zamba-7B performs better than LLaMA-2 7B and OLMo-7B. It achieves near-parity with larger models like Mistral-7B and Gemma-7B while using fewer data tokens, demonstrating its design efficacy.

30 days × 24 h × 128 GPUs ≈ 92,160 H100 GPU-hours, i.e. roughly 3× JetMoE’s 30,000 GPU-hours, so not actually cheaper than JetMoE on raw compute, though still a tiny fraction of the frontier-model budgets quoted above.
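
A rough sketch of that arithmetic (the hourly H100 price is my assumption, not from either report):

```julia
zamba_gpu_hours  = 128 * 30 * 24      # 128 H100s for 30 days = 92_160 GPU-hours
jetmoe_gpu_hours = 30_000             # from the JetMoE report quoted above
println("Zamba-7B: ", zamba_gpu_hours, " GPU-hours, ",
        round(zamba_gpu_hours / jetmoe_gpu_hours, digits=1), "x JetMoE")
# At an assumed ~$2.5/GPU-hour: ~$230k for Zamba vs ~$75k for JetMoE,
# both still orders of magnitude below the $100M-class runs quoted above.
```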

Out today, claimed very good:

This one might also be interesting for Julia:

Vezora/Mistral-22B-v0.1 · Hugging Face
WizardLM 2
mistralai/Mixtral-8x22B-Instruct-v0.1 · Hugging Face
alpindale/WizardLM-2-8x22B · Hugging Face
mistral-community/Mixtral-8x22B-v0.1 · Hugging Face

Code generation benchmarks: HumanEval, MBPP, BabelCode (C++, C#, Go, Java, JavaScript, Kotlin, Python, Rust)

Julia isn’t named there, but Julia is part of Google’s BabelCode at least, and is very plausibly also part of Google’s CodeGemma training data, to at least a useful degree:

The paper on BabelCode:
Measuring The Impact Of Programming Language Distribution https://arxiv.org/pdf/2302.01973.pdf

Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language
[…]
We replace Python-specific terms with their equivalent names in the target language. For tasks formulated as code-completion, we support formatting the problem description as a native docstring […] For example, in Julia, the ”$” in docstrings will raise errors if not properly escaped. Thus, we implement methods to automatically handle such cases and ensure correctness. […]
We further consider the bottom 7 languages to be low-resource(LR): Dart, Lua, Rust, C#, R, Julia, and Haskell […]
We also observe that performance on languages do not scale with respect to their resource level nor the model’s size. C#, Dart, Julia, and Haskell have significantly higher gains when scaling to 4B model size when compared to the other languages. While this may be due to the increased number of training tokens, it is not consistent across all LR language
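
The Julia-specific `$` issue mentioned in the excerpt is easy to demonstrate: docstrings are ordinary string literals, so `$` interpolates inside them, and a literal dollar sign has to be escaped (the function here is just a made-up example):

```julia
# Docstrings are string literals, so `$` interpolates inside them; a literal
# dollar sign has to be written as \$ or the generated docstring breaks.

"""
    price_label(x)

Format `x` as a price in \$ (US dollars).
"""
price_label(x) = string("\$", x)

price_label(4.99)   # "$4.99"

# Without the backslash ("... a price in $ (US dollars)") Julia would try to
# interpolate right after the `$` and error, which is exactly the failure mode
# the BabelCode paper says it has to handle automatically for Julia.
```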

Julia seemingly has the highest gain, despite being the second-lowest-resource language (only Haskell is lower): 0.03% of the training data vs. 36.95% for Java and 16.80% for Python. See Tables 4 and 5 (Python missing there?!) and Figure 6: “Mean relative difference of pass@k for each of the models trained on the different Unimax distributions compared to the pass@k of the same sized model trained on the Natural distribution […]”

Julia gains from a larger model, up to a point at least, i.e. up to the 4B size: 72 questions passed vs. only 5 for the 8B model (and also an improvement over the 62B model). But that seems to be not only because of model size; those larger models are the older PaLM and PaLM Coder. See Table 14. I count that as a 14.4x improvement (72/5; though perhaps not the best metric, and not what they highlight in the other tables) with a half-sized model, unlike for most other languages, though R and Haskell also show good gains.
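
For reference, pass@k (the metric behind those tables and Figure 6) is the standard unbiased estimator from the HumanEval paper, which is short to write down in Julia:

```julia
# Unbiased pass@k estimator: given n samples per problem of which c pass,
# the probability that at least one of k randomly drawn samples passes is
#   pass@k = 1 - C(n-c, k) / C(n, k)
# computed here in the numerically stable product form.

function pass_at_k(n::Int, c::Int, k::Int)
    c == 0      && return 0.0
    n - c < k   && return 1.0
    return 1.0 - prod(1.0 - k / i for i in (n - c + 1):n)
end

pass_at_k(200, 37, 1)    # ≈ 0.185
pass_at_k(200, 37, 100)  # much higher: almost certain to hit one of the 37 passing samples
```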
