Community Interest Check: LLMs from Scratch in Pure Julia

Those are huggingface’s terminology, and they are mostly regular MLP/Attention/GELU. They prefer to define a new class for each component of a model instead of defining/using compositional classes. This is also the reason why we need to register a loader for each model in Julia: the class layout affects the structure of the state_dict, so we need to manually align the layouts. The layers defined in Transformers.jl are designed to be as composable as possible to reduce the need to define new structs when registering a new loader.
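To illustrate the point (a conceptual sketch only, not the actual Transformers.jl loader API; the struct and key names are made up): a small vocabulary of composable layers can cover many HuggingFace models, and a per-model loader is then mostly a mapping from state_dict keys onto that composition.

```julia
# Conceptual sketch — not the real Transformers.jl API; names are invented.
struct Dense{W,B}
    weight::W
    bias::B
end
(d::Dense)(x) = d.weight * x .+ d.bias   # one composable building block

# Registering a loader then mostly means aligning state_dict keys with the
# composed layout, e.g. for one linear sub-layer of some HF checkpoint:
load_dense(state_dict, prefix) = Dense(state_dict["$prefix.weight"],
                                       state_dict["$prefix.bias"])
```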

There are a few things on my list of priorities. Two major parts I’m currently slowly working on are splitting out the wordpiece/sentencepiece tokenizer into a separate package and a GPU abstraction for attention. Besides that, I would also like to enhance HuggingFaceApi.jl to use huggingface’s dataset viewer API and use DuckDB.jl to load those processed datasets. Unfortunately, my current bandwidth is mostly allocated to surviving and job hunting, so you probably won’t be able to see them in the near future.
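Roughly what I have in mind for the last point (just a sketch of the idea, not existing HuggingFaceApi.jl functionality; the parquet URL is a placeholder): the dataset viewer API exposes processed datasets as parquet files, which DuckDB can query directly over HTTP.

```julia
using DuckDB, DBInterface

# In-memory DuckDB database; the httpfs extension enables reading remote files.
con = DBInterface.connect(DuckDB.DB, ":memory:")
DBInterface.execute(con, "INSTALL httpfs;")
DBInterface.execute(con, "LOAD httpfs;")

# Placeholder URL — in practice it would come from the dataset viewer API.
url = "https://huggingface.co/.../train/0000.parquet"
rows = DBInterface.execute(con, "SELECT * FROM read_parquet('$url') LIMIT 10")
```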

One package I would love to see, and which is surely beyond my scope, is a better data loader design/interface with distributed support. I have only roughly scanned through them, so this might not be precise, but it seems the data loaders we have are relatively naive compared to the distributed data loaders in pytorch or huggingface.
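For what I mean by “distributed support” (a toy sketch of the sharding idea behind e.g. pytorch’s DistributedSampler, not a proposed API): every rank builds the same shuffled permutation and then only iterates its own slice of it.

```julia
using Random

# Deterministic shuffle shared by all ranks, then a rank-local slice,
# so the workers together cover each index exactly once per epoch.
function shard_indices(n, rank, nworkers; seed=0)
    idx = shuffle(MersenneTwister(seed), 1:n)
    return idx[rank+1:nworkers:end]
end

shard_indices(10, 0, 4; seed=1)   # the subset of indices rank 0 would load
```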

6 Likes

I will be back; I want to do the fastest Llama2.jl GPU support I can, I just couldn’t finish it yet.

I am curious how fast things would get if we fused most kernel calls (I keep hearing it is already done, but I want to double-check it).
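For the elementwise parts, Julia’s broadcast fusion on CUDA.jl already gives us this essentially for free (trivial sketch below); the real question is fusing the attention kernels themselves, FlashAttention-style, which needs hand-written kernels.

```julia
using CUDA

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)

# The whole fused broadcast compiles to a single GPU kernel launch,
# instead of one launch per elementwise operation.
z = @. max(0f0, x * y + 0.5f0)
```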

2 Likes

Since December before my Christmas holiday, my thinking on what we need to do has completely transformed.

Just generating tokens one by one has known limitations.
https://arxiv.org/pdf/2404.19737

Better & Faster Large Language Models via Multi-token Prediction

Gains are especially pronounced on generative benchmarks like coding […]
Our 13B parameter models solves 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3× faster at inference, even with large batch sizes.
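As I understand the paper, the architecture is simply n independent output heads on top of one shared trunk, each predicting the token i steps ahead, trained jointly. A heavily simplified Flux sketch of that idea (sizes and names made up, not the paper’s code):

```julia
using Flux

d, vocab, n = 64, 1000, 4                   # toy sizes; n = number of future tokens
trunk = Chain(Dense(d => d, gelu))          # shared representation (stand-in for the transformer body)
heads = [Dense(d => vocab) for _ in 1:n]    # one output head per offset t+1 … t+n

# `targets[i]` is the one-hot matrix of the (t+i)-th tokens for the batch.
function mtp_loss(x, targets)
    h = trunk(x)                            # compute the shared trunk once
    sum(Flux.logitcrossentropy(heads[i](h), targets[i]) for i in 1:n)
end
```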

But what caught my attention most wasn’t the multi-token prediction as in DeepSeek (which is why I looked up this research from April, by e.g. Meta FAIR/“Facebook”), but rather the multi-byte prediction (and Table 1), which isn’t mentioned in the abstract. With multi-byte prediction it finally becomes very practical to get rid of tokens (which should be a goal, since they have lots of problems):

Section 3.3 demonstrates how multi-token prediction promotes learning longer-term patterns, a fact most apparent in the extreme case of byte-level tokenization.

We believe this usefulness only at scale to be a likely reason why multi-token prediction has so far been largely overlooked

3.3. Learning global patterns with multi-byte prediction

To show that the next-token prediction task latches to local patterns, we went to the extreme case of byte-level tokenization by training a 7B parameter byte-level transformer on 314B bytes, which is equivalent to around 116B tokens. The 8-byte prediction model achieves astounding improvements compared to next-byte prediction, solving 67% more problems on MBPP pass@1 and 20% more problems on HumanEval pass@1.

Multi-byte prediction is therefore a very promising avenue to unlock efficient training of byte-level models. Self-speculative decoding can achieve speedups of 6 times for the 8-byte prediction model, which would allow to fully compensate the cost of longer byte-level sequences at inference time and even be faster than a next-token prediction model by nearly two times. The 8-byte prediction model is a strong byte-based model, approaching the performance of token-based models despite having been trained on 1.7× less data.

[I added bold. Note how 8-byte is always best in Table 1 for byte-level, while 4-token is usually best for tokens, though sometimes 6-token, and you likely need to choose one fixed n; is 8, or maybe 7 or 9, better for bytes…? Also, I thought you always had 1 or more epochs, so what does 0.5 epochs really mean there?! It seems they stopped the byte-level experiment prematurely; a full epoch or more, training on as much data rather than just the “116B token”-equivalent amount, would have been a fairer comparison and would likely show it in an even better light.]

7. Conclusions

[…] We posit that our method reduces distribution mismatch between teacher-forced training and autoregressive generation. When used with speculative decoding, exact inference gets 3 times faster.[…]

Also, optimal vocabulary sizes for multi-token prediction are likely different from those for next-token prediction, and tuning them could lead to better results, as well as improved trade-offs between compressed sequence length and compute-per-byte expenses.

Interesting, but I’m not yet sure if it’s a step forward or backwards (since it thinks in thought vectors, not words, moving even further from interpretability, and maybe further away from how the brain thinks, unless the brain uses multi-dimensional vectors too, which is highly plausible):

Training Large Language Models to Reason in a Continuous Latent Space

we introduce a new paradigm Coconut (Chain of Continuous Thought). […]
Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference.

I’m not sure if “thinking tokens” are some specialized tokens; I rather think they are simply the tokens generated while “thinking”, which OpenAI doesn’t show you but DeepSeek does, I believe (I guess you can disable/enable showing the thinking process). There are likely some tokens like <thought> ... </thought> so that the user interface knows what NOT to show in between; I see now they’re called <bot> for beginning-of-thought and <eot> for end-of-thought. That might conflict with using bytes only rather than tokens (also for visual models): then it’s unclear what to show, if every byte sequence is potentially legal UTF-8 (use some illegal byte sequences?).
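Just to make the UI-side idea concrete (a hedged sketch; the <bot>/<eot> marker strings come from the Coconut paper, the function itself is made up):

```julia
# Hide everything between the beginning-of-thought and end-of-thought markers
# before displaying the model's reply to the user.
function strip_thought(reply::AbstractString; bot="<bot>", eot="<eot>")
    i = findfirst(bot, reply)
    j = findlast(eot, reply)
    (i === nothing || j === nothing) && return reply
    return reply[1:prevind(reply, first(i))] * reply[nextind(reply, last(j)):end]
end

strip_thought("Sure.<bot>step 1, step 2, backtrack…<eot> The answer is 42.")
# → "Sure. The answer is 42."
```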

I will not summarize; it’s best to read the full blog:

We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data.
[…]
For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT’s specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.
[…]
ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.

Paper updated in December:

PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.

ReLU seems to be back, when done right with dReLU, and I believe sparsification is very important:

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

https://arxiv.org/pdf/2406.05955

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation.
To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. […]
Evaluation results demonstrate that this sparsity achieves a 2-5× decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second.
Our models are available at PowerInfer (PowerInfer).

To address these challenges, we first conduct a comprehensive analysis of the existing ReLUfication approach and identify that its shortcomings stem from the negative activations in the GLU component.
Therefore, we propose an efficient activation function named dReLU. We apply dReLU in the pretraining of small-scale LLMs, alongside SwiGLU, and our findings indicate that LLMs using dReLU match the performance of those using SwiGLU, while also achieving close to 90% sparsity.
Additionally, we collect a diverse range of pretraining corpora from the open-source community, including web, code, and mathematical datasets, to enhance the effectiveness of ReLUfication.
Meanwhile, we also conduct a sparsity analysis on MoE-based LLMs. Interestingly, we observe that the feed-forward networks (FFNs) within the experts remain sparsely activated, similar to the behavior exhibited by dense LLMs. This phenomenon suggests an opportunity to further accelerate inference speed by combining MoE techniques with ReLU-based sparse activation.

The key contributions of this paper include:
• Efficient dReLU activation function: Our method utilizes fewer than 150B tokens, representing less than 1% of the typical pretraining tokens (commonly 15T tokens [11]).
• Sparse activated models: We will release our sparsely-activated TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B models. Both models demonstrate better performance compared to the original versions.
• Practical inference speedup: Evaluation shows that with our models, we can achieve a 2-5× speedup. Notably, we can achieve up to 10 tokens/s even without a GPU on TurboSparse-Mixtral-47B

Figure 3. The figure shows that after ReLUfication, the combined activation becomes more concentrated around 0, with the sparsity increasing to 67%.
This can be attributed to the ReLU activation function applied after the gate weight, which masks all negative activations to zero.

3.2 dReLU

We introduce a new activation function, named dReLU (Equation 2), where ReLU is applied after both the up- and gate-projections:

$$\mathrm{Combined_{dReLU}}(x) := \max(0,\ xW_{\mathrm{gate}}) \ast \max(0,\ xW_{\mathrm{up}}) \tag{2}$$
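In Julia this gated FFN block is only a few lines; a minimal sketch of Equation 2 (not the paper’s code: plain dense matrices, made-up sizes, and column-vector convention, so xW becomes W*x):

```julia
# dReLU gated feed-forward block: ReLU after both the gate- and up-projections,
# elementwise product, then the down-projection.
relu(x) = max.(zero(eltype(x)), x)

struct DReLUFFN{M<:AbstractMatrix}
    W_gate::M
    W_up::M
    W_down::M
end

function (ffn::DReLUFFN)(x)
    h = relu(ffn.W_gate * x) .* relu(ffn.W_up * x)   # Combined_dReLU(x)
    return ffn.W_down * h
end

# Toy usage (hidden size 8, intermediate size 16, random weights):
ffn = DReLUFFN(randn(Float32, 16, 8), randn(Float32, 16, 8), randn(Float32, 8, 16))
x   = randn(Float32, 8)
h   = relu(ffn.W_gate * x) .* relu(ffn.W_up * x)
count(iszero, h) / length(h)   # ≈ 0.75 here: both ReLU masks must be nonzero
```

The last line is also roughly what “activation sparsity” means in the paper: the fraction of entries in the gated hidden vector that are exactly zero.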

Inspired by our discoveries in MoE models, we are convinced that ReLUfication can be extended to MoE models and is not restricted to dense models. As the proportion of FFN weights in MoE models increases, the FLOP reduction achieved through ReLUfication will be even more pronounced.

@mantzaris, I may add even more links here when I find them again…
But at least one more paper I just discovered that seems like a big deal:

Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks

https://arxiv.org/pdf/2405.05097

Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). Simultaneously, in contrast to them: biological have fundamentally multidirectional signal propagation [1], also of probability distributions e.g. for uncertainty estimation, and are believed not being able to use standard backpropagation training [2]. There are proposed novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) allowing to remove the above low level differences: with neurons containing local joint distribution model (of its connections),
[…]
we get simple formulas for e.g. conditional expected values for propagation in any direction, like E[x|y, z], E[y|x], which degenerate to KAN-like parametrization if restricting to pairwise dependencies. Such HCR network can also propagate probability distributions (also joint) like ρ(y, z|x). It also allows for additional training approaches, like direct (a_j) estimation, through tensor decomposition, or more biologically plausible information bottleneck training: layers directly influencing only neighbors, optimizing content to maximize information about the next layer, and minimizing about the previous to remove noise, extract crucial information.

Keywords: machine learning, neural networks, Kolmogorov-Arnold Network, joint distribution, conditional distribution, Bayesian Neural Networks, tensor decomposition, mutual information, information bottleneck approach, HSIC

Biological neurons use complex propagation of action potentials, travelling in both directions of e.g. axons: ”it is not uncommon for axonal propagation of action potentials to happen in both directions” [1].

[My computer froze while editing this, so I’m posting in case it happens again and the web browser crashes… I intend to add more.]

I was looking into Jarek Duda again; he’s famous for other things, namely inventing the new ANS compression. I wasn’t expecting this, nor what he references as collaborative works:

[13] J. Duda and G. Bhatta, “Gamma-ray blazar variability: new statistical methods of time-flux distributions,” Monthly Notices of the Royal Astronomical Society, vol. 508, no. 1, pp. 1446–1458, 2021.
[14] J. Duda and S. Podlewska, “Prediction of probability distributions of molecular properties: towards more efficient virtual screening and better understanding of compound representations,” Molecular Diversity, pp. 1–12, 2022.
[15] J. Duda and G. Bhatta, “Predicting conditional probability distributions of redshifts of active galactic nuclei using hierarchical correlation reconstruction,” Monthly Notices of the Royal Astronomical Society, p. stae963, 2024.

4 Likes

Great references again! Lots of new things to read up on.

1 Like