consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs.
You can try out such 2- or 3-bit models here:
But I’m a little confused: the 2- and 3-bit models seem to be the same size as each other, and the same as the uncompressed Llama, I think. What I think is happening is that since it uses the safetensors format (which Julia has support for with a package), the format has not caught up with (such) quantization, i.e. you are storing a lot of zeros. There must be some way to store the weights without them (i.e. packed/compressed), or is that not yet developed? This then only shows that the quantization, per se, works well, and most likely you wouldn’t be running it with the optimized code that takes it into account?
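As an aside, here is a minimal sketch (in Julia, with hypothetical helper names) of what “packing” would mean: storing four 2-bit codes per byte is what actually shrinks the file on disk, whereas writing the quantized values back out as 16-bit tensors keeps the file at full size.

# Hypothetical helpers, only to illustrate the storage argument above.
function pack2bit(codes::Vector{UInt8})               # each code in 0:3
    @assert all(c -> c <= 0x03, codes)
    pad = (4 - length(codes) % 4) % 4
    padded = vcat(codes, zeros(UInt8, pad))
    packed = Vector{UInt8}(undef, length(padded) ÷ 4)
    for i in eachindex(packed)
        b = 0x00
        for j in 0:3                                  # four 2-bit codes per byte
            b |= padded[4(i - 1) + j + 1] << (2j)
        end
        packed[i] = b
    end
    return packed
end

unpack2bit(packed::Vector{UInt8}, n) =
    [(packed[(i - 1) ÷ 4 + 1] >> (2 * ((i - 1) % 4))) & 0x03 for i in 1:n]

codes = UInt8.(rand(0:3, 10))
unpack2bit(pack2bit(codes), length(codes)) == codes   # true; and 10 codes fit in 3 bytes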
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
https://arxiv.org/pdf/2405.04532
Anything that gets to or below 4-bit (without degradation) is a major deal, but it’s not the whole story, even 1-bit isn’t the limit, you can even prune whole layers:
Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question.[…] We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. […]
In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation
[only major savings there] Activations can take up a significant amount of memory during training
We may be at the limit for activations already, so the focus is now shifting to quantization of the KV-cache.
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner
Transformer models are mainstream (for LLMs and much else, e.g. for computer vision), while Diffusion models are mainstream for generating images and video. Hybrids of the two are very intriguing, for multi-modal, and here:
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization
In short, no, I think, for quantized models, and it might even be limited to FP32 in practice, but more on this at the bottom. I don’t think Float32 is used at all anymore in any mainstream or research work (since it’s 2x, or 4x+, slower), except by Julia still?! But even Julia’s Float16 might be ok for KANs:
My thinking is: if we want to do a “Julia LLM from scratch” project at all, then for training (and even for just inference) it needs to be competitive, using recent methods, or why bother? Or even better, be ahead of the curve, e.g. by using KAN. It’s neither needed nor even helpful to implement old ideas if they take much longer to train.
Quantization has been the go-to method for faster inference (it allows more parameters in the same fixed amount of [GPU/TPU] memory), and I believe by now it can actually also help for training.
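For a rough sense of why that matters for memory (illustrative numbers only, ignoring the KV-cache, activations and quantization metadata such as scales):

n_params = 7e9                          # e.g. a 7B-parameter model
for bits in (16, 8, 4, 2)               # bf16 down to 2-bit weights
    gib = n_params * bits / 8 / 2^30
    println("$(bits)-bit weights: ≈ $(round(gib, digits = 1)) GiB")
end
# ≈ 13.0, 6.5, 3.3, 1.6 GiB, so the 4-bit weights of a 7B model fit on an 8 GB GPU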
So, I’ve been thinking does it or can it even apply to KAN (or other new ideas, I do not want to lead us down the wrong path), it seems so:
Hardware Acceleration of Kolmogorov–Arnold Network (KAN) for Lightweight Edge Inference
Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters by parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. Evaluating the B-spline functions can be performed by using look-up tables (LUTs) to directly map the B-spline functions, thereby reducing computational resource requirements. However, this method still requires substantial circuit resources (LUTs, MUXs, decoders, etc.). For the first time, this paper employs an algorithm-hardware co-design methodology to accelerate KAN. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware aware quantization, KAN sparsity aware mapping strategy, and circuit-level techniques […]
with analog-CIM (ACIM) circuits. The impact of non-ideal effects, such as partial sum errors caused by the process variations, has been evaluated with the statistics measured from the TSMC 22nm RRAM-ACIM prototype chips. With the best searched hyperparameters of KAN and the optimized circuits implemented in 22 nm node, we can reduce hardware area by 41.78x, energy by 77.97x with 3.03% accuracy boost compared to the traditional DNN hardware
We of course don’t have access to exotic analog prototype chips, but I think that’s ok, we could start with non-quantized KANs, as has been done in Python (with only 285 lines in the code below) and then later add quantization to 8-bit also just on regular hardware, otherwise as done here:
We propose an Alignment-Symmetry and PowerGap KAN hardware aware quantization that, for the first time, investigates the interaction between quantization grid and knot grid in KAN. The proposed method significantly minimizes the cost of LUTs, MUXs, decoders for B(X) function.
[…]
our focus is on accelerating the w_s·spline(x) computation. In our implementation, w_s is multiplied with c_i and becomes c_i′, then is quantized to 8-bit, transforming the formula to equation (3)
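In Julia that step might look roughly like the following: a minimal sketch where the spline weight is folded into the coefficients and the result is quantized to Int8 with a simple symmetric per-tensor scale (the scale scheme is my assumption, not taken from the paper).

function quantize_coeffs(ws::Float32, c::Vector{Float32})
    c′ = ws .* c                                        # c_i' = w_s * c_i
    scale = max(maximum(abs, c′), eps(Float32)) / 127   # map the largest |c_i'| to ±127
    q = round.(Int8, clamp.(c′ ./ scale, -127, 127))
    return q, scale                                     # dequantize later with q .* scale
end

c = randn(Float32, 16)                                  # toy B-spline coefficients
q, s = quantize_coeffs(0.5f0, c)
c_approx = Float32.(q) .* s                             # close to 0.5f0 .* c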
Note KANs are a drop-in replacement for MLPs, i.e. a critical part of transformers and more.
It might seem like going down to only 8 bits is not good, 2x to 8x larger than competing mainstream quantization, but it’s a win if the parameter count is reduced by at least as much, e.g. 8x+. And I think it must already have been considered a win (for space) before quantization was introduced, given “orders of magnitude fewer parameters”; so it is likely a huge win for space; and for compute?
For backpropagation purposes, it is also convenient to store the output derivative with respect to the inputs:
However, all activation functions are linear combination of a fixed set of basis functions which are B-splines; given that, we can reformulate the computation as activate the input with different basis functions and then combine them linearly. This reformulation can significantly reduce the memory cost and make the computation a straightforward matrix multiplication, and works with both forward and backward pass naturally.
The problem is in the sparsification which is claimed to be critical to KAN’s interpretability.
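To make the quoted reformulation concrete, here is a minimal Julia sketch (my own simplification, using piecewise-linear “hat” bases on a uniform grid instead of the cubic B-splines KAN actually uses): evaluate a fixed set of basis functions on the inputs, then combine them with a learnable coefficient matrix in a single matrix multiplication, plus the SiLU shortcut.

silu(x) = x / (1 + exp(-x))                      # the SiLU/Swish shortcut activation used in KAN

struct KANLayerSketch
    grid::Vector{Float32}                        # knot positions, length G
    C::Matrix{Float32}                           # d_out × (d_in * G) learnable spline coefficients
    W::Matrix{Float32}                           # d_out × d_in shortcut (base) weights
end

hat(x, g, h) = max(0f0, 1f0 - abs(x - g) / h)    # stand-in basis function centred at knot g

function (l::KANLayerSketch)(x::Vector{Float32})
    h = l.grid[2] - l.grid[1]
    B = vec([hat(xj, g, h) for g in l.grid, xj in x])   # basis expansion, length d_in * G
    return l.C * B .+ l.W * silu.(x)             # one matmul for the spline part, plus the shortcut
end

d_in, d_out, G = 4, 3, 8
layer = KANLayerSketch(collect(range(-1f0, 1f0, length = G)),
                       0.1f0 .* randn(Float32, d_out, d_in * G),
                       0.1f0 .* randn(Float32, d_out, d_in))
y = layer(randn(Float32, d_in))                  # d_out-vector

Both C and W would be the trainable parameters here; the forward pass is the straightforward matrix multiplication the quote describes.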
Activations were the bottleneck before; with KAN or anything newer, the bottleneck might shift. Still, the current mainstream quantization (and pruning and other sparsification) might work for it, and complement KAN, which might only be used for some of the parameters.
My other worry is: should we go for something more brain-inspired (e.g. the brain provably doesn’t use backpropagation, it’s considered not plausible there, though I’ve seen claims it might actually be happening…)? I.e. is current AI ML/LLMs on the wrong path, as argued in Jeff Hawkins’s excellent book (which I’ve read; his earlier one I have not)? It’s not based on spiking neurons. There is already work on spiking neural networks in Julia, and also some spiking neural network hardware available.
I’ve not read it; $254 is pretty steep, though it’s cheaper used and $75 on Kindle:
I don’t think Flux uses mixed-precision, so probably no. It is possible to configure CUDA.jl to use tensor cores more eagerly, at the expense of some precision, by starting Julia with fast math enabled or by calling CUDA.math_mode!(CUDA.FAST_MATH), which will e.g. use TF32 when doing an F32xF32 matmul. Further speed-ups are possible by setting CUDA.jl’s math precision to :BFloat16 or even :Float16. Ideally though, I guess Flux.jl would have an interface to use mixed-precision arithmetic.
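For reference, the global opt-in mentioned in that quote looks like this (a sketch; the exact way to lower the math precision to :BFloat16/:Float16 depends on the CUDA.jl version, so check its docs):

using CUDA

CUDA.math_mode!(CUDA.FAST_MATH)      # allow faster, lower-precision math, e.g. TF32 tensor cores

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.rand(Float32, 1024, 1024)
C = A * B                            # CUBLAS matmul; with FAST_MATH it may run on TF32 tensor cores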
I don’t think Float32 is used at all anymore (since it’s 2x slower), except by Julia still?! Usually it’s brain float (bfloat16), NOT the same as Float16 in Julia, which is not as good. bfloat16 is also outdated for inference, and I think also for training by now.
NOT needed to understand or implement:
KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search
Quantum architecture search (QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. […]
Moreover, in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. In tackling quantum chemistry problems, we enhance the recently proposed QAS algorithm by integrating Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the traditional MLP. This modification allows us to design a parameterized quantum circuit that contains fewer 2-qubit gates and has a shallower depth, thereby improving the efficiency of finding the ground state of a chemical Hamiltonian. Further investigation reveals that KAN requires a significantly smaller number of learnable parameters compared to MLPs; however, the average time of executing each episode for KAN is higher.
@Palli Thank you for this post, it was very informative.
when you say “In short, no, I think, for quantized models, and it might even be limited to FP32 in practice, … I don’t think Float32 is used at all anymore in any mainstream or research work (since it’s 2x, or 4x+, slower), except by Julia still?! But even Julia’s Float16 might be ok for KANs” I was not aware that this new trend had such an impact and that the direction had totally changed. How much of the community uses KANs now?
You may have a point with “Or even better, be ahead of the curve, e.g. by using KAN. It’s neither needed nor even helpful to implement old ideas if they take much longer to train.”
Regarding the ‘new direction’ possibility “we could start with non-quantized KANs, as has been done in Python (with only 285 lines in the code below) and then later add quantization to 8-bit also just on regular hardware, otherwise as done here”: could it be integrated into Flux? Would the chaining etc. have to be rewritten? There are a ton of great things in Flux. When you say “Note KANs are a drop-in replacement for MLPs, i.e. a critical part of transformers and more.” I wonder if some type of integration is possible.
Jeff Hawkins’s book: just skimming through some of the material reminded me of Marvin Minsky’s theories from ‘Society of Mind’.
I have understood the power of the 1-bit net, but is it certain that KANs are the future as well? I have looked at some of the results from small-scale studies and KAN seems comparable to the MLP, but I have not looked into large-scale network comparisons.
I was talking about bfloat16 being dominant for training, not Float32, and then quantization being used a lot. But not KANs, they are new. Sorry for the confusion, I just immediately continued with mentioning them and didn’t explicitly say they were popular yet, but I think they will be, in transformers. Quantization to 4-bit is I think mainstream, though I often see NONE used, i.e. models are released first with e.g. bfloat16, and then the quantization community post-quantizes the models, and maybe fine-tunes them.
I see from @ForceBru “Just use automatic differentiation” for backpropagation for KANs. And in a paper I linked: “trained four KAN models using PyTorch, each sized 17x1x14, with G values of 7, 15, 30, and 60 corresponding to array sizes of 128, 256, 512, and 1024, respectively.” So I certainly think KANs would fit into Flux.jl. While I’m no expert on Flux or Lux, if KANs do not fit there then they should. An alternative, and an ok first step, is to do it independently of them, as with: “we use KANs as a nice opportunity to implement them from scratch in simple Python (no PyTorch / TensorFlow: just some good old numpy!).”
Amazing page:
Look at e.g.
Parallelism Concepts
And Julia is most likely behind.
It’s not too important to do everything from scratch (see e.g. Jjama3.jl being impure, @noob, @AntonOresten, depending on Python/Rust code, but why not Rust directly? Its other dependency, BytePairEncoding.jl, is though a “Pure Julia implementation of the Byte Pair Encoding (BPE) method.”):
I absolutely agree we shouldn’t bother implementing tokenizers in Julia, but rather reuse them, and even better get rid of them (I also see Karpathy is now at a new AI company, Eureka Labs, after leaving OpenAI, and Tesla before that):
There is a whole separate stage with its own training and inference, and additional libraries. It complicates the ingest of additional modalities. Tokenization also has many subtle sharp edges. Few examples: […]
Tokenization creates attack surfaces, e.g. SolidGoldMagikarp […]
The list goes on, TLDR everyone should hope that tokenization could be thrown away. Maybe even more importantly, we may find general-purpose strategies for multi-scale training in the process.
Looking into “multi-scale training” I find a lot (most not directly on LLMs, on images, or time-series, not sure if the ideas translate to LLMs):
Experimental results demonstrate that SA-FedLoRA is an efficient FL, achieving superior performance to FedAvg and significantly reducing communication parameters by up to 93.62%
I did not expect to see “wireless” and “jamming resistant” in relation to LLMs:
R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models
Jjama3.jl itself is pure, and you can use any tokenizer you like. It initially used BytePairEncoding.jl for its tokenizer, but then I couldn’t get that to work for eg. HF’s SmolLM2. @AntonOresten wrapped the Python interface in HuggingFaceTokenizers.jl I think just because of familiarity with wrapping Python rather than Rust (I’m not actually sure their Rust library will handle eg. importing a pre-trained tokenizer in the config we need?).
I 100% support the BytePairEncoding.jl effort, but for now we want to tinker with models and we don’t care what does the tokenization, especially if it means avoiding subtle footguns (and in tokenization there are MANY).
Transformers.jl is not just the interface to pre-trained models. It has all the building blocks needed for both training and inference. You can define your own LM with it and train it with Flux on CPU/GPU. The corresponding operations are defined in NeuralAttentionlib.jl for both CPU and GPU with many optimizations.
Unfortunately, no. A lot of configurations are hard-coded inside the transformers package, and the tokenizers package is not sufficient to perform the exact tokenization. That’s the main reason why we cannot have a stable tokenizer loader in Julia.
Best for many languages including Julia, Python and R (and on average across the languages), compared to e.g. GPT-4o-2024-08-06; DS_Coder-V2-Instruct matched it for Julia, and exceeds it for some languages, such as PHP and Scala.
I ask all LLMs this first:
What is the Julia language and can you show me example code?
It gave a good answer, with code examples appended, not just simple ones but going up to:
7. Solving a Simple Differential Equation (using DifferentialEquations.jl)
I haven’t confirmed it’s correct yet, but it was plausible, and so is the following (though likely at least a bit wrong), after some further questions:
As of my last update, Adan is not directly implemented in Flux.jl. However, you can implement it yourself or use a custom implementation. Below is a basic implementation of the Adan optimizer in Julia:
Custom Adan Optimizer in Julia
using Flux
using Flux: crossentropy, params
using Statistics
using MLDataPattern
using RDatasets
# Define the Adan optimizer
mutable struct Adan              # mutable so the step counter can be updated in place
    lr::Float32
    β1::Float32
    β2::Float32
    β3::Float32
    eps::Float32
    step::Int
    m::IdDict                    # keyed by parameter identity (the arrays mutate, so a plain Dict would break)
    v::IdDict
    h::IdDict
end
function Adan(lr::Real=0.001, β1::Real=0.9, β2::Real=0.999, β3::Real=0.999, eps::Real=1e-8)
    # accept any Real (e.g. Float64 literals) and convert to Float32
    return Adan(Float32(lr), Float32(β1), Float32(β2), Float32(β3), Float32(eps), 0, IdDict(), IdDict(), IdDict())
end
function (opt::Adan)(gs)   # gs is the Zygote.Grads returned by Flux's gradient
    opt.step += 1
    for p in gs.params
        g = gs[p]
        g === nothing && continue          # skip parameters with no gradient
        if !haskey(opt.m, p)
            opt.m[p] = zero(p)
            opt.v[p] = zero(p)
            opt.h[p] = zero(p)
        end
        m, v, h = opt.m[p], opt.v[p], opt.h[p]
        m .= opt.β1 .* m .+ (1 - opt.β1) .* g
        v .= opt.β2 .* v .+ (1 - opt.β2) .* g .^ 2
        h .= opt.β3 .* h .+ (1 - opt.β3) .* (g .- m) .^ 2
        m_hat = m ./ (1 - opt.β1 ^ opt.step)
        v_hat = v ./ (1 - opt.β2 ^ opt.step)
        h_hat = h ./ (1 - opt.β3 ^ opt.step)
        # note: a rough LLM-generated take on Adan, not the exact update rule from the paper
        p .-= opt.lr .* m_hat ./ (sqrt.(v_hat .+ h_hat) .+ opt.eps)
    end
end
# Load the Iris dataset
iris = dataset("datasets", "iris")
# Convert the features to a Float32 matrix with observations as columns
X = permutedims(Matrix{Float32}(iris[:, 1:4]))
Y = iris[:, 5]
# Split into training and testing sets, stratified on the raw class labels
(train_x, train_y), (test_x, test_y) = stratifiedobs((X, Y), p = 0.8)
# One-hot encode the labels (same label order for both splits)
labels = unique(Y)
train_y = Flux.onehotbatch(train_y, labels)
test_y = Flux.onehotbatch(test_y, labels)
# Define the neural network model
model = Chain(
Dense(4, 10, relu),
Dense(10, 3),
softmax
)
# Define the loss function
loss(x, y) = crossentropy(model(x), y)
# Define the Adan optimizer
opt = Adan(0.001, 0.9, 0.999, 0.999, 1e-8)
# Define the training loop
function train!(model, data, opt, n_epochs=100)
for epoch in 1:n_epochs
for (x, y) in data
gs = gradient(() -> loss(x, y), params(model))
opt(gs)
end
if epoch % 10 == 0
println("Epoch $epoch: Loss = $(loss(train_x, train_y))")
end
end
end
# Create a DataLoader for batch training
data_loader = [(train_x, train_y)]
# Train the model
train!(model, data_loader, opt, 100)
# Make predictions on the test set
predictions = model(test_x)
# Convert predictions to class labels
predicted_labels = [argmax(predictions[:, i]) for i in 1:size(predictions, 2)]
true_labels = [argmax(test_y[:, i]) for i in 1:size(test_y, 2)]
# Calculate accuracy
accuracy = mean(predicted_labels .== true_labels)
println("Test Accuracy: $(accuracy * 100)%")
Explanation
Adan Optimizer Definition:
Parameters: Learning rate (lr), betas for the first, second, and third moments (β1, β2, β3), and epsilon (eps).
State Variables: m (first moment), v (second moment), and h (third moment).
Update Rule:
Moments Calculation: Compute the first, second, and third moments.
Bias Correction: Apply bias correction to the moments.
Parameter Update: Update the parameters using the corrected moments and the learning rate.
Training Loop:
Gradient Calculation: Compute the gradients using gradient.
Optimizer Update: Apply the Adan optimizer to update the model parameters.
Evaluation:
Predictions: Make predictions on the test set.
Accuracy Calculation: Calculate the test accuracy.
Conclusion
By implementing the Adan optimizer in Julia, you can leverage its advanced features for better convergence and performance in your deep learning models. If you find that this custom implementation meets your needs, you can use it in your projects. If you encounter any issues or need further optimizations, feel free to ask!
Great! Though it doesn’t seem to have the best building blocks. Then what can be improved, for best bang for the buck, and in which order?
It uses Adam (not to be confused with Adan or AdamW), i.e. what Flux.jl has; but Flux doesn’t have Adan, and we can’t compete in training without it:
[The first line there is added to the Readme 6 months ago, and I also think we need, or want to support MoE, but can’t yet I believe]
Results on large language models, like MoE and GPT2, are released.
FusedAdan with less memory footprint is released.
[…]
Adan has a slightly higher GPU memory cost than Adam/AdamW on a single node. However, this problem can be solved using the ZeroRedundancyOptimizer, which shares optimizer states across distributed data-parallel processes to reduce per-process memory footprint. Specifically, when using the ZeroRedundancyOptimizer on more than two GPUs, Adan and Adam consume almost the same amount of memory.
[…] Adan obtains comparable results with only half cost.
I think, but I’m not sure, that most do not train from scratch, and that we might want to get a checkpoint from someone (if not fully pre-trained) and continue from there. And you can even change the optimizer to Adan or whatever, or even finish with just Adam[W] (or whatever best Julia has available), if Adan was used.
If you want to train from scratch you really want to use Adan somehow; that might not mean reimplementing it: if it’s somehow possible to train with it, reuse such code where it already exists.
We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably
While in the Adan paper:
We evaluate Adan on the large language models (LLMs), GPT-2 [45], for code generalization tasks, which enables the completion and synthesis of code, both from other code snippets and natural language descriptions
[…] B.1 Pre-training Results on LLMs
[…]
TABLE 15: Comparison of training loss for MoE with different data volumes and model sizes using Adan and AdamW.
[…]
The results, as summarized in Table 15, indicate that Adan consistently outperforms the AdamW optimizer across all configurations and data volumes. This improvement underscores Adan’s capacity for efficient parameter updates and its utility in large-scale distributed training setups.
[…] B.2 Detailed Comparison on ViTs
Besides AdamW, we also compare Adan with several other popular optimizers, including Adam, SGD-M, and LAMB, on ViT-S. Table 16 shows that SGD, Adam, and LAMB perform poorly on ViT-S, which is also observed in the works [103], [29]. These results demonstrate that the decoupled weight decay in Adan and AdamW is much more effective than 1) the vanilla weight decay, namely the commonly used ℓ2 regularization in SGD, and 2) the one without any weight decay, since as shown in Eqn. (6), the decoupled weight decay is a dynamic regularization along the training trajectory and could better regularize the loss. Compared with AdamW, Adan’s advantages mainly come from its faster convergence speed. This empirical evidence solidifies Adan as a superior choice for training ViTs, particularly when rapid convergence is essential
The paper on Adan is from 2022, but was updated days ago; it doesn’t mention “fused”, and FusedAdan seems also critical.
I only found Adan together with Julia here (but “PLSR1 [underscore, not searchable] AdaN” probably means something else there, though that optimizer from September 2023 in Julia might be interesting): https://www.gerad.ca/fr/papers/G-2023-41.pdf
PLSR1: A limited-memory partitioned quasi-Newton optimizer for partially-separable loss functions
Abstract : Improving neural network optimizer convergence speed is a long-standing priority. Recently, there has been a focus on quasi-Newton optimization methods, which have fewer hyperparameters compared to gradient-based methods and show improved convergence results in deterministic optimization. We introduce PLSR1, a limited-memory partitioned quasi-Newton optimizer designed for optimizing a partially separable loss function, which is a sum of element loss functions of smaller dimensions
[…]
Thus, we adapt standard neural network architectures by incorporating separable layers, creating a partitioned architecture (PSNet). The numerical results compare the performance of several optimizers training the same partially-separable loss function on LeNet and PSNet architectures of similar sizes and effectiveness. The graphs exhibit the optimizer accuracies over epochs, on both the MNIST and CIFAR10 datasets. PLSR1 and an adaptative Nesterov variant show a training convergence comparable to Adam and outperforms LBFGS and SGD.
[…]
Figure 4 illustrates the comparison between several optimizers: SGD, Adam, LBFGS, and PLSR1 and PLSR1_AdaN.
[…]
To take full advantage of partial separability within the loss, it’s crucial to incorporate separable layers into the architecture, creating a partitioned architecture. In this setup, our empirical results demonstrate that PLSR1 outperforms SGD and LBFGS and is competitive with Adam. Regarding future work, we aim to integrate this partitioned training into a distributed computing framework to lean toward federated learning
It’s not meant as a criticism of your package. By “have”, I should have written “uses”, e.g. it uses Adam not Adan (and that is about Flux.jl, i.e. your dependency), as I explained; so I meant the whole Julia ecosystem doesn’t have the best components available. KAN instead of MLP (a drop-in replacement for it) is another improvement.
If you were just asking about the order of improvements: both KAN and better optimizers are important, and they can be done in either order. I’m pretty sure KAN is an improvement and is here to stay, maybe with alternative B-spline implementations, but getting any implementation would be a good first step.
I might be wrong about Adan, or something even better can always come along. It is used for LLMs too, though at first I wasn’t sure if it’s only an improvement for non-LLMs. Even then it would be good to have it available.
Quantization seems important, but less so if we go with KAN, then I would skip implementing such, as it may be a distraction, and not needed going forward.
@chengchingwen, why did you use Adam, not AdamW? I’m reading that AdamW is rather better, so was it not available in Julia at the time? Adan is then also better, but wasn’t available, so that’s understandable. I want to know which is best, and be confident in my findings; I’m not saying Adam was a bad choice, it’s hard to know and be confident, and people tend to go with the flow. Adan is better, but maybe not a lot better, and then KAN may be the thing to focus on, unless Adan is just simple to implement; I may have it already in Julia, i.e. the LLM generated it for me, and after fixing some bugs it seems to run.
This is also the right kind of question to ask. I think most of us who’ve done ML-related research understand how exhausting it is to walk on the paper treadmill. When 95% of them end up amounting to nothing, it doesn’t make sense to dedicate resources toward testing every remotely promising or novel idea.
The fundamental limitation of ML/DL in Julia is that there are very few people who could contribute (i.e. have the expertise and willingness), and they have very little bandwidth to do so. Any strategy to improve the ecosystem must take this into account. It’s fine to try implementing a paper every now and again, but that has to come with the recognition that implementing any old paper is going to on average do nothing to improve the overall ecosystem.
So if not the paper treadmill, then what? Let me repeat a suggestion I made on another recent thread: start with a couple of concrete gaps or use cases, then work upstream to see what is needed across the ecosystem to support them. This is how we got Transformers.jl. It’s why Llama2.jl and now Jjama3.jl exist too! Heck, it’s arguably why the entire JuliaGenAI org exists. Are we trying to push the SOTA? No, and that’s fine because it would be unrealistic to try with the resources the Julia ML ecosystem has. Better to target what people are actually using/doing in the wild.
I was looking around HuggingFace for models that were trained in Julia completely and could not find any; maybe I did not look correctly or thoroughly enough. It would be great to see that happen. It is possible that Julia is used and it is simply not mentioned.
@ToucheSir I can imagine how busy everyone is. It would be great if there was a map/diagram of what has been done so that new efforts don’t reinvent the wheel. For example, some of the people who have developed key packages could outline the parts they do not have time to develop, so that a ‘roadmap’ of some sort is made to not waste efforts and to reuse some of the great work already done, like what @chengchingwen describes in Transformers.jl. I imagine that a lot of the core parts are already there, and wonder which of the peripheral components are needed. Then some advice on how to explore new approaches of the sort @Palli mentioned easily/efficiently. Maybe it is just the way that I think, but I am imagining a high-level component diagram with stages of a stack, for milestones to be set up. I totally agree with your approach to just get into the task and see how the effort ends up, and in the absence of any top-down advice I will just go along that route, or with anyone also eager to take up the task.
It’s not a bad idea, but in my experience these roadmaps never materialize because they require extra side-of-desk work from a bunch of people. The number of relevant packages for creating a LLM in Julia are pretty finite. Case in point, I believe at least one author from each has replied on this thread.
As such, it may be easier to ask questions about specific packages as they come up. Alternatively, declare what route(s) you want to take when you embark on a task and let others chime in with suggestions
Simply because I don’t have any preference for optimizers. Optimizers aren’t and shouldn’t be part of the transformer package. Any optimizer that works with Flux can be applied, but whether it would give optimal results is beyond the scope (specifically, beyond my bandwidth).
I understand that @Palli’s point is to overtake others on the curve with the rise of some new structures. That may be hopeful. But a healthier approach is what @ToucheSir suggests here, so that we can prepare well for the next boom.
Based on my experience, there are still tons of missing components for training an LLM from scratch in pure Julia. (But don’t take me wrong, I’m still optimistic. Personally I think the most important issue right now is that top-tier GPUs are still too expensive and not that easily accessible to most people. But in the long run, I believe the price will go down and more developers will realize that they deserve a better programming language or framework. I’m talking about Megatron.)
I got some bandwidth recently and tried to implement many deep generative models (VAE, GAN, VQGAN, LLAMA, MoE, DDPM) from scratch with Lux.jl. Honestly speaking, I really like its design. But to move on to models of large size across multiple nodes, we still have a lot of work to do. (And I’m not sure if it is worth working in this direction right now.)
A more practical roadmap from my point is probably:
Single card inference
Multi card(node) inference
Single card training
Multi card training
Multi node training
And it would be great to have one public repo for people to report the benchmark results so that we can know what others have achieved until now.
I personally and optimistically hope that can be handled by package/people that implement the distributed optimizers. While I do actually hope we can have that in the package, it is not on my list of priorities. Essentially it’s still the same problem: too much stuff, too little time, too few people.
That’s a recent paper updated in August but not as recent as FC-KAN from September:
[…]
In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: this https URL.
[…] By introducing a new perspective to the scientific community in the new design of neural networks, KANs inspired many works to prove their effectiveness by topics, including expensive problems [26], keyword spotting [27], mechanics problems [28], quantum computing [29, 30, 31], survival analysis [32], time series forecasting [33, 34, 35, 36, 37], and vision tasks [38, 39, 40]. Also, many novel KANs utilize well-known mathematical functions, particularly those capable of handling curves, such as B-Splines [41] (Original KAN [1], EfficientKAN, BSRBF-KAN [9]), Gaussian Radial Basis Functions (GRBFs) (FastKAN [5], DeepOKAN [28], BSRBF-KAN [9]), Reflection SWitch Activation Function (RSWAF) in FasterKAN [6], Chebyshev polynomials (TorchKAN [42], Chebyshev KAN [8]), Legendre polynomials (TorchKAN [42]), Fourier transform (FourierKAN, FourierKAN-GCF [43]), wavelets [7, 44], and other polynomial functions [45].
[…]
Wav-KAN is a neural network architecture that integrates wavelet functions into Kolmogorov-Arnold Networks to address challenges in interpretability, training speed, robustness, and computational efficiency found in MLP and LiuKAN [7]. By efficiently capturing both high and low-frequency components of input data, Wav-KAN achieves a balance between accurately representing the data structure and avoiding overfitting. The authors use several wavelet types, including the DoG, Mexican hat, Morlet, and Shannon. In our paper, we use the DoG function to combine other functions to create function combinations
I wouldn’t give up on the KAN idea, at least some variant of it. I think implementing even the original KAN would help, and you could always change the spline in a later iteration (my feeling is that Julia is the best language to explore such variants).
FC-KAN has the highest accuracy (important); FastKAN and FasterKAN are though much faster, only about 4-10% slower than MLP, while FC-KAN is, yes, 55-84% slower than MLP.
What may seemingly kill the idea is that all the KANs have an order of magnitude more parameters (and those in MLPs can likely be quantized more), BUT the loss goes down way faster while you train, and goes way lower, so maybe you can stop sooner, once you’ve hit the same loss or better.
It worries me though: can you scale KANs up to the huge neural networks you see in transformers/LLMs? Or do you actually not have to scale the parameter count up that much, and/or will they get very sparse? If not a lot of those parameters are zeros that compress well, then it’s a limit for one (or any number of) GPUs.
From the former paper:
We use d_in and d_out to denote the input and output dimensions of the neural network layer. We use K to denote the order of the spline […] We use G to denote the number of spline intervals
[…]
The learnable parameters in KAN include the B-spline control points, the shortcut weights, the B-spline weights, and the bias. The total learnable parameters are
(d_in × d_out) × (G + K + 3) + d_out.
Correspondingly, the learnable parameters of one MLP layer is
(d_in × d_out) + d_out
So it has (G + K + 3) times more parameters, but this is per layer, and it seems to me that if the width can be shrunk by that factor (and maybe twice that, given 4-bit quantization vs 8-bit integers) then you’re on even ground. If such a smaller network (by ~10x) isn’t enough, then maybe you can make it 10x deeper, and seemingly that doesn’t cost you such a factor (only during training).
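To put numbers on that ratio (illustrative values; G = 5 and K = 3 are the common KAN defaults, and the layer width is just an example):

kan_params(d_in, d_out, G, K) = d_in * d_out * (G + K + 3) + d_out   # from the quoted formula
mlp_params(d_in, d_out)       = d_in * d_out + d_out

d_in = d_out = 1024; G = 5; K = 3
kan_params(d_in, d_out, G, K)                              # 11_535_360
mlp_params(d_in, d_out)                                    # 1_049_600
kan_params(d_in, d_out, G, K) / mlp_params(d_in, d_out)    # ≈ 11.0, i.e. about G + K + 3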
I do see MLP in the code, and actually BloomMLP only appears there in the docs… I think it’s only a synonym for a regular MLP(?), but patterned after the Bloom model (as in its dimensions); i.e. I’ve never heard of BloomAttention either, or BloomGelu. It does have attention, yes, but it doesn’t seem to support FlashAttention, which we likely want. And GeLU is much used.
I also see MLP in Flux.jl itself; I didn’t look carefully at where KAN should replace MLP, likely in either package. I understand KAN is a drop-in replacement, i.e. conceptually; you need to adjust for more parameters (see, I’ve edited my previous post).
What worries me is this part:
Under fixed training iterations, we found that KAN’s forgetting issue is more severe than that of MLP.
“Catastrophic forgetting” is a known problem with transformers/MLPs (with later fine-tuning and continual learning). I need to look more into what that refers to in KANs, whether it’s the same thing, and if it’s a real issue. Note, they are not comparing KAN to a traditional MLP, but to an MLP with B-splines, i.e. seemingly already taking ideas from KAN, and we would most likely just implement the original MLP.
The differences between KAN and common MLP are in two aspects. (1) Different activation functions. Commonly, the activation functions in MLP, such as ReLU and GELU, have no learnable parameters and are uniform for all input elements. However, in KAN, the activation function is a spline function, which has learnable parameters and is different for each input element. (2) The order of linear and non-linear operation. Generally, we conceptualize MLP as performing a linear transformation followed by a non-linear transformation. However, KAN actually performs a non-linear transformation first, followed by a linear transformation.
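To spell out the quoted difference in ordering, here is a toy contrast (hypothetical helper functions, shapes only, not a full KAN implementation):

mlp_layer(W, b, x) = max.(0, W * x .+ b)                 # MLP: linear first, then a fixed ReLU
# KAN: a learnable non-linearity per input element first, then a linear combination;
# φ is a vector of per-element spline functions with trainable coefficients
kan_layer(W, φ, x) = W * map((f, xi) -> f(xi), φ, x)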