Community Interest Check: LLMs from Scratch in Pure Julia

Incredible! Looking forward to it :slight_smile:

1 Like

consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs.

You can try out 2- or 3-bit such models here:

But I’m a little confused: the 2- and 3-bit models seem to be the same size, and the same as uncompressed Llama, I think. What I believe is happening is that the safetensors format (which Julia supports with a package) hasn’t caught up with (such) quantization, i.e. you are storing a lot of zero bits. There must be some way to store the weights without them (i.e. packed/compressed), or is that not yet developed? If so, the release only shows that the quantization per se works well, and you most likely wouldn’t be running it with the optimized code that takes the packing into account.
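
To illustrate the packing point (a toy sketch of my own, not ShiftAddLLM’s storage code, and the helper names are made up): if each 2-bit code is written back into a 16-bit tensor, the safetensors file stays Llama-sized; packing four codes per byte shrinks it as expected.

# Quantize a weight vector to 2-bit codes (0:3) with a per-tensor offset and scale.
function quantize2bit(w::Vector{Float32})
    lo, hi = extrema(w)
    scale = (hi - lo) / 3                      # 4 levels -> 3 steps
    codes = UInt8.(clamp.(round.(Int, (w .- lo) ./ scale), 0, 3))
    return codes, lo, scale
end

# Pack four 2-bit codes into each byte: 16x smaller than Float32 storage.
function pack2bit(codes::Vector{UInt8})
    packed = zeros(UInt8, cld(length(codes), 4))
    for (i, c) in enumerate(codes)
        byte, slot = divrem(i - 1, 4)
        packed[byte + 1] |= c << (2 * slot)
    end
    return packed
end

w = randn(Float32, 4096)
codes, lo, scale = quantize2bit(w)
packed = pack2bit(codes)
println(sizeof(w), " bytes as Float32 vs ", sizeof(packed), " bytes packed")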

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

https://arxiv.org/pdf/2405.04532
Anything that gets to or below 4 bits (without degradation) is a major deal, but it’s not the whole story; even 1-bit isn’t the limit, since you can also prune whole layers:

Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question.[…]
We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. […]
In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation
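
As a toy illustration of the ordering question in that quote (my own sketch, not the paper’s method or error metric; on a plain Gaussian matrix the two orders come out similar, the paper’s argument is about how the errors compound through real layers):

using LinearAlgebra, Statistics

# Magnitude pruning: keep only the largest-|w| fraction of weights.
prune(W, frac) = W .* (abs.(W) .>= quantile(vec(abs.(W)), frac))

# Symmetric uniform quantizer to a given bit width.
function quantize(W, bits)
    s = maximum(abs, W) / (2^(bits - 1) - 1)
    return round.(W ./ s) .* s
end

W = randn(Float32, 256, 256)
sparse_then_quant = quantize(prune(W, 0.5), 4)   # the order the paper argues for
quant_then_sparse = prune(quantize(W, 4), 0.5)   # the reverse order

relerr(A) = norm(W - A) / norm(W)
println("prune→quantize error: ", relerr(sparse_then_quant))
println("quantize→prune error: ", relerr(quant_then_sparse))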

[only major savings there] Activations can take up a significant amount of memory during training

We may be at the limit for activations already, so now the field is shifting to quantization of the KV-cache.
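
A minimal sketch of what that means in practice (the layout and per-position scales here are my own assumptions, not from a specific paper): store the cached keys/values as Int8 plus one Float32 scale per cached position, roughly a 4x memory cut versus Float32 (2x versus bf16).

struct QuantKV
    q::Matrix{Int8}          # quantized keys or values, head_dim × cached positions
    scale::Vector{Float32}   # one scale per cached position
end

function quantize_kv(kv::Matrix{Float32})
    scale = vec(maximum(abs, kv; dims = 1)) ./ 127f0
    q = Int8.(clamp.(round.(Int, kv ./ scale'), -127, 127))
    return QuantKV(q, scale)
end

dequantize_kv(c::QuantKV) = Float32.(c.q) .* c.scale'

kv = randn(Float32, 128, 1024)   # head_dim = 128, 1024 cached tokens
c = quantize_kv(kv)
println("bytes: ", sizeof(kv), " → ", sizeof(c.q) + sizeof(c.scale))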

we provide the official implementation of FP6-LLM

Note, the FP6-LLM (paper) is newer than LLM-FP4 below, which is from 2023.

This might also be of interest, though I think already outdated:
https://arxiv.org/html/2405.13938v1

https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

LLM-FP4: 4-Bit Floating-Point Quantized Transformers https://aclanthology.org/2023.emnlp-main.39.pdf

We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values, in a post-training manner

https://dennisforbes.ca/articles/understanding-floating-point-numbers.html

Transformer models are mainstream (for LLMs and much else, e.g. computer vision), while diffusion models dominate image and video generation. Hybrids of the two are very intriguing for multi-modal work, and here:

HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

https://arxiv.org/html/2405.19751v2

3 Likes

In short, no, I think, for quantized, and it might even be limited to FP32 in practice, but more on this at the bottom. I don’t think Float32 is used at all anymore in mainstream practice or research (since it’s 2x, or 4x+, slower), except by Julia still?! But even Julia’s Float16 might be OK for KANs:

My thinking is that if we want to do a “Julia LLM from scratch” project at all, then for training (and even for just inference) it needs to be competitive, using recent methods, or why bother? Or even better, be ahead of the curve, e.g. by using KANs. It’s neither needed nor helpful to implement old ideas if/since they take much longer to train.

Quantization has been the go-to method for faster inference (it allows more parameters in the same fixed amount of [GPU/TPU] memory), and by now I believe it can actually also help for training.

So I’ve been wondering whether it even applies to KANs (or other new ideas; I don’t want to lead us down the wrong path), and it seems so:

Hardware Acceleration of Kolmogorov–Arnold Network (KAN) for Lightweight Edge Inference

https://arxiv.org/pdf/2409.11418
Acceptance date: September 2, 2024

Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters by parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. Evaluating the B-spline functions can be performed by using look-up tables (LUTs) to directly map the B-spline functions, thereby reducing computational resource requirements. However, this method still requires substantial circuit resources (LUTs, MUXs, decoders, etc.). For the first time, this paper employs an algorithm-hardware co-design methodology to accelerate KAN. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware aware quantization, KAN sparsity aware mapping strategy, and circuit-level techniques […]
with analog-CIM (ACIM) circuits. The impact of non-ideal effects, such as partial sum errors caused by the process variations, has been evaluated with the statistics measured from the TSMC 22nm RRAM-ACIM prototype chips. With the best searched hyperparameters of KAN and the optimized circuits implemented in 22 nm node, we can reduce hardware area by 41.78x, energy by 77.97x with 3.03% accuracy boost compared to the traditional DNN hardware

We of course don’t have access to exotic analog prototype chips, but I think that’s OK: we could start with non-quantized KANs, as has been done in Python (in only 285 lines in the code below), and then later add 8-bit quantization on regular hardware, otherwise as done here:

We propose an Alignment-Symmetry and PowerGap KAN hardware aware quantization that, for the first time, investigates the interaction between quantization grid and knot grid in KAN. The proposed method significantly minimizes the cost of LUTs, MUXs, decoders for B(X) function.
[…]
our focus is on accelerating wsspline(x) computation. In our implementation, ws is multiplied with ci and becomes ci’, then is quantized to 8-bit, transforming the formula to equation (3)

Note that KANs are a drop-in replacement for MLPs, which are a critical part of transformers and more.

It might seem like going down to only 8 bits is not good, being 2x to 8x larger than the competing mainstream, but it’s a win if the parameter count is reduced by at least as much, e.g. 8x+. And it must already have been considered a win (for space) before quantization was introduced, given the “orders of magnitude fewer parameters”; likely a huge win for space, and for compute?

https://www.reddit.com/r/learnmachinelearning/comments/1ffqutz/how_would_backpropagation_work_in_kans/

For backpropagation purposes, it is also convenient to store the output derivative with respect to the inputs:

However, all activation functions are linear combination of a fixed set of basis functions which are B-splines; given that, we can reformulate the computation as activate the input with different basis functions and then combine them linearly. This reformulation can significantly reduce the memory cost and make the computation a straightforward matrix multiplication, and works with both forward and backward pass naturally.

The problem is in the sparsification which is claimed to be critical to KAN’s interpretability.
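
To make the quoted reformulation concrete, here is a small sketch of my own (not from any KAN package): expand each scalar input into a fixed set of B-spline basis values via the Cox–de Boor recursion, and the layer then becomes a single matrix multiplication over the expanded features. A real KAN layer also has a base-activation branch, which I leave out.

# Cox–de Boor recursion: the i-th degree-k B-spline basis at x over knot vector t.
function bspline_basis(i, k, t, x)
    if k == 0
        return (t[i] <= x < t[i+1]) ? 1.0 : 0.0
    end
    left  = t[i+k]   == t[i]   ? 0.0 : (x - t[i])     / (t[i+k]   - t[i])   * bspline_basis(i,   k-1, t, x)
    right = t[i+k+1] == t[i+1] ? 0.0 : (t[i+k+1] - x) / (t[i+k+1] - t[i+1]) * bspline_basis(i+1, k-1, t, x)
    return left + right
end

# Expand a d_in × batch input into (d_in * nbasis) × batch basis values.
function basis_expand(x::AbstractMatrix, t, k)
    nbasis = length(t) - k - 1
    din, batch = size(x)
    B = zeros(din * nbasis, batch)
    for n in 1:batch, d in 1:din, i in 1:nbasis
        B[(d - 1) * nbasis + i, n] = bspline_basis(i, k, t, x[d, n])
    end
    return B
end

# KAN-style layer: fixed basis expansion, then one matmul (the reformulation quoted above).
kan_layer(W, x, t, k) = W * basis_expand(x, t, k)   # W: d_out × (d_in * nbasis)

k = 3                                      # cubic splines
t = collect(range(-4, 4, length = 12))     # knot grid
x = randn(4, 16)                           # d_in = 4, batch of 16
W = randn(8, 4 * (length(t) - k - 1))      # d_out = 8
y = kan_layer(W, x, t, k)                  # 8 × 16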

Activations were the bottleneck before; with KAN or anything newer, the bottleneck might shift. Still, the current mainstream quantization (and pruning and other sparsification) might work for it, complementing KAN, which might only be used for some of the parameters.

My other worry is: should we go for something more brain-inspired (e.g. the brain is generally considered not to use backpropagation, as it’s thought to be biologically implausible, though I’ve seen claims it might actually be happening…)? I.e. is current ML/LLM AI on the wrong path, as argued in Jeff Hawkins’s excellent book (which I’ve read; his earlier one I haven’t)? Current AI isn’t based on spiking neurons. There is already work on spiking neural networks in Julia, and some spiking neural network hardware is available.

I haven’t read it; $254 is pretty steep, though it’s cheaper used and $75 on Kindle:

https://arxiv.org/html/2408.14811v1

BitNet: LLM Quantization at its Extreme Kolmogorov-Arnold Networks

Explaining Paper: [2410.23168] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Julia has some native bfloat16 support by now (in 1.11; see Adapt to upstream changes wrt. native support for BFloat16 by maleadt · Pull Request #51 · JuliaMath/BFloat16s.jl · GitHub), e.g. for the AMD EPYC 9554 CPU, but I’m not sure that’s good enough, since you want to use GPUs or TPUs anyway, and I see:

Also, if I understand correctly, CUDA.jl doesn’t fully support bfloat16, which is quite limiting.

don’t think Flux uses mixed-precision, so probably no. It is possible to configure CUDA.jl to use tensor cores more eagerly, at the expense of some precision, by starting Julia with fast math enabled or by calling CUDA.math_mode!(CUDA.FAST_MATH), which will e.g. use TF32 when doing an F32xF32 matmul. Further speed-ups are possible by setting CUDA.jl’s math precision to :BFloat16 or even :Float16. Ideally though, I guess Flux.jl would have an interface to use mixed-precision arithmetic.

I don’t think Float32 is used at all anymore (since it’s 2x slower), except by Julia still?! Usually it’s brain-float bfloat16, which is NOT the same as Float16 in Julia; the latter is not as good for this. bfloat16 is also outdated for inference, and I think also for training by now.
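
A quick CPU-side check of that difference, using BFloat16s.jl: the two formats spend the same 16 bits very differently (bfloat16 keeps Float32’s 8 exponent bits, IEEE Float16 only 5).

using BFloat16s   # provides the BFloat16 type

println(floatmax(Float16))    # 6.55e4: overflows very early
println(floatmax(BFloat16))   # ≈ 3.39e38: same order as Float32
println(eps(Float16(1)))      # ≈ 0.00098 (10 significand bits)
println(eps(BFloat16(1)))     # ≈ 0.0078  (7 significand bits)

# A gradient-scale value that is fine in bfloat16 but overflows Float16:
println(Float16(1f5))         # Inf
println(BFloat16(1f5))        # ≈ 1e5, still representable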

NOT needed to understand or implement:

KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search

https://arxiv.org/pdf/2406.17630

Quantum architecture search (QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. […]
Moreover, in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. In tackling quantum chemistry problems, we enhance the recently proposed QAS algorithm by integrating Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the traditional MLP. This modification allows us to design a parameterized quantum circuit that contains fewer 2-qubit gates and has a shallower depth, thereby improving the efficiency of finding the ground state of a chemical Hamiltonian. Further investigation reveals that KAN requires a significantly smaller number of learnable parameters compared to MLPs; however, the average time of executing each episode for KAN is higher.

3 Likes

@Palli Thank you for this post, it was very informative.

When you say “In short, no I think for quantized, and might even be limited to FP32 in practice,…I don’t think Float32 is used at all anymore in any mainstream or research (since 2x, or 4x+ slower), except by Julia still?! But even Julia’s Float16 might be ok for KANs”, I was not aware that this new trend had such an impact and that the direction had totally changed. How much of the community uses KANs now?

You may have a point with “Or even better, be ahead of the curve, as by using KAN. There’s no need or even helpful to implement old ideas if/since they take much longer to train, with naive or old ideas.”

Regarding the ‘new direction’ possibility “we could start with non-quantized KANs, as has been done in Python (with only 285 lines in the code below) and then later add quantization to 8-bit also just on regular hardware, otherwise as done here”: could it be integrated into Flux? Would the chaining etc. have to be rewritten? There are a ton of great things in Flux. When you say “Note KANs are a drop in replacement for MLPs, i.e. a critical part of transformers and more.” I wonder if some type of integration is possible.

Jeff Hawkins’s book, just skimming through some of the material, reminded me of Marvin Minsky’s theories from ‘The Society of Mind’.

I have understood the power of the 1-bit net, but is it certain that KANs are the future as well? I have looked at some of the results from small-scale studies and KANs seem comparable to MLPs, but I have not looked into the large-scale network comparisons.

I was talking about bfloat16 being dominant for training, not Float32, and then quantization being used a lot. But not KANs, they are new. Sorry for the confusion, I just immediately continued with mentioning them; I didn’t explicitly say they were popular yet, but I think they will be, in transformers. Quantization to 4-bit is I think mainstream, though I often see NONE used, i.e. models are released first with e.g. bfloat16, and then the quantization community post-quantizes the models, and maybe fine-tunes.

I see from @ForceBru “Just use automatic differentiation” for backpropagation for KANs. And in a paper I linked: “trained four KAN models using PyTorch, each sized 17x1x14, with G values of 7, 15, 30, and 60 corresponding to array sizes of 128, 256, 512, and 1024, respectively.” So I certainly think KANs would fit into Flux.jl. While I’m no expert on Flux or Lux, if KANs do not fit there then they should. An alternative, and an OK first step, is to do it independently of them, as with: “we use KANs as a nice opportunity to implement them from scratch in simple Python (no PyTorch / TensorFlow: just some good old numpy!).”
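
To make the “it would fit into Flux” claim concrete, here is a hedged sketch of a KAN-style layer as an ordinary custom Flux layer (the name RBFKANLayer is made up; I use a Gaussian RBF basis as FastKAN does just to keep it short, and a B-spline basis would slot in the same way):

using Flux

struct RBFKANLayer{M<:AbstractMatrix, V<:AbstractVector}
    W::M          # d_out × (d_in * ncenters) mixing weights (learnable)
    centers::V    # fixed grid of basis centers
    width::Float32
end

function RBFKANLayer(din::Int, dout::Int; ncenters::Int = 8, lo = -2f0, hi = 2f0)
    grid = range(lo, hi, length = ncenters)
    RBFKANLayer(Flux.glorot_uniform(dout, din * ncenters), collect(Float32, grid), Float32(step(grid)))
end

Flux.@functor RBFKANLayer (W,)   # only W is trainable; the grid stays fixed

function (l::RBFKANLayer)(x::AbstractMatrix)
    din, batch = size(x)
    # basis expansion to (ncenters * din) × batch, then one matmul (cf. the reformulation above)
    ϕ = exp.(-((reshape(x, 1, din, batch) .- l.centers) ./ l.width) .^ 2)
    return l.W * reshape(ϕ, :, batch)
end

# Drop it into a Chain where a Dense/MLP block would normally go:
model = Chain(RBFKANLayer(4, 16), RBFKANLayer(16, 3), softmax)
y = model(randn(Float32, 4, 10))   # 3 × 10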

Amazing page:

Look at e.g.

Parallelism Concepts

And Julia is most likely behind.

It’s not too important to do everything from scratch; see e.g. Jjama3.jl by @noob and @AntonOresten, which is impure, depending on Python/Rust code (but why not Rust directly?). Its other dependency, BytePairEncoding.jl, is though a “Pure Julia implementation of the Byte Pair Encoding (BPE) method.”:

I absolutely agree we shouldn’t bother implementing tokenizers in Julia, but rather reuse them, and even better get rid of them (I also see Karpathy is now at a new AI company, Eureka Labs, after leaving OpenAI, and Tesla before that):

There is a whole separate stage with its own training and inference, and additional libraries. It complicates the ingest of additional modalities. Tokenization also has many subtle sharp edges. Few examples: […]
Tokenization creates attack surfaces, e.g. SolidGoldMagikarp […]
The list goes on, TLDR everyone should hope that tokenization could be thrown away. Maybe even more importantly, we may find general-purpose strategies for multi-scale training in the process.
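
For reference, the “no tokenizer” baseline the quote points toward is trivial in Julia: treat the raw UTF-8 bytes as the vocabulary (256 symbols), at the cost of longer sequences.

encode_bytes(s::AbstractString) = Int.(codeunits(s)) .+ 1   # 1-based token ids in 1:256
decode_bytes(ids) = String(UInt8.(ids .- 1))

ids = encode_bytes("Julia ♥ LLMs")
println(ids)                    # note the multi-byte UTF-8 encoding of '♥'
println(decode_bytes(ids))      # round-trips to the original string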

Looking into “multi-scale training” I find a lot (most of it not directly on LLMs but on images or time series; I’m not sure if the ideas translate to LLMs):

https://arxiv.org/html/2410.11674

See also there Open-Sora:

Something I was seeing but haven’t looked at closely enough to know if it’s relevant for us:

This seems important and only 9 pages:
https://arxiv.org/pdf/2407.00952

https://arxiv.org/pdf/2405.09394

Experimental results demonstrate that SA-FedLoRA is an efficient FL, achieving superior performance to FedAvg and significantly reducing communication parameters by up to 93.62%

I did not expect to see “wireless” and “jamming resistant” in relation to LLMs:

R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models

https://arxiv.org/pdf/2407.11654

https://www.reddit.com/r/MachineLearning/comments/1cfj9kf/crosspost_on_improving_llm_efficiency_using_split/

1 Like

Jjama3.jl itself is pure, and you can use any tokenizer you like. It initially used BytePairEncoding.jl for its tokenizer, but then I couldn’t get that to work for eg. HF’s SmolLM2. @AntonOresten wrapped the Python interface in HuggingFaceTokenizers.jl I think just because of familiarity with wrapping Python rather than Rust (I’m not actually sure their Rust library will handle eg. importing a pre-trained tokenizer in the config we need?).

I 100% support the BytePairEncoding.jl effort, but for now we want to tinker with models and we don’t care what does the tokenization, especially if it means avoiding subtle footguns (and in tokenization there are MANY).

3 Likes

Transformers.jl is not just the interface to pre-trained models. It has all the building blocks needed for both training and inference. You can define your own LM with it and train it with Flux on CPU/GPU. The corresponding operations are defined in NeuralAttentionlib.jl for both CPU and GPU with many optimizations.

Unfortunately, no. A lot of configurations are hard-coded inside the transformers package. The tokenizers package is not sufficient to perform the exact tokenization. That’s the main reason why we cannot have a stable tokenizer loader in Julia.

4 Likes

The current best models are: Qwen/QwQ-32B-Preview · Hugging Face

Best for many languages including Julia, Python and R (and on average across the languages), compared to e.g. GPT-4o-2024-08-06; DS_coder-V2-Instruct matches it for Julia, and exceeds it for some languages, such as PHP and Scala.

I ask all LLMs this first:

What is the Julia language and can you show me example code?

It gave a good answer, with code appended, ranging from simple examples up to:

7. Solving a Simple Differential Equation (using DifferentialEquations.jl)

I haven’t confirmed it’s correct yet, but it was plausible, and so is the following (though likely at least a bit wrong), produced after some further questions:

As of my last update, Adan is not directly implemented in Flux.jl. However, you can implement it yourself or use a custom implementation. Below is a basic implementation of the Adan optimizer in Julia:

Custom Adan Optimizer in Julia

using Flux
using Flux: crossentropy, params
using Statistics
using MLDataPattern
using RDatasets

# Define the Adan optimizer
mutable struct Adan  # mutable so the step counter can be updated in place
    lr::Float32
    β1::Float32
    β2::Float32
    β3::Float32
    eps::Float32
    step::Int
    m::Dict
    v::Dict
    h::Dict
end

function Adan(lr = 0.001f0, β1 = 0.9f0, β2 = 0.999f0, β3 = 0.999f0, eps = 1f-8)
    return Adan(Float32(lr), Float32(β1), Float32(β2), Float32(β3), Float32(eps), 0, Dict(), Dict(), Dict())
end

function (opt::Adan)(gs)  # gs: gradients from Flux.gradient(..., params(model))
    opt.step += 1
    for p in gs.params
        g = gs[p]
        g === nothing && continue  # skip parameters without a gradient
        if !haskey(opt.m, p)
            opt.m[p] = zero(p)
            opt.v[p] = zero(p)
            opt.h[p] = zero(p)
        end

        m, v, h = opt.m[p], opt.v[p], opt.h[p]

        m .= opt.β1 * m .+ (1 - opt.β1) * g
        v .= opt.β2 * v .+ (1 - opt.β2) * (g .^ 2)
        h .= opt.β3 * h .+ (1 - opt.β3) * (g .- m) .^ 2

        m_hat = m ./ (1 - opt.β1 ^ opt.step)
        v_hat = v ./ (1 - opt.β2 ^ opt.step)
        h_hat = h ./ (1 - opt.β3 ^ opt.step)

        p .-= opt.lr .* m_hat ./ (sqrt.(v_hat .+ h_hat) .+ opt.eps)
    end
end

# Load the Iris dataset
iris = dataset("datasets", "iris")

# Convert the data to arrays
X = permutedims(Matrix{Float32}(iris[:, 1:4]))  # features × samples
Y = iris[:, 5]

# One-hot encode the labels
Y = Flux.onehotbatch(Y, unique(Y))

# Split the data into training and testing sets
(train_x, train_y), (test_x, test_y) = stratifiedobs((X, Y), p=0.8)

# Define the neural network model
model = Chain(
    Dense(4, 10, relu),
    Dense(10, 3),
    softmax
)

# Define the loss function
loss(x, y) = crossentropy(model(x), y)

# Define the Adan optimizer
opt = Adan(0.001, 0.9, 0.999, 0.999, 1e-8)

# Define the training loop
function train!(model, data, opt, n_epochs=100)
    for epoch in 1:n_epochs
        for (x, y) in data
            gs = gradient(() -> loss(x, y), params(model))
            opt(gs)
        end
        if epoch % 10 == 0
            println("Epoch $epoch: Loss = $(loss(train_x, train_y))")
        end
    end
end

# Create a DataLoader for batch training
data_loader = [(train_x, train_y)]

# Train the model
train!(model, data_loader, opt, 100)

# Make predictions on the test set
predictions = model(test_x)

# Convert predictions to class labels
predicted_labels = [argmax(predictions[:, i]) for i in 1:size(predictions, 2)]
true_labels = [argmax(test_y[:, i]) for i in 1:size(test_y, 2)]

# Calculate accuracy
accuracy = mean(predicted_labels .== true_labels)
println("Test Accuracy: $(accuracy * 100)%")

Explanation

  1. Adan Optimizer Definition:
  • Parameters: Learning rate (lr), betas for the first, second, and third moments (β1, β2, β3), and epsilon (eps).
  • State Variables: m (first moment), v (second moment), and h (third moment).
  2. Update Rule:
  • Moments Calculation: Compute the first, second, and third moments.
  • Bias Correction: Apply bias correction to the moments.
  • Parameter Update: Update the parameters using the corrected moments and the learning rate.
  3. Training Loop:
  • Gradient Calculation: Compute the gradients using gradient.
  • Optimizer Update: Apply the Adan optimizer to update the model parameters.
  4. Evaluation:
  • Predictions: Make predictions on the test set.
  • Accuracy Calculation: Calculate the test accuracy.

Conclusion

By implementing the Adan optimizer in Julia, you can leverage its advanced features for better convergence and performance in your deep learning models. If you find that this custom implementation meets your needs, you can use it in your projects. If you encounter any issues or need further optimizations, feel free to ask!

Great! Though it doesn’t seem to have the best building blocks. Then what can be improved, for best bang for the buck, and in which order?

It uses Adam (not to be confused with Adan or AdamW), i.e. what Flux.jl has; but Flux doesn’t have Adan, and we can’t compete in training without it:

[The first line there was added to the README 6 months ago, and I also think we need, or at least want, to support MoE, but can’t yet, I believe]

:fire::fire::fire: Results on large language models, like MoE and GPT2, are released.
FusedAdan with less memory footprint is released.
[…]
Adan has a slightly higher GPU memory cost than Adam/AdamW on a single node. However, this problem can be solved using the ZeroRedundancyOptimizer, which shares optimizer states across distributed data-parallel processes to reduce per-process memory footprint. Specifically, when using the ZeroRedundancyOptimizer on more than two GPUs, Adan and Adam consume almost the same amount of memory.
[…]
Adan obtains comparable results with only half cost.

I think, but I’m not sure, that most do not train from scratch, and that we might want to get a checkpoint from someone (if not fully pre-trained) and continue from there. And that you can even change the optimizer to Adan or whatever, or even finish with just Adam[W] (or whatever is the best Julia has available) if Adan was used earlier.

If you want to train from scratch you really want to use Adan somehow; that might not mean reimplementing it, if it’s somehow possible to train with such code that already exists.
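
If we did reimplement it, the idiomatic place today would be a custom rule for Optimisers.jl (what Flux’s explicit-parameter training uses), rather than the implicit-params style in the generated code above. Below is my hedged skeleton: the AbstractRule/init/apply! interface is real, but the update itself just follows my reading of the Adan paper (no weight decay, bias correction, or fusing), so it would need to be checked against the official implementation before trusting any results.

using Optimisers

struct AdanRule <: Optimisers.AbstractRule
    eta::Float64
    beta1::Float64
    beta2::Float64
    beta3::Float64
    eps::Float64
end
AdanRule(; eta = 1e-3, beta1 = 0.02, beta2 = 0.08, beta3 = 0.01, eps = 1e-8) =
    AdanRule(eta, beta1, beta2, beta3, eps)

# Per-parameter state: first moment m, gradient-difference moment v,
# second moment n, and the previous gradient.
Optimisers.init(::AdanRule, x::AbstractArray) = (zero(x), zero(x), zero(x), zero(x))

function Optimisers.apply!(o::AdanRule, state, x, dx)
    m, v, n, gprev = state
    gdiff = dx .- gprev
    m = (1 - o.beta1) .* m .+ o.beta1 .* dx
    v = (1 - o.beta2) .* v .+ o.beta2 .* gdiff
    n = (1 - o.beta3) .* n .+ o.beta3 .* (dx .+ (1 - o.beta2) .* gdiff) .^ 2
    step = o.eta .* (m .+ (1 - o.beta2) .* v) ./ (sqrt.(n) .+ o.eps)
    return (m, v, n, copy(dx)), step   # `step` is subtracted from x by Optimisers.update!
end

# Usage with Flux's explicit-gradient API (model/loss are placeholders):
# opt_state = Optimisers.setup(AdanRule(), model)
# grads = Flux.gradient(m -> loss(m, x, y), model)[1]
# opt_state, model = Optimisers.update!(opt_state, model, grads)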

I really want to know what the best optimizer is, e.g. Lion or Sophia, or whether it depends on use for an LLM versus other modalities (though the world is going multi-modal, so that is likely what we want to focus on): [2307.06440] No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

I am contradicting what this paper states about optimizers being comparable (since it didn’t look at Adan, nor AdamW if I recall): [2407.07972] Deconstructing What Makes a Good Optimizer for Language Models

We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably

While in the Adan paper:

We evaluate Adan on the large language models (LLMs), GPT-2 [45], for code generalization tasks, which enables the completion and synthesis of code, both from other code snippets and natural language descriptions
[…]
B.1 Pre-training Results on LLMs
[…]
TABLE 15: Comparison of training loss for MoE with different data volumes and model sizes using Adan and AdamW.
[…]
The results, as summarized in Table 15, indicate that Adan consistently outperforms the AdamW optimizer across all configurations and data volumes. This improvement underscores Adan’s capacity for efficient parameter updates and its utility in large-scale distributed training setups.
[…]
B.2 Detailed Comparison on ViTs
Besides AdamW, we also compare Adan with several other popular optimizers, including Adam, SGD-M, and LAMB, on ViT-S. Table 16 shows that SGD, Adam, and LAMB perform poorly on ViT-S, which is also observed in the works [ 103],
[ 29]. These results demonstrate that the decoupled weight decay in Adan and AdamW is much more effective than 1) the vanilla weight decay, namely the commonly used ℓ2 regularization in SGD, and 2) the one without any weight decay, since as shown in Eqn. (6), the decoupled weight decay is a dynamic regularization along the training trajectory and could better regularize the loss. Compared with AdamW, Adan’s advantages mainly come from its faster convergence speed. This empirical evidence solidifies Adan as a superior choice for training ViTs, particularly when rapid convergence is essential

The paper on Adan is from 2022, but was updated days ago; it doesn’t mention “fused”, and FusedAdan seems also critical.

I only found Adan together with Julia here (but “PLSR1_AdaN” [with underscore, not easily searchable] probably refers to something else, though that optimizer from September 2023 in Julia might be interesting): https://www.gerad.ca/fr/papers/G-2023-41.pdf

PLSR1: A limited-memory partitioned quasi-Newton optimizer for partially-separable loss functions

Abstract : Improving neural network optimizer convergence speed is a long-standing priority. Recently, there has been a focus on quasi-Newton optimization methods, which have fewer hyperparameters compared to gradient-based methods and show improved convergence results in deterministic optimization. We introduce PLSR1, a limited-memory partitioned quasi-Newton optimizer designed for optimizing a partially separable loss function, which is a sum of element loss functions of smaller dimensions
[…]
Thus, we adapt standard neural network architectures by incorporating separable layers, creating a partitioned architecture (PSNet). The numerical results compare the performance of several optimizers training the same partially-separable loss function on LeNet and PSNet architectures of similar sizes and effectiveness. The graphs exhibit the optimizer accuracies over epochs, on both the MNIST and CIFAR10 datasets. PLSR1 and an adaptative Nesterov variant show a training convergence comparable to Adam and outperforms LBFGS and SGD.
[…]
Figure 4 illustrates the comparison between several optimizers: SGD, Adam, LBFGS, and PLSR1 and PLSR1_AdaN.
[…]
To take full advantage of partial separability within the loss, it’s crucial to incorporate separable layers into the architecture, creating a partitioned architecture. In this setup, our empirical results demonstrate that PLSR1 outperforms SGD and LBFGS and is competitive with Adam. Regarding future work, we aim to integrate this partitioned training into a distributed computing framework to lean toward federated learning

Also done with Julia:

2 Likes

Could you elaborate?

It’s not meant as a criticism of your package. By “have”, I should have written “uses”, e.g. it uses Adam not Adan (and that is about Flux.jl, i.e. your dependency), as I explained; so I meant the whole Julia ecosystem doesn’t have the best components available. KAN instead of MLP (a drop-in replacement for it) is another improvement.

If you were just asking about the order of improving, then both KAN and better optimizers are important, and they can be done in either order. I’m pretty sure KAN is an improvement, and is here to stay, maybe with alternative b-spline implementations, but getting any implementation would be a good first step.

I might be wrong about Adan, or something even better can always come along. It is used for LLMs too, though at first I wasn’t sure whether it’s only an improvement for non-LLMs. Even then it would be good to have it available.

Quantization seems important, but less so if we go with KAN; then I would skip implementing it, as it may be a distraction and not needed going forward.

@chengchingwen, why did you use Adam, not AdamW? I’ve read that AdamW is rather better, so was it not available in Julia at the time? Adan is better still, but wasn’t available, so that’s understandable. I want to know which is best, and be confident in my findings; I’m not saying Adam was a bad choice, it’s hard to know and be confident, and people tend to go with the flow. Adan is better but maybe not a lot better, so KAN may be the thing to focus on, unless Adan is just simple to implement, and I may already have it in Julia, i.e. the code the LLM generated for me, which after fixing some bugs seems to run.

2 Likes

Your initial impression is probably correct, see [2407.16674] KAN or MLP: A Fairer Comparison.

This is also the right kind of question to ask. I think most of us who’ve done ML-related research understand how exhausting it is to walk on the paper treadmill. When 95% of them end up amounting to nothing, it doesn’t make sense to dedicate resources toward testing every remotely promising or novel idea.

The fundamental limitation of ML/DL in Julia is that there are very few people who could contribute (i.e. have the expertise and willingness), and they have very little bandwidth to do so. Any strategy to improve the ecosystem must take this into account. It’s fine to try implementing a paper every now and again, but that has to come with the recognition that implementing any old paper is going to on average do nothing to improve the overall ecosystem.

So if not the paper treadmill, then what? Let me repeat a suggestion I made on another recent thread: start with a couple of concrete gaps or use cases, then work upstream to see what is needed across the ecosystem to support them. This is how we got Transformers.jl. It’s why Llama2.jl and now Jjama3.jl exist too! Heck, it’s arguably why the entire JuliaGenAI org exists. Are we trying to push the SOTA? No, and that’s fine because it would be unrealistic to try with the resources the Julia ML ecosystem has. Better to target what people are actually using/doing in the wild.

4 Likes

I was looking around HuggingFace for models that were trained in Julia completely and could not find any; maybe I did not look correctly or thoroughly enough. It would be great to see that happen. It is possible that Julia is used and it is simply not mentioned.

@ToucheSir I can imagine how busy everyone is. It would be great if there was a map/diagram of what has been done so that new efforts don’t reinvent the wheel. For instance, some of the people who have developed key packages could outline the parts they do not have time to develop, so that a ‘roadmap’ of some sort is made to avoid wasted effort and reuse some of the great work already done, like what @chengchingwen describes in Transformers.jl. I imagine that a lot of the core parts are already there, and wonder which of the peripheral components are needed. Then some advice on how to explore new approaches of the sort @Palli mentioned easily/efficiently. Maybe it is just the way that I think, but I’m imagining a high-level component diagram with stages of a stack, so milestones can be set up. I totally agree with your approach to just get into the task and see how the effort ends up, and in the absence of any top-down advice I will just go along that route, or with anyone else eager to take up the task.

It’s not a bad idea, but in my experience these roadmaps never materialize because they require extra side-of-desk work from a bunch of people. The number of relevant packages for creating an LLM in Julia is pretty finite. Case in point, I believe at least one author from each has replied on this thread.

As such, it may be easier to ask questions about specific packages as they come up. Alternatively, declare what route(s) you want to take when you embark on a task and let others chime in with suggestions :slight_smile:

1 Like

Simply because I don’t have any preference for optimizers. Optimizers aren’t and shouldn’t be part of the transformer package. Any optimizer that works with Flux can be applied, but whether it would give optimal results is beyond the scope (specifically, beyond my bandwidth).

3 Likes

That is so true :smiling_face_with_tear:

I understand that @Palli’s point is to overtake others on a curve with the rise of some new structures. That may be hopeful. But a healthier approach is what @ToucheSir suggests here, so that we can prepare well for the next boom.

Based on my experience, there are still tons of missing components for training an LLM from scratch in pure Julia. (But don’t get me wrong, I’m still optimistic :wink: Personally I think the most important issue right now is that top-tier GPUs are still too expensive and not that easily accessible to most people. But in the long run, I believe the price will go down and more developers will realize that they deserve a better programming language or framework. I’m talking about Megatron :rage: )

I got some bandwidth recently and tried to implement many deep generative models (VAE, GAN, VQGAN, LLaMA, MoE, DDPM) from scratch with Lux.jl. Honestly speaking, I really like its design. But to move on to models of large size across multiple nodes, we still have a lot of work to do. (And I’m not sure if it is worth working in this direction right now.)

A more practical roadmap from my point is probably:

  • Single card inference
  • Multi card(node) inference
  • Single card training
  • Multi card training
  • Multi node training

And it would be great to have one public repo for people to report the benchmark results so that we can know what others have achieved until now.

3 Likes

Not until you try to implement distributed optimizers with ZeRO stages 1-3 or pipeline parallelism :wink:

1 Like

I personally and optimistically hope that can be handled by package/people that implement the distributed optimizers. While I do actually hope we can have that in the package, it is not on my list of priorities. Essentially it’s still the same problem: too much stuff, too little time, too few people.

4 Likes

That’s a recent paper updated in August but not as recent as FC-KAN from September:

[…]
In our experiments, we compare FC-KAN with multi-layer perceptron network (MLP) and other existing KANs, such as BSRBF-KAN, EfficientKAN, FastKAN, and FasterKAN, on the MNIST and Fashion-MNIST datasets. A variant of FC-KAN, which uses a combination of outputs from B-splines and Difference of Gaussians (DoG) in the form of a quadratic function, outperformed all other models on the average of 5 independent training runs. We expect that FC-KAN can leverage function combinations to design future KANs. Our repository is publicly available at: this https URL.
[…]
By introducing a new perspective to the scientific community in the new design of neural networks, KANs inspired many works to prove their effectiveness by topics, including expensive problems [26], keyword spotting [27], mechanics problems [28], quantum computing [ 29, 30, 31], survival analysis [ 32], time series forecasting [33 , 34 , 35, 36, 37], and vision tasks [38 , 39, 40]. Also, many novel KANs utilize well-known mathematical functions, particularly those capable of handling curves, such as B-Splines [41] (Original KAN [1], EfficientKAN, BSRBF-KAN [9]), Gaussian Radial Basis Functions (GRBFs) (FastKAN [5], DeepOKAN [28], BSRBF-KAN [9]), Reflection SWitch Activation Function (RSWAF) in FasterKAN [ 6], Chebyshev polynomials (TorchKAN [42], Chebyshev KAN [8]), Legendre polynomials (TorchKAN [42]), Fourier transform (FourierKAN2, FourierKAN-GCF [43]), wavelets [7, 44], and other polynomial functions [45].
[…]
Wav-KAN is a neural network architecture that integrates wavelet functions into Kolmogorov-Arnold Networks to address challenges in interpretability, training speed, robustness, and computational efficiency found in MLP and LiuKAN [7]. By efficiently capturing both high and low-frequency components of input data, Wav-KAN achieves a balance between accurately representing the data structure and avoiding overfitting. The authors use several wavelet types, including the DoG, Mexican hat, Morlet, and Shannon. In our paper, we use the DoG function to combine other functions to create function combinations

I wouldn’t give up on the KAN idea, or at least some variant of it. I think implementing even the original KAN would help, and you could always change the spline in a later iteration (my feeling is that Julia is the best language to explore such variants).

FC-KAN has the highest accuracy (important); FastKAN and FasterKAN are, though, much faster, only about 4-10% slower than MLP, while FC-KAN is, yes, 55-84% slower than MLP.

What seemingly may kill the idea is that all the KANs have an order of magnitude more parameters (and those in MLPs can likely be quantized more), BUT the loss goes down much faster while you train, and goes much lower, so maybe you can stop sooner, once you’ve hit the same loss or better.

But it worries me: can you scale KANs up to the huge neural networks you see in transformers/LLMs? Or do you actually not have to scale them up that much in parameter count, and/or will they get very sparse? If not a lot of those parameters are zeros that compress well, then it’s a limit for one (or any number of) GPUs.

From the former paper:

We use din and dout to denote the input and output dimensions of the neural network layer. We use K to denote the order of the spline […] We use G to denote the number of spline interval
[…]
The learnable parameters in KAN include the B-spline control points, the shortcut weights, the

B-spline weights, and the bias. The total learnable parameters are

(din × dout) × (G + K + 3) + dout.

Correspondingly, the learnable parameters of one MLP layer is

(din × dout) + dout

So it has roughly (G + K + 3) times more parameters, but this is per layer, and it seems to me that if the width can be shrunk by that factor (and maybe twice that, for 4-bit quantization vs 8-bit integers) then you’re on even ground. If such a smaller network (smaller by 10x) isn’t enough, then maybe you can make it 10x deeper, and seemingly that doesn’t cost you such a factor (only during training).
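
Plugging made-up but typical-ish numbers into the two quoted formulas to see the factor:

kan_params(din, dout, G, K) = (din * dout) * (G + K + 3) + dout
mlp_params(din, dout) = din * dout + dout

din = dout = 1024; G = 5; K = 3                              # grid size and spline order, just for illustration
println(kan_params(din, dout, G, K))                         # 11_535_360
println(mlp_params(din, dout))                               # 1_049_600
println(kan_params(din, dout, G, K) / mlp_params(din, dout)) # ≈ 11, i.e. roughly G + K + 3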

1 Like

Great to know! Are there any other parts you think need attention?

Is there anything else that there is no time for which was on your mind? Like if you had more time, what packages would you like to create?

Could that be integrated into a current package to test?

Would be great to explore, and even better to do it in pure Julia

1 Like

Yes, likely there: Add New Models · Transformers.jl

I do see MLP in the code, and actually only BloomMLP there in the docs… I think it’s just a synonym for a regular MLP(?), but patterned after the Bloom model (as in its dimensions). Likewise, I had never heard of BloomAttention or BloomGelu before. It does have attention, but it doesn’t seem to support FlashAttention, which we likely want. And GeLU is much used.
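
For reference (my own sketch, not NeuralAttentionlib’s code): plain single-head scaled dot-product attention is the computation a FlashAttention kernel fuses, so that the full L×L score matrix never has to be materialized in GPU memory.

using LinearAlgebra

# Column-wise softmax with the usual max subtraction for numerical stability.
softmax_cols(S) = (E = exp.(S .- maximum(S; dims = 1)); E ./ sum(E; dims = 1))

# Q, K, V are d × L (features × positions) for one attention head.
function attention(Q, K, V)
    d = size(Q, 1)
    scores = (K' * Q) ./ sqrt(Float32(d))   # L × L logits: the matrix FlashAttention avoids storing
    return V * softmax_cols(scores)         # d × L output
end

Q = randn(Float32, 64, 128)
K = randn(Float32, 64, 128)
V = randn(Float32, 64, 128)
Y = attention(Q, K, V)                      # 64 × 128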

I also see MLP in Flux.jl itself; I didn’t look carefully at where KAN should replace MLP, likely in either package. I understand KAN is a drop-in replacement, i.e. conceptually; you need to adjust for more parameters (see my edited previous post).

What worries me is this part:

Under fixed training iterations, we found that KAN’s forgetting issue is more severe than that of MLP.

“Catastrophic forgetting” is a known problem with transformers/MLPs (with later fine-tuning and continual learning). I need to look more into what that refers to for KANs, whether it’s the same thing, and whether it’s a real issue. Note, they are not comparing KAN to a traditional MLP, but to an MLP with B-splines, i.e. seemingly already taking ideas from KAN, and we would most likely just implement the original MLP.

The differences between KAN and common MLP are in two aspects. (1) Different activation functions. Commonly, the activation functions in MLP, such as ReLU and GELU, have no learnable parameters and are uniform for all input elements. However, in KAN, the activation function is a spline function, which has learnable parameters and is different for each input element. (2) The order of linear and non-linear operation. Generally, we conceptualize MLP as performing a linear transformation followed by a non-linear transformation. However, KAN actually performs a non-linear transformation first, followed by a linear transformation.
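
Point (2) of that quote in one schematic line each (φ standing for the learnable basis expansion, as in the B-spline sketch earlier):

mlp_layer(W, b, x, σ) = σ.(W * x .+ b)   # MLP: linear map first, then a fixed, parameter-free nonlinearity
kan_layer(W, x, φ) = W * φ(x)            # KAN: learnable per-input nonlinearity first, then a linear combination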

1 Like