Community Interest Check: LLMs from Scratch in Pure Julia

In short, I think the answer is no for quantization, and in practice we might even be limited to Float32; more on this at the bottom. I don’t think Float32 is used at all anymore in any mainstream or research work (since it’s 2x, or 4x+, slower), except by Julia still?! But even Julia’s Float16 might be OK for KANs:

My thinking is that if we want to do a “Julia LLM from scratch” project at all, then for training (and even for just inference) it needs to be competitive, using recent methods, or why bother? Or, even better, it should be ahead of the curve, e.g. by using KANs. There’s no need for, and no point in, implementing old ideas, since naive or outdated methods take much longer to train.

Quantization has been the go-to method for faster inference (it lets you fit more parameters into the same fixed amount of [GPU/TPU] memory), and I believe by now it can actually also help with training.

So I’ve been wondering whether quantization does, or even can, apply to KANs (or to other new ideas; I don’t want to lead us down the wrong path). It seems so:

Hardware Acceleration of Kolmogorov–Arnold Network (KAN) for Lightweight Edge Inference

https://arxiv.org/pdf/2409.11418
Acceptance date: September 2, 2024

Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters by parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. Evaluating the B-spline functions can be performed by using look-up tables (LUTs) to directly map the B-spline functions, thereby reducing computational resource requirements. However, this method still requires substantial circuit resources (LUTs, MUXs, decoders, etc.). For the first time, this paper employs an algorithm-hardware co-design methodology to accelerate KAN. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware aware quantization, KAN sparsity aware mapping strategy, and circuit-level techniques […]
with analog-CIM (ACIM) circuits. The impact of non-ideal effects, such as partial sum errors caused by the process variations, has been evaluated with the statistics measured from the TSMC 22nm RRAM-ACIM prototype chips. With the best searched hyperparameters of KAN and the optimized circuits implemented in 22 nm node, we can reduce hardware area by 41.78x, energy by 77.97x with 3.03% accuracy boost compared to the traditional DNN hardware

We of course don’t have access to exotic analog prototype chips, but I think that’s OK: we could start with non-quantized KANs, as has been done in Python (in only 285 lines, in the code below), and later add 8-bit quantization, on regular hardware but otherwise as done here:

We propose an Alignment-Symmetry and PowerGap KAN hardware aware quantization that, for the first time, investigates the interaction between quantization grid and knot grid in KAN. The proposed method significantly minimizes the cost of LUTs, MUXs, decoders for B(X) function.
[…]
our focus is on accelerating the w_s·spline(x) computation. In our implementation, w_s is multiplied with c_i and becomes c_i′, which is then quantized to 8-bit, transforming the formula to equation (3)
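
This is not the paper’s Alignment-Symmetry/PowerGap scheme, but a minimal sketch of the basic move it builds on: fold w_s into the spline coefficients c_i and store the result in 8 bits. Plain symmetric int8 quantization, with made-up sizes:

```julia
# Fold w_s into the coefficients (c_i' = w_s * c_i), then quantize to int8.
# Generic symmetric quantization only; not the paper's hardware-aware scheme.
function quantize_int8(c::AbstractVector{<:Real})
    scale = maximum(abs, c) / 127            # one shared scale per coefficient vector
    q = round.(Int8, clamp.(c ./ scale, -127, 127))
    return q, Float32(scale)
end

dequantize(q, scale) = scale .* Float32.(q)

w_s      = 0.7f0                  # hypothetical spline weight
c        = randn(Float32, 16)     # hypothetical B-spline coefficients c_i
c_folded = w_s .* c               # c_i' with w_s folded in
q, scale = quantize_int8(c_folded)
maximum(abs, dequantize(q, scale) .- c_folded)   # error is on the order of scale/2
```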

Note that KANs are a drop-in replacement for MLPs, i.e. for a critical part of transformers and more.
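
In Flux terms, “drop-in” just means matching the (in, out) signature of a transformer’s feed-forward block; a KAN layer with the same dimensions could replace the two Dense layers. A sketch, where KANLayer is hypothetical and only shows the shape of the swap:

```julia
using Flux

d_model = 512
# Standard transformer feed-forward (MLP) block:
mlp_block = Chain(Dense(d_model => 4 * d_model, gelu),
                  Dense(4 * d_model => d_model))

# A KAN-based block would keep the same input/output dimensions, e.g.
#   kan_block = Chain(KANLayer(d_model => 4 * d_model),
#                     KANLayer(4 * d_model => d_model))
# where KANLayer is hypothetical: learnable spline activations on each edge
# instead of a weight matrix followed by a fixed nonlinearity.

x = randn(Float32, d_model, 32)   # (features, batch)
size(mlp_block(x))                # (512, 32) -- the contract a KAN block must keep
```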

It might seem like only getting down to 8 bits is not good, 2x to 8x larger than the competing mainstream formats, but it’s a win if the parameter count is reduced by at least as much, e.g. 8x+. And I think KANs must already have been considered a win for space before quantization was introduced, given the “orders of magnitude fewer parameters”; likely a huge win for space, and for compute?
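
A back-of-the-envelope check of that claim (the parameter counts below are made-up illustration numbers, not measurements):

```julia
# Weight memory at a given parameter count and bit width, in GiB.
weight_gib(params, bits) = params * bits / 8 / 2^30

n_mlp = 7_000_000_000      # hypothetical MLP/transformer parameter count
n_kan = n_mlp ÷ 8          # assume a KAN needs 8x fewer parameters

println("MLP @ 16-bit: ", round(weight_gib(n_mlp, 16), digits = 2), " GiB")
println("MLP @ 4-bit:  ", round(weight_gib(n_mlp, 4),  digits = 2), " GiB")
println("KAN @ 8-bit:  ", round(weight_gib(n_kan, 8),  digits = 2), " GiB")
# With 8x fewer parameters, the 8-bit KAN weights take 1/4 of the 4-bit MLP's
# space and 1/16 of the 16-bit MLP's, so 8 bits is not a dealbreaker.
```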

https://www.reddit.com/r/learnmachinelearning/comments/1ffqutz/how_would_backpropagation_work_in_kans/

For backpropagation purposes, it is also convenient to store the output derivative with respect to the inputs:

However, all activation functions are linear combination of a fixed set of basis functions which are B-splines; given that, we can reformulate the computation as activate the input with different basis functions and then combine them linearly. This reformulation can significantly reduce the memory cost and make the computation a straightforward matrix multiplication, and works with both forward and backward pass naturally.

The problem is in the sparsification which is claimed to be critical to KAN’s interpretability.
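
A minimal sketch of that reformulation (Cox–de Boor recursion for the shared B-spline basis, then one matrix multiplication; grid, degree, and sizes are made-up illustration values, and a real KAN layer batches this over all input–output edges):

```julia
# Cox–de Boor recursion: entry (j, i) is basis function i evaluated at x[j].
function bspline_basis(x::AbstractVector, grid::AbstractVector, degree::Int)
    B = [Float32(grid[i] <= xj < grid[i+1]) for xj in x, i in 1:length(grid)-1]
    for k in 1:degree
        Bnew = zeros(Float32, length(x), length(grid) - k - 1)
        for i in 1:size(Bnew, 2), (j, xj) in enumerate(x)
            left  = (xj - grid[i])     / (grid[i+k]   - grid[i])
            right = (grid[i+k+1] - xj) / (grid[i+k+1] - grid[i+1])
            Bnew[j, i] = left * B[j, i] + right * B[j, i+1]
        end
        B = Bnew
    end
    return B                          # size (length(x), length(grid) - degree - 1)
end

degree  = 3
grid    = collect(range(-2f0, 2f0; length = 12))   # knot vector
n_basis = length(grid) - degree - 1                # 8 basis functions
x       = rand(Float32, 64) .* 3 .- 1.5f0          # 64 scalar inputs on one edge
coeffs  = randn(Float32, n_basis)                  # trainable c_i for that edge

basis    = bspline_basis(x, grid, degree)          # (64, 8): shared basis activations
spline_x = basis * coeffs                          # one matmul gives spline(x) for all inputs
```

Stacking the basis activations over all input dimensions turns the whole layer into a single (batch × in·n_basis) by (in·n_basis × out) matrix multiplication, which is where the memory and compute savings come from.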

Activations were the bottleneck before; with KANs, or anything newer, the bottleneck might shift. Still, current mainstream quantization (and pruning and other sparsification) might work for them, complementing KAN layers that might only be used for some of the parameters.

My other worry is: should we go for something more brain-inspired? (The brain provably doesn’t use backpropagation, or at least backprop in the brain is considered implausible, though I’ve seen claims it might actually be happening…) I.e., is current AI/ML/LLM work on the wrong path, as argued in Jeff Hawkins’s excellent book (which I’ve read; his earlier one I have not)? Current AI is not based on spiking neurons. There is already work on spiking neural networks in Julia, and some spiking neural network hardware is available.

Not read; $254 is pretty steep, though cheaper used, and $75 on Kindle:

https://arxiv.org/html/2408.14811v1

BitNet: LLM Quantization at its Extreme Kolmogorov-Arnold Networks

Explaining Paper: [2410.23168] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Julia has some native BFloat16 support by now (in 1.11; see “Adapt to upstream changes wrt. native support for BFloat16 by maleadt · Pull Request #51 · JuliaMath/BFloat16s.jl · GitHub”), e.g. for the AMD EPYC 9554 CPU, but I’m not sure that’s good enough, since you want to use GPUs or TPUs anyway, and I see:

Also, if I understand correctly, CUDA.jl doesn’t fully support bfloat16, which is quite limiting.

I don’t think Flux uses mixed-precision, so probably no. It is possible to configure CUDA.jl to use tensor cores more eagerly, at the expense of some precision, by starting Julia with fast math enabled or by calling CUDA.math_mode!(CUDA.FAST_MATH), which will e.g. use TF32 when doing an F32xF32 matmul. Further speed-ups are possible by setting CUDA.jl’s math precision to :BFloat16 or even :Float16. Ideally though, I guess Flux.jl would have an interface to use mixed-precision arithmetic.
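
For reference, the knobs mentioned in that quote look like this (a sketch; assumes CUDA.jl with a recent NVIDIA GPU, and BFloat16s.jl for the CPU-side type):

```julia
using CUDA        # GPU arrays and math-mode control
using BFloat16s   # the BFloat16 type on the Julia side

# Opt in to tensor cores at reduced precision for Float32 matmuls (e.g. TF32):
CUDA.math_mode!(CUDA.FAST_MATH)

A = CUDA.rand(Float32, 4096, 4096)
B = CUDA.rand(Float32, 4096, 4096)
C = A * B         # may now run on tensor cores via TF32

# The quote also mentions pushing CUDA.jl's math precision further down to
# :BFloat16 or :Float16; proper mixed-precision training on top of that
# (Float32 master weights, bf16 compute) is, per the quote, not something
# Flux currently sets up for you.
```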

I don’t think Float32 is used at all anymore (since it’s 2x slower), except by Julia still?! Usually it’s brain float (bfloat16), which is NOT the same as Float16 in Julia, and Float16 is not as good a fit. bfloat16 is itself getting outdated for inference, and I think also for training by now.
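
To make the Float16 vs. bfloat16 distinction concrete (assuming BFloat16s.jl and its standard float introspection methods):

```julia
using BFloat16s   # BFloat16 type; Float16 is built into Julia

# Same 16 bits, very different trade-offs:
floatmax(Float16)    # 6.55e4    -- 5 exponent bits: overflows easily during training
floatmax(BFloat16)   # ~3.39e38  -- 8 exponent bits: same range as Float32
eps(Float16)         # ~9.8e-4   -- 10 mantissa bits: more precision
eps(BFloat16)        # ~7.8e-3   -- 7 mantissa bits: less precision, but range usually matters more for ML
```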

NOT needed to understand or implement:

KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search

https://arxiv.org/pdf/2406.17630

Quantum architecture search (QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. […]
Moreover, in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. In tackling quantum chemistry problems, we enhance the recently proposed QAS algorithm by integrating Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the traditional MLP. This modification allows us to design a parameterized quantum circuit that contains fewer 2-qubit gates and has a shallower depth, thereby improving the efficiency of finding the ground state of a chemical Hamiltonian. Further investigation reveals that KAN requires a significantly smaller number of learnable parameters compared to MLPs; however, the average time of executing each episode for KAN is higher.
