Community Interest Check: LLMs from Scratch in Pure Julia

I was talking about bfloat16 being dominant for training (not Float32), and then quantized formats being used a lot. Not KANs, though; they are new. Sorry for the confusion: I just immediately continued with mentioning them and didn’t explicitly say they were popular yet, but I think they will be, in transformers. Quantization to 4-bit is, I think, mainstream, though I often see none applied at release, i.e. models are released first in e.g. bfloat16, and then the quantization community post-quantizes them, and maybe fine-tunes.
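To make "post-quantizes" concrete, here is a minimal sketch of symmetric 4-bit block quantization of a weight vector in Julia. The block size, the symmetric scheme, and the function names are my own illustrative choices, not any particular library's or checkpoint format's layout:

```julia
# Illustrative symmetric int4 block quantization: values in -8:7, one Float32
# scale per block. Names and block size are made up for this sketch.
function quantize_int4(W::Vector{Float32}; blocksize::Int = 32)
    nblocks = cld(length(W), blocksize)
    q = Vector{Int8}(undef, length(W))        # each entry restricted to -8:7
    scales = Vector{Float32}(undef, nblocks)
    for b in 1:nblocks
        r = (b - 1) * blocksize + 1 : min(b * blocksize, length(W))
        s = maximum(abs, view(W, r)) / 7 + eps(Float32)   # per-block scale
        scales[b] = s
        for i in r
            q[i] = round(Int8, clamp(W[i] / s, -8, 7))
        end
    end
    return q, scales
end

function dequantize_int4(q::Vector{Int8}, scales::Vector{Float32}; blocksize::Int = 32)
    return Float32[q[i] * scales[cld(i, blocksize)] for i in eachindex(q)]
end

W = randn(Float32, 4096)
q, s = quantize_int4(W)
W2 = dequantize_int4(q, s)
maximum(abs, W .- W2)   # per-weight quantization error
```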

I see from @ForceBru: “Just use automatic differentiation” for backpropagation in KANs. And a paper I linked “trained four KAN models using PyTorch, each sized 17x1x14, with G values of 7, 15, 30, and 60 corresponding to array sizes of 128, 256, 512, and 1024, respectively.” So I certainly think KANs would fit into Flux.jl. While I’m no expert on Flux or Lux, if KANs do not fit there, then they should. An alternative, and an OK first step, is to implement them independently of any framework, as in: “we use KANs as a nice opportunity to implement them from scratch in simple Python (no PyTorch / TensorFlow: just some good old numpy!).”
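For a feel of what a KAN layer could look like in Flux, here is a rough sketch. To keep it short it uses a fixed grid of Gaussian radial basis functions per edge instead of the paper's B-splines, and the names (`KANLayer`, the 17x1x14 shapes, G = 7) are purely illustrative; gradients come from plain AD, as @ForceBru suggested:

```julia
using Flux

# Sketch of a KAN-style layer: each edge (input i -> output j) gets a learnable
# univariate function, here a linear combination of G fixed Gaussian basis
# functions (a simplification of the B-splines used in the KAN paper).
struct KANLayer{T}
    C::Array{T,3}       # coefficients, size (out, in, G)
    centers::Vector{T}  # fixed grid of basis centers
    width::T            # basis width
end

Flux.@layer KANLayer   # register with Flux (Flux ≥ 0.14)

function KANLayer(in::Int, out::Int, G::Int; lo = -1.0f0, hi = 1.0f0)
    centers = collect(Float32, range(lo, hi; length = G))
    width = Float32((hi - lo) / (G - 1))
    KANLayer(0.1f0 .* randn(Float32, out, in, G), centers, width)
end

function (l::KANLayer)(x::AbstractMatrix)       # x is (in, batch)
    in_dim, batch = size(x)
    out_dim, _, G = size(l.C)
    # Basis values for every input and every center: (in, batch, G)
    B = exp.(-((reshape(x, in_dim, batch, 1) .- reshape(l.centers, 1, 1, G)) ./ l.width) .^ 2)
    # y[j, b] = Σ_i Σ_k C[j, i, k] * B[i, b, k]
    reshape(l.C, out_dim, in_dim * G) *
        reshape(permutedims(B, (1, 3, 2)), in_dim * G, batch)
end

# A 17×1×14-shaped stack as in the cited paper, with grid size G = 7:
model = Chain(KANLayer(17, 1, 7), KANLayer(1, 14, 7))
x = randn(Float32, 17, 32)
grads = Flux.gradient(m -> sum(abs2, m(x)), model)   # backprop via plain AD
```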

Amazing page:

Look at e.g.

Parallelism Concepts

And Julia is most likely behind.

It’s not too important to do everything from scratch (see e.g. Jjama3.jl by @noob and @AntonOresten, which is impure, depending on Python/Rust code, though why not use the Rust directly? Its other dependency, BytePairEncoding.jl, is though a “Pure Julia implementation of the Byte Pair Encoding (BPE) method.”):
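Just to illustrate the core BPE idea (this is not BytePairEncoding.jl's actual API), a toy merge loop in pure Julia might look like the following; real tokenizers work on bytes per pre-split word and never merge across whitespace:

```julia
# Toy BPE: repeatedly fuse the most frequent adjacent pair of symbols into a new
# symbol. Purely illustrative; not the BytePairEncoding.jl API.
function most_frequent_pair(tokens::Vector{String})
    counts = Dict{Tuple{String,String},Int}()
    for i in 1:length(tokens)-1
        p = (tokens[i], tokens[i+1])
        counts[p] = get(counts, p, 0) + 1
    end
    return argmax(counts)          # key (pair) with the highest count
end

function merge_pair(tokens::Vector{String}, pair)
    out = String[]
    i = 1
    while i <= length(tokens)
        if i < length(tokens) && (tokens[i], tokens[i+1]) == pair
            push!(out, pair[1] * pair[2])   # fuse the two symbols
            i += 2
        else
            push!(out, tokens[i])
            i += 1
        end
    end
    return out
end

tokens = string.(collect("lowest lower lowest"))   # start from characters
for _ in 1:6
    global tokens = merge_pair(tokens, most_frequent_pair(tokens))
end
tokens   # inspect which multi-character symbols emerged
```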

I absolutely agree we shouldn’t bother implementing tokenizers in Julia; rather reuse them, and even better get rid of them entirely (I also see Karpathy is now at a new AI company, Eureka Labs, after leaving OpenAI, and Tesla before that):

There is a whole separate stage with its own training and inference, and additional libraries. It complicates the ingest of additional modalities. Tokenization also has many subtle sharp edges. Few examples: […]
Tokenization creates attack surfaces, e.g. SolidGoldMagikarp […]
The list goes on, TLDR everyone should hope that tokenization could be thrown away. Maybe even more importantly, we may find general-purpose strategies for multi-scale training in the process.

Looking into “multi-scale training” I find a lot (most of it not directly on LLMs but on images or time series; not sure if the ideas translate to LLMs):

https://arxiv.org/html/2410.11674

See also Open-Sora, referenced there:

Something I saw but haven’t looked at closely enough to know if it’s relevant for us:

This seems important and only 9 pages:
https://arxiv.org/pdf/2407.00952

https://arxiv.org/pdf/2405.09394

“Experimental results demonstrate that SA-FedLoRA is an efficient FL, achieving superior performance to FedAvg and significantly reducing communication parameters by up to 93.62%”
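The communication saving comes from clients only exchanging their low-rank adapters rather than full weight deltas. A rough FedAvg-over-LoRA sketch in Julia (names and shapes are mine, not from the paper; note that averaging A and B separately only approximates averaging the products A*B, a known subtlety in federated LoRA):

```julia
# Rough sketch: clients send only LoRA factors (A, B); the server averages them
# FedAvg-style, so traffic scales with r*(d + k) instead of d*k.
struct LoRAAdapter
    A::Matrix{Float32}   # d × r
    B::Matrix{Float32}   # r × k
end

delta_w(a::LoRAAdapter) = a.A * a.B   # the implied ΔW, never sent in full

function fedavg(adapters::Vector{LoRAAdapter}, weights::Vector{Float32})
    w = weights ./ sum(weights)                       # e.g. local dataset sizes
    A = sum(w[i] .* adapters[i].A for i in eachindex(adapters))
    B = sum(w[i] .* adapters[i].B for i in eachindex(adapters))
    return LoRAAdapter(A, B)
end

d, k, r = 1024, 1024, 8
clients = [LoRAAdapter(0.01f0 * randn(Float32, d, r), 0.01f0 * randn(Float32, r, k))
           for _ in 1:4]
global_adapter = fedavg(clients, Float32[3, 1, 2, 2])
# Floats communicated per client per round: r * (d + k) = 16_384,
# versus a full d × k update: 1_048_576 (≈ 98% fewer).
```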

I did not expect to see “wireless” and “jamming resistant” in relation to LLMs:

R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models

https://arxiv.org/pdf/2407.11654

https://www.reddit.com/r/MachineLearning/comments/1cfj9kf/crosspost_on_improving_llm_efficiency_using_split/
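For anyone unfamiliar with the “split” part: in split (federated) learning the model is cut at some layer; the client runs the early layers on-device and only the intermediate activations cross the network, hence the wireless/jamming angle. A toy forward pass in Flux, with the cut point and sizes purely illustrative:

```julia
using Flux

# Toy split of a model at an arbitrary cut layer; sizes are made up.
client_part = Chain(Dense(512 => 512, gelu), Dense(512 => 512, gelu))  # runs on-device
server_part = Chain(Dense(512 => 512, gelu), Dense(512 => 32_000))     # runs on the server

x = randn(Float32, 512, 8)       # a toy batch of embeddings
smashed = client_part(x)         # only these activations cross the (wireless) link
logits  = server_part(smashed)   # server finishes the forward pass; gradients flow back the same way
```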
