Community Interest Check: LLMs from Scratch in Pure Julia

Is there active interest in developing LLMs from the ground up in pure Julia, or is there ongoing work on this that I have not come across?

Current

I’ve been reviewing existing Julia LLM projects:

  • Transformers.jl (~1K commits, last updated 4 months ago) seems to focus on providing interfaces to pre-trained models like BERT/Llama
  • TransformerBlocks.jl (<100 commits, not recently maintained) offers some building blocks but isn’t actively developed
  • There appears to be a gap in pure Julia implementations for training LLMs from scratch

Project Interest

  1. Developing pure Julia implementations for LLM training (not just inference)
  2. Building the necessary distributed training infrastructure
  3. Creating efficient Julia-native attention mechanisms and optimizers (see the attention sketch after this list)
  4. Leveraging Julia’s strengths
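
To make point 3 concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Julia. The function names and the (features × sequence) layout are my own choices for illustration, not taken from any existing package, and this is not an optimized or fused kernel:

```julia
# Minimal single-head scaled dot-product attention in plain Julia (sketch only).
# Q, K are (d_k × seq_len), V is (d_v × seq_len).
function softmax_cols(x)
    m = maximum(x; dims=1)
    e = exp.(x .- m)
    e ./ sum(e; dims=1)
end

function attention(Q, K, V)
    d_k = size(Q, 1)
    scores = (K' * Q) ./ sqrt(d_k)   # (seq_len × seq_len): column j scores query j against all keys
    A = softmax_cols(scores)         # attention weights; each column sums to 1
    V * A                            # weighted sum of values, (d_v × seq_len)
end

# Tiny usage example
Q = randn(Float32, 8, 4); K = randn(Float32, 8, 4); V = randn(Float32, 16, 4)
out = attention(Q, K, V)             # 16×4
```
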

Looking For

  • Others interested in building LLMs from scratch in Julia
  • Insights from those who’ve attempted similar projects
  • Discussion about technical challenges and Julia-specific questions
3 Likes

The existence of GitHub - cafaxo/Llama2.jl (a Julia package for inference and training of Llama-style language models) should answer some of your questions.

3 Likes

I think you might be interested in this talk by @jpsamaroo and @dhairyagandhi96:

If I am not crazy, I believe they explored just this question you are asking. Otherwise, CC @svilupp and @cpfiffer too!

5 Likes

Going pure Julia (rather than using state-of-the-art code and algorithms) is probably not the better choice, except as a learning exercise. But if you do, consider “1-bit networks” (from 2023 and from this week):

https://arxiv.org/pdf/2410.16144

It’s very likely that if you redo some software, you will reimplement an outdated approach. E.g. transformers in their current form are likely going away.

We’ve likely reached the end of the line for quantization with such 1- and 2-bit networks, and it helps keep model size down. To stay competitive in training you need thousands of GPUs, and software that can target that many, so doing it all in pure Julia seems out of the question. But maybe you can go halfway there: leave out some parts, like distributing across many GPUs, and use DeepSpeed or something similar for that.
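
For a sense of what the “1-bit” (really ternary, 1.58-bit) idea looks like, here is a rough sketch of BitNet-style absmean weight quantization in Julia. This is my own reading of the 1-bit-LLM papers, not code taken from them:

```julia
# Rough sketch of absmean ternary quantization (my reading of the BitNet-style
# papers): weights are mapped to {-1, 0, +1} with a single per-matrix scale.
function quantize_ternary(W::AbstractMatrix{<:Real}; eps=1e-8)
    γ = sum(abs, W) / length(W) + eps        # absmean scale
    Wq = clamp.(round.(W ./ γ), -1, 1)       # ternary weights in {-1, 0, 1}
    Int8.(Wq), γ                             # compact storage plus the scale for dequantization
end

W = randn(Float32, 4, 4)
Wq, γ = quantize_ternary(W)
W_approx = γ .* Float32.(Wq)                 # dequantized approximation of W
```
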

Training from scratch is still very costly, so there is no need to, since you can finetune a model for Julia use. But then you need to choose the best model to start from, along with formats/quantization as in llama.cpp or the new bitnet.cpp from Microsoft. See this on the former (and its relation to Llama2.jl):

KAN networks (they can be a drop-in replacement for the MLP part of transformers, if I recall correctly) are worthwhile to reimplement in Julia:

KAN networks are likely not compatible with 1-bit networks (I mean their weights are larger), but they might still be a good thing if you can get away with fewer of them. Also, I think the two are not entirely contradictory, since you can still keep 1-bit weights in the parts of the transformer where KAN is not replacing the MLP. But isn’t the MLP part the largest part of the total?
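
For anyone who wants to poke at the KAN idea in Julia: the gist is that each input-to-output edge carries a learnable 1-D function instead of a fixed scalar weight. Below is a toy sketch that uses a fixed Gaussian radial-basis expansion with learnable coefficients, which is my own simplification of the B-splines used in the KAN paper:

```julia
# Toy KAN-style layer: each in→out edge applies a learnable 1-D function,
# here a sum of fixed Gaussian basis functions with learnable coefficients C.
# (A simplification of the spline parameterization in the KAN paper.)
struct ToyKANLayer
    C::Array{Float64,3}       # (out, in, n_basis) coefficients
    centers::Vector{Float64}  # basis function centers
    width::Float64            # shared basis width
end

ToyKANLayer(nin, nout; n_basis=8) =
    ToyKANLayer(0.1 .* randn(nout, nin, n_basis),
                collect(range(-2, 2; length=n_basis)), 0.5)

function (l::ToyKANLayer)(x::AbstractVector)
    nout, nin, _ = size(l.C)
    y = zeros(nout)
    for j in 1:nin
        φ = exp.(-((x[j] .- l.centers) ./ l.width) .^ 2)   # basis values for x[j]
        for i in 1:nout
            y[i] += sum(view(l.C, i, j, :) .* φ)           # learnable edge function applied to x[j]
        end
    end
    y
end

layer = ToyKANLayer(4, 3)
layer(randn(4))   # 3-element output
```
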

I think it would also be worthwhile to help with this:

The best models will likely use new ways of multiplying that are not yet in software (though you could emulate them slowly(?) for compatibility until hardware catches up, or maybe just use Float8 or bfloat16, I don’t recall which might be compatible with it):
https://arxiv.org/html/2410.00907v2#S2

2 Likes

Fun topic! Is there interest? Absolutely!

Would there be demand for pure Julia implementations? Definitely!

My understanding was that people are stretched so thin on existing projects that we need more people interested and willing to hack!

Personally, I’m crazy about the applications of GenAI and building on top of it rather than training, but I bet that differs for everyone.

If you’re keen to hack deeper than pure inference, but still want an easy start, maybe you want to dip your toes in with Entropix? Have you played with it? It does a lot of clever stuff with really small models, striking a nice balance between performance, practicality, and being runnable locally. That could be a fun starter!
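
For anyone curious what the Entropix approach boils down to: it inspects the entropy (and related statistics) of the next-token distribution and switches sampling strategy accordingly. Here is a very stripped-down sketch of that idea in Julia; the thresholds and branch behaviour are illustrative only, not the actual Entropix logic:

```julia
# Stripped-down sketch of entropy-adaptive sampling (Entropix-inspired, not its
# real logic): look at the entropy of the next-token distribution and adapt.
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))
entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)   # in nats
sample_from(p) = findfirst(cumsum(p) .>= rand())      # inverse-CDF sampling

function sample_adaptive(logits::Vector{Float64}; low=0.5, high=3.0)
    p = softmax(logits)
    H = entropy(p)
    if H < low
        argmax(p)                               # model is confident: act greedily
    elseif H > high
        sample_from(softmax(logits ./ 1.5))     # model is "confused": explore more (toy choice)
    else
        sample_from(p)                          # ordinary sampling
    end
end

sample_adaptive(randn(32))   # returns a token index in 1:32
```
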

2 Likes

https://timkellogg.me/blog/2024/10/10/entropix

You might be surprised to learn that LLMs know when they’re confused, but that’s been known for a little while.

Funny how the link there is to a paper from this month, so “known for a little while” in AI research means what, about 3 weeks?! (Or does the new paper reference older papers/ideas?) Not really too surprising given the rapid changes, even if it does mean 3 weeks.

Thanks, I didn’t know of Entropix, seems interesting.

Last I heard, entropix is splitting the repository: one effort going toward huge models and pushing the limits of where this can go, the other focused on local LLMs, squeezing out every last drop of intelligence.

From the same blog:

2 Likes

How good / complete are the existing Julia packages for transformer networks in general, not necessarily for LLMs?

2 Likes