What it says in the title. Is this the fastest transformer implementation? No. But is it flexible enough to handle large families of LLMs? Also no.
It is not heavily optimized, but it does have the basics, e.g. KV-caching, so sampling doesn’t grind to a halt as the sequence gets longer. It comfortably runs Llama3.2 1B and 3B on CPU on my laptop, and uncomfortably runs Llama3.1 8B. All of them are much faster with CUDA on a GPU, though.
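For context, KV-caching just means that at each decoding step the model projects only the newest token and attends over the keys/values it has already stored, instead of recomputing the whole prefix every step. Here is a rough single-head sketch in plain Julia (illustrative only, not this package’s API; all names are made up):

```julia
# Minimal single-head KV-cache sketch (illustrative, not this package's API).
# Without a cache, step t recomputes keys/values for all t tokens; with a cache,
# step t only projects the new token and attends over what is already stored.

using LinearAlgebra

mutable struct KVCache
    K::Matrix{Float32}   # d × t cached keys, one column per past token
    V::Matrix{Float32}   # d × t cached values
end

KVCache(d::Int) = KVCache(zeros(Float32, d, 0), zeros(Float32, d, 0))

# One decoding step: project the newest token, append it to the cache,
# and attend over everything cached so far.
function attend_step!(cache::KVCache, x::Vector{Float32}, Wq, Wk, Wv)
    q = Wq * x
    cache.K = hcat(cache.K, Wk * x)
    cache.V = hcat(cache.V, Wv * x)
    scores = (cache.K' * q) ./ sqrt(Float32(length(q)))
    w = exp.(scores .- maximum(scores))   # stable softmax
    w ./= sum(w)
    return cache.V * w                    # attention output for the new token
end

# Toy usage: d = 8, random projections, feed tokens one at a time.
d = 8
Wq, Wk, Wv = randn(Float32, d, d), randn(Float32, d, d), randn(Float32, d, d)
cache = KVCache(d)
for t in 1:5
    attend_step!(cache, randn(Float32, d), Wq, Wk, Wv)
end
size(cache.K)   # (8, 5): one cached key per token seen so far
```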
Why did I make this? Because I’m too stupid to get the existing packages working for Llama3.2 models.
Switching tokenizers unlocked some fun small open models, like the SmolLM2 series (more open than Llama3.2, which sits behind a permissions wall, so this might lower the barrier to getting started). With the LoRA addition, this is at a fairly decent point for someone wanting to tinker with LLMs. Cooking up new samplers is a fun sport (evaluating them is trickier), and you can finetune a 1.7-billion-parameter model on just your CPU (see our example where we make one much stupider).
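To give a flavour of what “cooking up a sampler” involves: below is a minimal temperature + top-p (nucleus) sampler over a raw logits vector, in plain Julia. It’s a sketch, not this package’s sampler interface; the function name and keywords are invented for illustration:

```julia
# Minimal temperature + top-p (nucleus) sampler over a logits vector.
# Illustrative only: plain Julia, not this package's sampler API.

using Random

function sample_top_p(logits::Vector{Float32}; temperature::Float32 = 0.8f0,
                      p::Float32 = 0.9f0, rng = Random.default_rng())
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = logits ./ temperature
    probs = exp.(scaled .- maximum(scaled))
    probs ./= sum(probs)

    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = sortperm(probs; rev = true)
    cum = cumsum(probs[order])
    cutoff = findfirst(>=(p), cum)
    keep = order[1:(cutoff === nothing ? length(order) : cutoff)]

    # Renormalize over the kept tokens and draw one index.
    kept = probs[keep] ./ sum(probs[keep])
    r = rand(rng, Float32)
    acc = 0f0
    for (i, tok) in enumerate(keep)
        acc += kept[i]
        acc >= r && return tok   # token id (index into the logits vector)
    end
    return keep[end]
end

# Toy usage: a "vocabulary" of 10 tokens.
token_id = sample_top_p(randn(Float32, 10))
```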
This is such a cool discussion! It got me wondering—how far could we be from having protein language models (pLMs), like BioTransformers, implemented in Julia?
Just added support for Qwen2.5 models, which come in a very nice range of sizes (starting as low as 0.5B) and variants (base, coder, math) and are among the best for their size. Theoretically this can also run the new QwQ “reasoning” model, if you have enough VRAM to do this in full Float32 (I don’t, so I haven’t tested it).
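For a sense of why full Float32 is the sticking point: assuming QwQ here means the roughly 32.5B-parameter QwQ-32B-Preview (the parameter count is my assumption, not something stated above), the weights alone at 4 bytes per parameter come to about 130 GB of VRAM:

```julia
# Back-of-the-envelope weight memory for a ~32.5B-parameter model in full Float32.
# (Parameter count is an assumption for illustration, not from this thread.)
params = 32.5e9
bytes_per_param = 4   # Float32
println(round(params * bytes_per_param / 1e9; digits = 1), " GB")   # ≈ 130.0 GB, weights only
```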
Is the package mature enough to write a small tutorial for users on how to integrate it with PromptingTools? I still need to look at your chat template support, but I can add the Llama3 INST template renderer on the PromptingTools side.
Thanks. It is new and still under active development, so I’d suggest waiting a bit before trying to integrate it with anything else. It needs one minor interface tweak (to preserve the KV-cache between calls) for efficient back-and-forth chat, so at least wait for that. I’ll try to remember to drop a note in here when that is pushed.