[ANN] Jjama3.jl (unregistered) - Llama3.1 and Llama3.2 (text) in Julia

What it says in the title. Is this the fastest transformer implementation? No. But is it flexible enough to handle large families of LLMs? Also no.

It is not heavily optimized, but it does have e.g. KV-caching, so sampling doesn’t grind to a halt as the sequence gets longer. It comfortably runs Llama3.2 1B and 3B on CPU on my laptop, and uncomfortably runs Llama3.1 8B. They’re much faster with CUDA on a GPU, though.
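For anyone unfamiliar with why KV-caching matters: without it, every new token means recomputing keys and values for the whole prefix. Here is a minimal, self-contained sketch of the idea for a single attention head (my own illustration, not Jjama3’s actual cache type or API):

```julia
# Conceptual sketch only (not Jjama3's implementation): a single-head KV cache.
# Each decoding step appends the new token's key/value and attends over what is
# already cached, instead of re-encoding the whole prefix.
mutable struct KVCache
    K::Matrix{Float32}   # d_head × n_cached
    V::Matrix{Float32}   # d_head × n_cached
end

KVCache(d_head::Int) = KVCache(zeros(Float32, d_head, 0), zeros(Float32, d_head, 0))

function attend!(cache::KVCache, q::Vector{Float32}, k::Vector{Float32}, v::Vector{Float32})
    cache.K = hcat(cache.K, k)                       # grow the cache by one column
    cache.V = hcat(cache.V, v)
    scores = (cache.K' * q) ./ sqrt(Float32(length(q)))
    w = exp.(scores .- maximum(scores))              # numerically stable softmax
    w ./= sum(w)
    return cache.V * w                               # attention output for the new token
end
```

The point is that each decoding step only pays for attention over the cached length, rather than re-running the model over the full sequence.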

Why did I make this? Because I’m too stupid to get the existing packages working for Llama3.2 models.

16 Likes

By “existing packages”, do you mean GitHub - cafaxo/Llama2.jl: Julia package for inference and training of Llama-style language models?

Jjama3.jl/ext/MetalExt.jl at main · MurrellGroup/Jjama3.jl · GitHub looks like it’d make a great addition to NNlib (much like how NNlib.jl/ext/NNlibAMDGPUExt/batched_mul.jl at master · FluxML/NNlib.jl · GitHub exists now).

Yes. And Transformers.jl. And some others I tried. Probably a “me” issue.

Yes, there is this neglected issue from May: batched_mul doesn't work with MtlArrays · Issue #581 · FluxML/NNlib.jl · GitHub
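For context, the missing method is essentially a batched matrix multiply on MtlArrays. A naive, illustration-only fallback (not the code in Jjama3’s MetalExt.jl, and nowhere near the performance a proper NNlib method should have) could look like this:

```julia
using Metal

# Illustration only: compute a batched matmul on Metal arrays one slice at a time,
# so each slice is an ordinary MtlMatrix multiplication.
function naive_batched_mul(A::MtlArray{Float32,3}, B::MtlArray{Float32,3})
    @assert size(A, 2) == size(B, 1) && size(A, 3) == size(B, 3)
    C = MtlArray(zeros(Float32, size(A, 1), size(B, 2), size(A, 3)))
    for k in axes(A, 3)
        C[:, :, k] = A[:, :, k] * B[:, :, k]
    end
    return C
end
```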

A few additions to this today:

Switching tokenizers unlocked some fun small open models, like the SmolLM2 series (more open than Llama3.2, which is behind a permissions wall, so this might reduce a barrier to getting started). With the LoRA addition, this is at a fairly decent point for someone wanting to tinker with LLMs. Cooking up new samplers is a fun sport (evaluating them is trickier), and you can finetune a 1.7 billion parameter model just on your CPU (see our example where we make one much stupider).
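To illustrate the “cooking up new samplers” part: a sampler ultimately just maps a logits vector to a token index, so experimenting is cheap. A minimal temperature + top-k example (my own sketch, not Jjama3’s sampler interface):

```julia
using StatsBase  # sample, Weights

# Minimal sampler sketch: temperature scaling followed by top-k filtering.
function sample_token(logits::Vector{Float32}; temperature=0.8f0, top_k=40)
    scaled = logits ./ temperature
    keep = partialsortperm(scaled, 1:min(top_k, length(scaled)); rev=true)
    probs = zeros(Float32, length(scaled))
    probs[keep] = exp.(scaled[keep] .- maximum(scaled[keep]))
    probs ./= sum(probs)
    return sample(eachindex(probs), Weights(probs))  # vocabulary index of the chosen token
end
```

Swapping in top-p, min-p, or repetition penalties is just a change to how `probs` is built before the final draw.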

4 Likes

This is such a cool discussion! It got me wondering—how far could we be from having protein language models (pLMs), like BioTransformers, implemented in Julia?

2 Likes

Many sequence-only pLMs just require replacing the tokenizer with a one-hot encoder, so they’re simpler than LLMs to port. My lab has a few (unreleased - we’ll get there) protein structure transformers, and we’ve made code for some key components available if others want to build on them (e.g. GitHub - MurrellGroup/InvariantPointAttention.jl: Julia implementation of AlphaFold 2's Invariant Point Attention and GitHub - MurrellGroup/MessagePassingIPA.jl).
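As a concrete picture of the “replace the tokenizer with a one-hot encoder” point, here is a minimal sketch using the standard 20 amino-acid alphabet (not taken from any particular pLM codebase):

```julia
# Minimal one-hot encoder for protein sequences (standard 20 amino-acid alphabet).
const AMINO_ACIDS = collect("ACDEFGHIKLMNPQRSTVWY")
const AA_INDEX = Dict(aa => i for (i, aa) in enumerate(AMINO_ACIDS))

function onehot_protein(seq::AbstractString)
    X = zeros(Float32, length(AMINO_ACIDS), length(seq))
    for (j, aa) in enumerate(seq)
        X[AA_INDEX[aa], j] = 1f0
    end
    return X  # 20 × length(seq), used in place of token-embedding lookups
end

onehot_protein("MKTAYIAK")  # 20 × 8 matrix
```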

6 Likes

Just added support for Qwen2.5 models, which come in a very nice range of sizes (starting as low as 0.5B) and variants (base, coder, math) and are among the best for their size. Theoretically this can also run the new QwQ “reasoning” model, if you have enough VRAM to do this in full Float32 (I don’t, so I haven’t tested it).
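For a sense of scale on the VRAM point (my own back-of-the-envelope numbers; the parameter count is an assumption):

```julia
n_params = 32.5e9                 # approximate QwQ parameter count (assumption)
weights_gb = n_params * 4 / 1e9   # 4 bytes per parameter in Float32, weights only
# ≈ 130 GB before the KV-cache and activations are counted
```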

6 Likes

I can now confirm that the QwQ 32B “reasoning” model runs, but I don’t have a 128 GB GPU, so I checked it on the CPU:



…at about 0.2 tokens per second.

4 Likes

I love it!

I’d love to add it to the community GenAI project tracker: GitHub - svilupp/awesome-generative-ai-meets-julia-language: Comprehensive guide to generative AI projects and resources in Julia!
Feel free to open a PR; otherwise I’ll try to do it over the weekend. There are many new projects that need to be added 🙂

Is the package mature enough for me to write a small tutorial for users on integrating it with PromptingTools? I still have to look at your chat template support, but I can add the Llama3 INST template renderer on the PromptingTools side.

2 Likes

Thanks. It is new and still under active development, so I’d suggest waiting a bit before trying to integrate it with anything else. It needs one minor interface tweak (to preserve the KV-cache between calls) for efficient back-and-forth chat, so at least wait for that; I’ll try to remember to drop a note in here when that is pushed.

1 Like