What it says in the title. Is this the fastest transformer implementation? No. But is it flexible enough to handle large families of LLMs? Also no.
It is not heavily optimized, but it does have the basics, e.g. KV-caching, so sampling doesn’t grind to a halt as the sequence gets longer. It comfortably runs Llama3.2 1B and 3B on CPU on my laptop, and uncomfortably runs Llama3.1 8B. All of them are much faster with CUDA on a GPU, though.
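For context, KV-caching just means that at each decoding step the model projects only the newest token and attends over the keys/values it has already stored, instead of recomputing the whole prefix every step. Here is a rough single-head sketch in plain Julia (illustrative only, not this package’s API; all names are made up):

```julia
# Minimal single-head KV-cache sketch (illustrative, not this package's API).
# Without a cache, step t recomputes keys/values for all t tokens; with a cache,
# step t only projects the new token and attends over what is already stored.

using LinearAlgebra

mutable struct KVCache
    K::Matrix{Float32}   # d × t cached keys, one column per past token
    V::Matrix{Float32}   # d × t cached values
end

KVCache(d::Int) = KVCache(zeros(Float32, d, 0), zeros(Float32, d, 0))

# One decoding step: project the newest token, append it to the cache,
# and attend over everything cached so far.
function attend_step!(cache::KVCache, x::Vector{Float32}, Wq, Wk, Wv)
    q = Wq * x
    cache.K = hcat(cache.K, Wk * x)
    cache.V = hcat(cache.V, Wv * x)
    scores = (cache.K' * q) ./ sqrt(Float32(length(q)))
    w = exp.(scores .- maximum(scores))   # stable softmax
    w ./= sum(w)
    return cache.V * w                    # attention output for the new token
end

# Toy usage: d = 8, random projections, feed tokens one at a time.
d = 8
Wq, Wk, Wv = randn(Float32, d, d), randn(Float32, d, d), randn(Float32, d, d)
cache = KVCache(d)
for t in 1:5
    attend_step!(cache, randn(Float32, d), Wq, Wk, Wv)
end
size(cache.K)   # (8, 5): one cached key per token seen so far
```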
Why did I make this? Because I’m too stupid to get the existing packages working for Llama3.2 models.
Switching tokenizers unlocked some fun small open models, like the SmolLM2 series (more open than Llama3.2, which sits behind a permissions wall, so this might lower the barrier to getting started). With the LoRA addition, this is at a fairly decent point for someone wanting to tinker with LLMs. Cooking up new samplers is a fun sport (evaluating them is trickier), and you can finetune a 1.7-billion-parameter model on just your CPU (see our example where we make one much stupider).
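To give a flavour of what “cooking up a sampler” involves: below is a minimal temperature + top-p (nucleus) sampler over a raw logits vector, in plain Julia. It’s a sketch, not this package’s sampler interface; the function name and keywords are invented for illustration:

```julia
# Minimal temperature + top-p (nucleus) sampler over a logits vector.
# Illustrative only: plain Julia, not this package's sampler API.

using Random

function sample_top_p(logits::Vector{Float32}; temperature::Float32 = 0.8f0,
                      p::Float32 = 0.9f0, rng = Random.default_rng())
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = logits ./ temperature
    probs = exp.(scaled .- maximum(scaled))
    probs ./= sum(probs)

    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = sortperm(probs; rev = true)
    cum = cumsum(probs[order])
    cutoff = findfirst(>=(p), cum)
    keep = order[1:(cutoff === nothing ? length(order) : cutoff)]

    # Renormalize over the kept tokens and draw one index.
    kept = probs[keep] ./ sum(probs[keep])
    r = rand(rng, Float32)
    acc = 0f0
    for (i, tok) in enumerate(keep)
        acc += kept[i]
        acc >= r && return tok   # token id (index into the logits vector)
    end
    return keep[end]
end

# Toy usage: a "vocabulary" of 10 tokens.
token_id = sample_top_p(randn(Float32, 10))
```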
This is such a cool discussion! It got me wondering—how far could we be from having protein language models (pLMs), like BioTransformers, implemented in Julia?
Just added support for Qwen2.5 models, which come in a very nice range of sizes (starting as low as 0.5B) and variants (base, coder, math) and are among the best for their size. Theoretically this can also run the new QwQ “reasoning” model, if you have enough VRAM to do this in full Float32 (I don’t, so I haven’t tested it).
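For a sense of why full Float32 is the sticking point: assuming QwQ here means the roughly 32.5B-parameter QwQ-32B-Preview (the parameter count is my assumption, not something stated above), the weights alone at 4 bytes per parameter come to about 130 GB of VRAM:

```julia
# Back-of-the-envelope weight memory for a ~32.5B-parameter model in full Float32.
# (Parameter count is an assumption for illustration, not from this thread.)
params = 32.5e9
bytes_per_param = 4   # Float32
println(round(params * bytes_per_param / 1e9; digits = 1), " GB")   # ≈ 130.0 GB, weights only
```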
Is the package mature enough to write a small tutorial for users on how to integrate it with PromptingTools? I still need to look at your chat template support, but I can add the Llama3 INST template renderer on the PromptingTools side.
Thanks. It is new and still under active development, so I’d suggest waiting a bit before trying to integrate it with anything else. It needs one minor interface tweak (to preserve the KV-cache between calls) for efficient back-and-forth chat, so at least wait for that. I’ll try to remember to drop a note in here when that is pushed.