LLaMA.cpp [1] has been getting a lot of attention on Hacker News [2] for its ability to run a Large Language Model (LLM) on any recent CPU with modest memory requirements. I’ve been meaning to get a better understanding of LLMs, so porting LLaMA over to Julia and being able to run it on my laptop seems like a good way to do that.
Has anyone else already started a similar project or have any thoughts? I briefly went through the C++ code and it looks fairly straightforward and a good fit for Julia from what I can tell.
This is a great endeavour. Please don't take the following comments as criticism, but as suggestions for what to do next.
But in the end, the question will be speed. How does the implementation compare to that in llama.c?
Also, I recently had the pleasure of using Transformers.jl and was impressed by how good the package is. It mirrors its Python counterpart while keeping Julia's flexibility, which is just nice. Unfortunately, it runs at about half the speed of the Python version. So the next question would be how a tiny implementation of llama compares to an implementation built on Transformers.jl, which might not exist yet. I think these are important questions, because these tiny libraries are incredible for showing versatility; Transformers.jl should be the go-to package for LLMs, and we should learn tricks from these small packages to improve it.
A private confession: for some experiments, I had to use the Python counterpart, because the Falcon class of models uses a tokenizer that is not supported in Julia.
I am the author of Llama2.jl. It currently runs the Llama2 7B model (q4_K_S GGML quantization) at 9 tokens/second (it slows down to about 7 tokens/second as it approaches sequence length 512).
llama.cpp runs the same model at about 14 tokens/second. (all on an M1 Air)
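Throughput figures like these are usually obtained by timing a token-generation loop. A minimal sketch (the `generate_token` callback is hypothetical; real benchmarks would also warm up and average over runs):

```python
import time

def tokens_per_second(generate_token, n=64):
    """Time n calls to a token-generation callback and return throughput."""
    t0 = time.perf_counter()
    for _ in range(n):
        generate_token()
    return n / (time.perf_counter() - t0)
```

Note that, as mentioned above, throughput can drop as the sequence grows (attention cost scales with context length), so the sequence position at which you measure matters.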
As far as I know, llama2.c does not support loading GGML weights at all yet.
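For readers unfamiliar with the q4 formats mentioned above: the core idea of GGML's 4-bit quantization is to store weights in small blocks, each as 4-bit integers plus a per-block scale. The sketch below is a simplified illustration of that idea, not the actual q4_K_S layout (which uses super-blocks with separate 6-bit sub-block scales and mins):

```python
# Simplified block-wise 4-bit quantization, illustrating the idea
# behind GGML's q4 formats (NOT the real q4_K_S bit layout).

def quantize_q4(block):
    """Quantize a block of floats to 4-bit signed ints plus one scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 7.0  # map the largest magnitude onto the 4-bit range
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize_q4(scale, q):
    """Recover approximate floats from the quantized block."""
    return [scale * v for v in q]
```

Each weight costs 4 bits plus an amortized share of the scale, which is how a 7B model fits in a few gigabytes; the rounding error per weight is bounded by half the block scale.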