llama.cpp has been getting a lot of attention on Hacker News for its ability to run a Large Language Model (LLM) on any recent CPU with modest memory requirements. I’ve been meaning to get a better understanding of LLMs, so porting LLaMA over to Julia and being able to run it on my laptop seems like a good way to do that.
Has anyone else already started a similar project or have any thoughts? I briefly went through the C++ code and it looks fairly straightforward and a good fit for Julia from what I can tell.
 GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++
 Using LLaMA with M1 Mac and Python 3.11 | Hacker News
- Plain C/C++ implementation without dependencies
ggml.c: 10502 lines (8554 loc) · 316 KB
GGML Tensor Library
Perhaps picoGPT would be simpler?
picoGPT is an unnecessarily tiny and minimal implementation of GPT-2 in plain NumPy. The entire forward pass code is 40 lines of code.
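To give a feel for why the forward pass can be so small, here is a sketch of the core of a GPT-style block in plain NumPy: causal self-attention with a residual connection. This is a simplified illustration (layer norm, multi-head splitting, and the MLP are omitted), not picoGPT's actual code; the weights are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask):
    # scaled dot-product attention; the mask blocks attention to future tokens
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def gpt_block(x, wq, wk, wv, wo):
    # one causal self-attention block (layer norm and MLP omitted for brevity)
    n = x.shape[0]
    mask = (1 - np.tri(n)) * -1e10  # large negative above the diagonal
    out = attention(x @ wq, x @ wk, x @ wv, mask)
    return x + out @ wo  # residual connection

# toy usage with random weights: 4 tokens, embedding dimension 8
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
w = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
y = gpt_block(x, *w)
print(y.shape)  # (4, 8)
```

A full GPT-2 adds token/position embeddings, layer norms, and a feed-forward network per block, but it is all just matrix multiplies like these, which is why a NumPy or Julia port is so compact.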
any updates on this? I am also interested in using Julia for LLMs.
I haven’t had time to do anything, but it looks like someone is working on a project in Julia:
thank you - I have seen it!
This is a great endeavour. Please do not take the following comments as criticism, but as suggestions for what to do next.
In the end, though, the question will be speed. How does the implementation compare to that in
Also, I recently had the pleasure of using
Transformers.jl and was impressed by the package. It implements the Python counterpart with Julia flexibility, which is just nice. Unfortunately, it is about half the performance (speed) of the Python version. So the next question would be how a tiny implementation of llama compares to an implementation with
Transformers.jl, which might not exist yet. I think these are important questions, because while these tiny libraries are incredible for showing versatility, Transformers.jl should be the go-to package for LLMs, and we should learn tricks from these small packages to improve it.
Private confession: for some experiments, I had to use the Python counterpart, because the Falcon class of models uses a tokenizer which is not supported in Julia.
I am the author of Llama2.jl. It currently runs the Llama2 7B model (q4_K_S GGML quantization) at 9 tokens/second (it slows down to about 7 tokens/second as it approaches sequence length 512).
llama.cpp runs the same model at about 14 tokens/second. (all on an M1 Air)
As far as I know, llama2.c does not support loading GGML weights at all yet.
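For readers unfamiliar with what "q4_K_S GGML quantization" refers to: the actual format is more involved (weights are packed into super-blocks with their own sub-block scales), but the core idea of block-wise 4-bit quantization can be sketched in a few lines. This is a simplified symmetric scheme for illustration only, not the real GGML layout.

```python
import numpy as np

QK = 32  # simplified block size; the real q4_K format uses larger super-blocks

def quantize_q4(w):
    """Symmetric 4-bit block quantization (illustration, not q4_K_S)."""
    w = w.reshape(-1, QK)
    # one float scale per block, mapping the block's range onto [-7, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    # recover approximate float weights from 4-bit codes and per-block scales
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(1)
w = rng.normal(size=64).astype(np.float32)
q, s = quantize_q4(w)
w2 = dequantize_q4(q, s)
err = np.abs(w - w2).max()
```

The payoff is memory: each weight shrinks from 32 bits to roughly 4 bits plus a shared per-block scale, which is what lets a 7B model fit on a laptop. On the speed side, the matrix multiply then has to dequantize (or use integer kernels) on the fly, which is where implementations like llama.cpp spend most of their optimization effort.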
Why is it about half the speed compared to the C version? What is the secret sauce they use?
I wanted to try to load Llama2 into Transformers.jl, but I got a
GatedRepo error, because I have not accepted the license.
-Ofast -march=native ....
I was comparing speed against llama.cpp, not llama2.c.
I do not yet know what causes the difference.
OPT = -Ofast
CFLAGS += -march=native -mtune=native
CXXFLAGS += -march=native -mtune=native