I’m ignorant about LLMs, machine learning, etc., but I want to run the Gemma 2 9B large language model. I’m not interested in using it as a chatbot; rather, I want to use it because it defines the evaluation metric for Kaggle’s Santa 2024 optimization heuristic competition. I wonder if it’s possible to run the Gemma 2 9B LLM:
In pure Julia
On a CPU, without relying on a GPU. I know this would be much slower, if it’s possible at all, but I feel like stepping through a debugger and using tools like Cthulhu.jl could help give me a better understanding of how the LLM works (the kind of session sketched just below this list).
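To illustrate the kind of inspection I have in mind, something like the following, where `softmax` is only a toy stand-in for whatever function inside the model I would actually want to look at:

```julia
using Cthulhu            # interactive exploration of typed Julia code, CPU only

# Toy stand-in for one step of the model: a numerically stable softmax.
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

# Opens an interactive session where each call can be descended into,
# showing the inferred types and optimized IR.
@descend softmax(randn(Float32, 8))
```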
As I tried to briefly explain above, the Santa 2024 competition submissions are all evaluated with that specific LLM, in a well-defined manner. So it’s not my choice, and Llama won’t work; only Gemma 2 9B specifically.
TBH, I’m still a little confused about your purpose here. If your goal is to win the competition, I’d suggest calculating the perplexity in Python code (maybe via PythonCall.jl) to avoid any potential inconsistency. But if your goal is to understand how the Gemma 2 model works, you’d be better off starting with the Llama models first. If you look into the model architecture, you’ll find that there are only very small differences.
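For the scoring route, a minimal sketch of the PythonCall.jl approach could look like this (assuming the Hugging Face `transformers` and `torch` packages are installed in the linked Python environment, and that `google/gemma-2-9b` is the checkpoint the competition uses; the official metric has its own tokenisation and scoring details, so treat this only as the general shape):

```julia
using PythonCall                     # drives the reference Python implementation

transformers = pyimport("transformers")
torch        = pyimport("torch")

# Load the model the competition scores with; fp32 on the CPU is slow but works.
tok   = transformers.AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = transformers.AutoModelForCausalLM.from_pretrained("google/gemma-2-9b",
                                                          torch_dtype = torch.float32)
model.eval()

# Perplexity of a text: exponential of the mean negative log-likelihood
# that the Python model reports when the inputs are also the labels.
function gemma_perplexity(text::AbstractString)
    enc = tok(text, return_tensors = "pt")
    out = model(input_ids = enc["input_ids"], labels = enc["input_ids"])
    return exp(pyconvert(Float64, out.loss.item()))
end

gemma_perplexity("reindeer mistletoe gingerbread sleigh ornament")
```

This keeps the tokenisation and numerics identical to the reference implementation, which is the whole point of scoring through Python.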
I’m working on a tutorial explaining how to do fast LLM inference in pure Julia, but it isn’t finished yet. For now, you may take a look at the following code snippets.
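As a teaser, here is roughly what the evaluation itself reduces to once you have per-token logits out of the model, in plain CPU-only Julia (the function name, the vocab × sequence-length logit layout, and `targets` holding the token id to predict at each position are just my conventions for this sketch):

```julia
# Perplexity from raw logits: exp of the mean negative log-likelihood.
function perplexity(logits::AbstractMatrix, targets::AbstractVector{<:Integer})
    @assert size(logits, 2) == length(targets)
    nll = 0.0
    for (t, y) in enumerate(targets)
        col  = @view logits[:, t]
        m    = maximum(col)
        logZ = m + log(sum(exp, col .- m))   # numerically stable log-sum-exp
        nll += logZ - col[y]                 # -log p(target token | context)
    end
    return exp(nll / length(targets))
end

perplexity(randn(Float32, 32_000, 10), rand(1:32_000, 10))   # dummy data, just to run it
```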
My hope was that examining the LLM as a white box might allow either simplifying the evaluator while retaining accuracy, or figuring out some upper or lower bounds on it. However, I’m beginning to think I was overly optimistic in this regard.