Run the Gemma 2 9B LLM in Julia and without a GPU?

I’m ignorant about LLMs, machine learning, etc., but I want to run the Gemma 2 9B large language model. I’m not interested in using it as a chat bot; rather, I want to use it because it defines the evaluation metric for Kaggle’s Santa 2024 optimization heuristic competition. I wonder if it’s possible to run the Gemma 2 9B LLM:

  • In pure Julia
  • On a CPU, without relying on a GPU. I know this would be much slower, if possible at all, but I feel like stepping through a debugger and using tools like Cthulhu.jl could help give me a better understanding of how the LLM works.
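
For context, and with the caveat that I may be getting details wrong, my understanding is that the competition scores each submission by the perplexity that Gemma 2 9B assigns to the reordered text, roughly:

$$
\mathrm{PPL}(w_1,\dots,w_N) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(w_i \mid w_{<i}\right)\right)
$$

where $p_\theta$ is the probability the model assigns to each token given the preceding ones.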

Gemma 2 9B links:

Santa 2024:

I too am ignorant here, so hopefully someone else will weigh in, but what made you settle on that model?

I saw someone posting about getting Llama models working in pure Julia recently (I’m on my phone, but I could track it down later). Would those work?

As I tried to explain above, the Santa 2024 submissions are all evaluated with that specific LLM, in a well-defined manner. So it’s not my choice, and Llama won’t work; it has to be Gemma 2 9B specifically.


That’s possible.

TBH, I’m still a little confused about your purpose here. If your goal is to win the competition, I’d suggest calculating the perplexity with the reference Python code (via PythonCall.jl, perhaps) to avoid any potential inconsistency. But if your goal is to understand how the Gemma 2 model works, you’d be better off starting with the Llama models first; if you look at the model architectures, you’ll find only very small differences.
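
For illustration, here is a minimal sketch of what the PythonCall.jl route could look like, assuming the Hugging Face `transformers` and `torch` packages are available in the CondaPkg environment. The model name and the scoring details here are my assumptions, not the competition’s official evaluation code:

```julia
using PythonCall  # calls into a Python environment managed by CondaPkg

torch        = pyimport("torch")
transformers = pyimport("transformers")

# Load the tokenizer and model once (downloads the weights the first time).
tok   = transformers.AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = transformers.AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")

# Perplexity of one piece of text: exp of the mean token-level cross-entropy loss.
function perplexity(text::AbstractString)
    enc = tok(text; return_tensors = "pt")
    out = pywith(torch.no_grad()) do _
        model(; input_ids = enc["input_ids"], labels = enc["input_ids"])
    end
    exp(pyconvert(Float64, out.loss.item()))
end

perplexity("the quick brown fox jumps over the lazy dog")
```

On a CPU this should run, just slowly, provided there is enough RAM for the full-precision weights.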

I’m working on a tutorial to explain how to do fast LLM inference in pure Julia, but it has not yet been finished. For now, you may take a look at the following code snippets.

Source
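
In the meantime, to give a flavour of what the pure-Julia path involves, here is a small self-contained sketch of the core computation in a decoder layer: causal scaled-dot-product attention plus RMSNorm, in plain Julia. It is only an illustration, not Gemma 2’s actual implementation (which additionally uses grouped-query attention, RoPE position embeddings, logit soft-capping, and so on):

```julia
using LinearAlgebra

# Column-wise softmax: each column is normalized independently.
softmax_cols(s) = (e = exp.(s .- maximum(s; dims = 1)); e ./ sum(e; dims = 1))

# RMSNorm over the feature dimension (columns are token positions, rows are features).
# The learned scale vector that real implementations apply is omitted here.
rmsnorm(x; eps = 1f-6) = x ./ sqrt.(sum(abs2, x; dims = 1) ./ size(x, 1) .+ eps)

# Single-head causal self-attention on a d×n matrix of token embeddings.
function causal_attention(x, Wq, Wk, Wv, Wo)
    d, n = size(x)
    q, k, v = Wq * x, Wk * x, Wv * x                # queries, keys, values, each d×n
    scores = (k' * q) ./ sqrt(Float32(d))           # scores[i, j]: query j attending to key i
    mask = [i <= j ? 0f0 : -Inf32 for i in 1:n, j in 1:n]  # query j only sees keys i <= j
    a = softmax_cols(scores .+ mask)                # attention weights, one column per query
    Wo * (v * a)                                    # weighted sum of values, projected back
end

# Tiny usage example with random weights.
d, n = 8, 5
x = randn(Float32, d, n)
Wq, Wk, Wv, Wo = (randn(Float32, d, d) ./ sqrt(Float32(d)) for _ in 1:4)
y = causal_attention(rmsnorm(x), Wq, Wk, Wv, Wo)    # d×n, one output column per token
```

The real model stacks dozens of such layers, with multi-head projections, feed-forward blocks, and a final unembedding into token logits; perplexity then comes from the softmax over those logits.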


My hope was that examining the LLM as a white box might make it possible either to simplify the evaluator while retaining accuracy, or to derive some upper or lower bounds on it. However, I’m beginning to think I was overly optimistic in this regard :sweat_smile: