Reactant would be the way to go if you want good performance for these workloads. If you want starter code, we have some WIP versions scattered across PRs at the moment (see the rough sketch after these links):
- feat: nanoGPT implementation using Reactant by avik-pal · Pull Request #1062 · LuxDL/Lux.jl · GitHub
- feat: add a Llama2 model by avik-pal · Pull Request #88 · EnzymeAD/Reactant.jl · GitHub
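For a sense of what the workflow looks like, here is a minimal sketch of the usual Lux + Reactant pattern (move parameters to Reactant arrays, then compile the forward pass through XLA/StableHLO). This is just illustrative and not pulled from the PRs above; the tiny `Dense` model and the variable names are placeholders.

```julia
using Lux, Reactant, Random

# A toy model standing in for nanoGPT / Llama2 from the linked PRs
model = Dense(4 => 2)
ps, st = Lux.setup(Random.default_rng(), model)

# Move parameters, state, and inputs onto Reactant arrays
ps_ra = Reactant.to_rarray(ps)
st_ra = Reactant.to_rarray(st)
x_ra  = Reactant.to_rarray(rand(Float32, 4, 8))

# Trace + compile the forward pass once, then reuse the compiled function
forward = Reactant.@compile model(x_ra, ps_ra, st_ra)
y, _ = forward(x_ra, ps_ra, st_ra)
```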
Even the quantized ops needed for inference already exist on the StableHLO side; we just haven't hooked them up in Julia yet, but that is definitely doable.