Going pure Julia (rather than using state-of-the-art code and algorithms) is probably not the better option, except as a learning exercise. But if you do, consider “1-bit networks” (from 2023, and from this week):
It’s very likely that if you redo some software, you’ll be reimplementing an outdated approach; e.g. transformers, in their current form, are likely going away.
With such 1- and 2-bit networks we’ve likely reached the end of the line for quantization, and it helps keep model size down. To stay competitive in training you need thousands of GPUs, and software that can target that many, so pure Julia seems out of the question there. But maybe you can go halfway: leave out some parts, like distributing across many GPUs, and use DeepSpeed or something for that.
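For reference, the core of the “1.58-bit” (ternary) weight quantization is simple enough to sketch in a few lines of Julia. This follows my understanding of BitNet b1.58’s absmean scheme (scale by the mean absolute value, then round-and-clip to {-1, 0, +1}); the function names are mine, not from any package:

```julia
# Minimal sketch of BitNet b1.58-style ternary weight quantization (my reading
# of the absmean scheme): divide by the mean absolute value, then round and
# clip every weight to {-1, 0, +1}. Illustrative only.

# Quantize a weight matrix to ternary values plus a single scale.
function quantize_ternary(W::AbstractMatrix{<:AbstractFloat})
    γ = sum(abs, W) / length(W) + eps(eltype(W))     # absmean scale
    Wq = Int8.(clamp.(round.(W ./ γ), -1, 1))        # ternary weights
    return Wq, γ
end

# Dequantize (or use Wq directly: the matmul degenerates to adds/subtracts).
dequantize_ternary(Wq, γ) = γ .* Wq

W = randn(Float32, 4, 4)
Wq, γ = quantize_ternary(W)
```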
Training from scratch is still very costly, and there’s no need to, since you can finetune an existing model for Julia use. But then you need to choose the best model to start from, and formats/quantization as in llama.cpp or the new bitnet.cpp from Microsoft. See on the former (and its relation to Llama2.jl):
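On formats: the gist of llama.cpp-style quantization is block-wise, e.g. Q4_0 groups weights into blocks of 32 that share one scale. A hedged Julia sketch of that idea follows; the real GGUF layout packs two 4-bit codes per byte and uses Float16 scales, so this is only the concept, not the on-disk format:

```julia
# Block-wise 4-bit quantization in the spirit of llama.cpp's Q4_0:
# blocks of 32 weights, one scale per block, 4-bit codes per weight.
# Details of the real format differ; this is a conceptual sketch.

const BLOCK = 32

# Quantize a flat weight vector into per-block scales and codes in 0:15.
function quantize_q4(w::Vector{Float32})
    @assert length(w) % BLOCK == 0
    nblocks = length(w) ÷ BLOCK
    scales = Vector{Float32}(undef, nblocks)
    codes  = Vector{UInt8}(undef, length(w))
    for b in 1:nblocks
        r = (b-1)*BLOCK+1 : b*BLOCK
        amax = maximum(abs, @view w[r])
        d = amax / 7                          # map [-amax, amax] onto ~ -7..7
        scales[b] = d
        for i in r
            q = d == 0 ? 0 : clamp(round(Int, w[i] / d), -8, 7)
            codes[i] = UInt8(q + 8)           # store unsigned 0..15 (4 bits)
        end
    end
    return scales, codes
end

dequantize_q4(scales, codes) =
    [scales[(i-1) ÷ BLOCK + 1] * (Int(codes[i]) - 8) for i in eachindex(codes)]
```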
KAN networks (they can be a drop-in replacement for the MLP part of transformers, if I recall) are worthwhile to reimplement in Julia:
KANs are likely not compatible with 1-bit networks, in the sense that their weights are larger, but they might still be a win if you can get away with fewer of them. The two are also not entirely contradictory, since you could still have a transformer whose other parts use 1-bit weights wherever KAN is not replacing the MLP part. But isn’t the MLP part the largest part of the total?
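To make the KAN idea concrete, here is a simplified, self-contained Julia sketch of a KAN layer as I understand it: each edge (input i → output j) gets its own learnable univariate function, and the output node just sums them. The original paper uses B-splines plus a SiLU base branch; to keep this short I use a small Gaussian basis on a fixed grid, and all names are illustrative, not from any existing package:

```julia
# Simplified KAN layer: y_j = Σ_i φ_{j,i}(x_i), where each φ is a learnable
# 1-D function (here: base SiLU branch + coefficients over a Gaussian basis).

struct KANLayer
    coef::Array{Float32,3}   # (out, in, nbasis) coefficients per edge
    base::Matrix{Float32}    # (out, in) weights for the SiLU base branch
    grid::Vector{Float32}    # shared basis centers
    width::Float32           # basis width
end

function KANLayer(in::Int, out::Int; nbasis::Int = 8, lo = -2f0, hi = 2f0)
    grid = Float32.(collect(range(lo, hi, length = nbasis)))
    KANLayer(0.1f0 .* randn(Float32, out, in, nbasis),
             0.1f0 .* randn(Float32, out, in),
             grid,
             Float32((hi - lo) / (nbasis - 1)))
end

silu(x) = x / (1 + exp(-x))

# Forward pass for one input vector x of length `in`.
function (l::KANLayer)(x::AbstractVector{Float32})
    out, inn, _ = size(l.coef)
    y = zeros(Float32, out)
    for i in 1:inn
        ϕ = exp.(-((x[i] .- l.grid) ./ l.width) .^ 2)   # basis activations
        for j in 1:out
            y[j] += l.base[j, i] * silu(x[i])           # base branch
            y[j] += sum(l.coef[j, i, :] .* ϕ)           # learnable 1-D function
        end
    end
    return y
end

layer = KANLayer(16, 32)
y = layer(randn(Float32, 16))
```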
I think it’s also worthwhile to help with this:
The best models will likely use new ways of multiplying that aren’t yet in software or hardware (but you could emulate them slowly(?) in software for compatibility until hardware catches up, or maybe just use FP8 or bfloat16, I don’t recall which, it might be compatible with that): https://arxiv.org/html/2410.00907v2#S2
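Since hardware support isn’t there yet, a slow software emulation is at least possible. Below is a rough Julia sketch of my reading of the paper’s L-Mul formula (Section 2): the mantissa product is replaced by adding the mantissas plus a small constant 2^(-l(m)). The function names and the default m = 23 (Float32 mantissa bits) are my own choices, and the paper targets low-bit mantissas, so treat this purely as an illustration, not a verified reimplementation:

```julia
# Rough, slow emulation of the L-Mul (linear-complexity multiplication) idea:
# replace the mantissa product with a cheap constant correction term.
# This is my reading of the paper's Section 2; sketch only.

# Offset l(m) for a mantissa of m bits (as I recall the paper's definition).
lmul_offset(m::Int) = m <= 3 ? m : (m == 4 ? 3 : 4)

# Approximate x*y by decomposing into sign, exponent and mantissa,
# then *adding* mantissas instead of multiplying them.
function lmul(x::Float32, y::Float32; m::Int = 23)
    (iszero(x) || iszero(y)) && return 0.0f0
    sx, sy = sign(x), sign(y)
    fx, ex = frexp(abs(x))          # abs(x) == fx * 2^ex with fx in [0.5, 1)
    fy, ey = frexp(abs(y))
    xm = 2fx - 1                    # mantissa fraction in [0, 1)
    ym = 2fy - 1
    # (1 + xm)*(1 + ym) ≈ 1 + xm + ym + 2^(-l(m))
    mant = 1 + xm + ym + 2.0f0^(-lmul_offset(m))
    return sx * sy * Float32(ldexp(mant, (ex - 1) + (ey - 1)))
end

lmul(3.0f0, 5.0f0)   # ≈ 14.5 vs the exact 15.0
```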