If you’re asking me, there’s so much I could write on this, but I want to be confident in my answers, so I’ll let others answer.
About training: I thought that training from scratch would be basically an impossible task, and that we’re also missing infrastructure code (in Julia, if we can’t use what’s already available elsewhere). This seems like a huge deal from August:
This is the repository for DisTrO (Distributed Training Over-The-Internet), a family of low latency distributed optimizers that reduce inter-GPU communication requirements by three to four orders of magnitude.
This means people could pool their home GPUs and help out together, I think, but it’s unclear whether it lowers the total GPU requirements, so we likely can’t get 300,000 GPUs or so (or that many people) to help.
Either you fine-tune (doable) or you train from scratch, but do only those extremes exist? I suppose you don’t have to start totally from scratch every time; it would be best to start from some early checkpoint.
I’ve not looked into DisTrO closely: is it a replacement for Adam, Lion, etc. (likely not), or does it build on them? It probably at least replaces DeepSpeed.
DoReMi is also a very intriguing development: it first trains a small proxy model to work out domain weights for the training data, then trains the larger model on data reweighted accordingly:
https://neurips.cc/virtual/2023/poster/70588
DoReMi improves perplexity across all domains
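Roughly, as I understand the paper, the proxy stage does an exponentiated-gradient update on per-domain weights based on the proxy model’s excess loss over a reference model, with some smoothing toward uniform. A minimal sketch of just that update (my reading, not the paper’s code; names like `excess_loss` and the constants are my assumptions):

```julia
# Sketch of a DoReMi-style domain-weight update (my reading of the paper,
# not its reference implementation). Given per-domain "excess loss"
# (proxy loss minus reference-model loss), upweight domains where the proxy
# still lags, then smooth with a uniform prior.
function update_domain_weights(α::Vector{Float64}, excess_loss::Vector{Float64};
                               η = 1.0, smoothing = 1e-3)
    logα = log.(α) .+ η .* max.(excess_loss, 0.0)   # exponentiated-gradient step
    α = exp.(logα .- maximum(logα))                  # renormalize stably
    α ./= sum(α)
    k = length(α)
    return (1 - smoothing) .* α .+ smoothing ./ k    # mix in uniform weights
end

# Example: three domains, the second one is hardest for the proxy model.
α = fill(1/3, 3)
α = update_domain_weights(α, [0.1, 0.8, 0.2])
```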
Not to be confused with the other DoReMi (looking it up again got me confused…); that paper is on robotics:
On this:
What do you mean? I’m not sure I know enough about “attention”; it basically means transformers, right, and we have them already?
https://h2o.ai/wiki/self-attention/
https://h2o.ai/wiki/attention-mechanism/
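For reference, the core of (single-head) scaled dot-product self-attention is only a few lines. A minimal sketch in plain Julia (dense arrays, no masking or batching), just to pin down what “attention” means here:

```julia
# Single-head scaled dot-product self-attention over a length-L sequence of
# d-dimensional vectors X (d × L). Wq, Wk, Wv are learned projections.
function self_attention(X, Wq, Wk, Wv)
    Q, K, V = Wq * X, Wk * X, Wv * X               # project to queries/keys/values
    d = size(Q, 1)
    scores = (K' * Q) ./ sqrt(d)                   # L × L similarity matrix (the quadratic part)
    A = exp.(scores .- maximum(scores, dims = 1))  # numerically stable softmax over keys
    A ./= sum(A, dims = 1)
    return V * A                                   # weighted sum of values per query
end

d, L = 8, 5
X = randn(d, L)
Y = self_attention(X, randn(d, d), randn(d, d), randn(d, d))
```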
Do we have all the variants? Surely not all of them; do you mean something like:
Better performance with lower precision: FlashAttention-3 can work with lower precision numbers (FP8) while maintaining accuracy.
Such lower precision requires recent hardware (GPUs), and I don’t think we can compete if we don’t use/target it. We are also behind on even more heavily quantized models. We have SafeTensors.jl, but that format and code seem limited to bfloat16 and FP8 at the smallest (or so it seems; maybe not inherently, and it will support smaller?). It uses DLFP8Types.jl, so FP8 seems to be implemented in software (and therefore slowly).
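Just to illustrate the software-emulation point: narrow formats are easy to represent on the CPU side (e.g. with BFloat16s.jl), but without hardware support the arithmetic is emulated and slow. A tiny sketch of the precision loss a cast introduces (I’m using BFloat16s.jl here because I haven’t checked DLFP8Types.jl’s API, so an FP8 version would look similar but is an assumption):

```julia
using BFloat16s   # pure-Julia bfloat16; FP8 via DLFP8Types.jl would be analogous, software-emulated

w32 = randn(Float32, 1000)
w16 = BFloat16.(w32)                       # cast weights down to bfloat16
err = maximum(abs.(Float32.(w16) .- w32))  # worst-case rounding error from the cast
println("max abs rounding error: ", err)
```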
Did you mean we need Grouped Query Attention? I’m not sure; we might have it already.
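For what it’s worth, Grouped Query Attention is less a new kernel than a head-sharing scheme: n_q query heads reuse n_kv < n_q key/value heads, which shrinks the KV cache. A tiny sketch of just the grouping (my own illustration, not any particular library’s API):

```julia
# Grouped Query Attention, in essence: n_q query heads share n_kv < n_q
# key/value heads. Each query head h uses K/V head group(h):
n_q, n_kv = 8, 2
group(h) = cld(h, n_q ÷ n_kv)              # heads 1–4 → KV head 1, heads 5–8 → KV head 2
@assert group.(1:n_q) == [1, 1, 1, 1, 2, 2, 2, 2]
# Standard multi-head attention is the n_kv == n_q case; multi-query attention is n_kv == 1.
```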
… Tri’s publication history has been leaning toward SSM and Mamba style architectures recently. Unlike Flash Attention which has quadratic time complexity wrt sequence length, these latest algorithms are subquadratic. Thus they do much less computation, instead of just doing it more efficiently a la Flash Attention.
Dao and Gu published a really long paper this year which demonstrated (among other things) how Mamba/SSM can be formulated such that it’s amenable to acceleration using the same hardware primitives that Transformers benefit from. …
…Until the strong exponential hypothesis is (dis-)proven, the quadratic cost is required or you have to give something up. Just the cost of exhaustive search.
As (dis-)proving SETH will resolve the P vs NP problem, I wouldn’t hold my breath. …
Maybe Mamba, SSMs, Jamba, or some linear transformer will take over, but it seems to me that just sticking with the quadratic transformer is a safe bet (and even if it turns out not to be good enough, the code can be changed later?).
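The “subquadratic” point in those quotes comes down to replacing the L × L attention matrix with a recurrence that runs once over the sequence. A minimal, hypothetical sketch of a diagonal linear state-space scan (not Mamba itself, just the shape of the computation):

```julia
using LinearAlgebra   # for dot

# Diagonal linear SSM scan: h_t = a .* h_{t-1} .+ b .* x_t,  y_t = c' * h_t.
# One pass over the sequence => O(L) in sequence length, vs O(L^2) for attention.
function ssm_scan(x::Vector{Float64}; n = 16)
    a = fill(0.9, n)             # state decay, kept stable (|a| < 1)
    b = randn(n); c = randn(n)   # input and output projections
    h = zeros(n)
    y = similar(x)
    for (t, xt) in enumerate(x)
        h = a .* h .+ b .* xt
        y[t] = dot(c, h)
    end
    return y
end

y = ssm_scan(randn(100))
```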
We have many (most?) optimizers here, e.g. many Adam variants:
With version 0.4 the default update rule for AdamW has changed to match the pytorch implementation.
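Concretely, the Optimisers.jl API we already have covers this; a small example of setting up AdamW and taking one update step (the model and gradient here are just made-up stand-ins for illustration):

```julia
using Optimisers

model = (W = randn(3, 3), b = zeros(3))                 # any nested structure of arrays works
state = Optimisers.setup(Optimisers.AdamW(1e-3), model) # per-parameter optimizer state
grads = (W = ones(3, 3), b = ones(3))                   # stand-in gradient (normally from Zygote/Enzyme)
state, model = Optimisers.update(state, model, grads)   # one AdamW step
```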
We even have API · Optimisers.jl (once the best, I thought; no longer?), but clicking on some of its docs shows strange (placeholder?) text:
In addition to the main course, you may wish to order some of these condiments:
I thought we redundantly had these at SciML too, but its docs actually link over to flux.ml:
Optimisers.RMSProp: RMSProp optimizer
Optimisers.Adam: Adam optimizer