I was talking about bfloat16 being dominant for training, not Float32, and then quantization being used a lot. But not KANs, they are new. Sorry for the confusion: I just immediately continued by mentioning them; I didn’t explicitly say they were popular yet, but I think they will be, in transformers. Quantization to 4-bit is, I think, mainstream, though I often see none used at release, i.e. models are first released in e.g. bfloat16, and then the quantization community post-quantizes them, and maybe fine-tunes.
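Just to make that post-quantization step concrete, here is a rough sketch of symmetric blockwise 4-bit quantization in Julia; the block size, function names, and round-trip check are my own illustrative choices, and real schemes (GPTQ, AWQ, NF4, …) add calibration data, non-uniform codebooks, grouped scales, etc.:

```julia
# Sketch only: symmetric blockwise 4-bit quantization of weights.

# Quantize one block to signed 4-bit integers in -7:7 plus a per-block scale.
function quantize4(block::AbstractVector{<:Real})
    scale = maximum(abs, block) / 7
    q = round.(Int8, clamp.(block ./ max(scale, eps(Float32)), -7, 7))
    return q, Float32(scale)
end

dequantize4(q, scale) = Float32.(q) .* scale

# Quantize a whole weight matrix in flat blocks of 64 values.
function quantize_blocks(W::AbstractMatrix; blocksize::Int = 64)
    v = vec(Float32.(W))   # e.g. weights that were shipped in bfloat16
    return [quantize4(collect(b)) for b in Iterators.partition(v, blocksize)]
end

W = randn(Float32, 128, 128)
qblocks = quantize_blocks(W)
W2 = reshape(reduce(vcat, [dequantize4(q, s) for (q, s) in qblocks]), size(W))
println("max abs round-trip error: ", maximum(abs, W .- W2))
```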
I see from @ForceBru “Just use automatic differentiation” for backpropagation for KANs. And a paper I linked “trained four KAN models using PyTorch, each sized 17x1x14, with G values of 7, 15, 30, and 60 corresponding to array sizes of 128, 256, 512, and 1024, respectively.” So I certainly think KANs would fit into Flux.jl. While I’m no expert on Flux or Lux, if KANs do not fit there then they should. An alternative, and an OK first step, is to do it independently of them, as with: “we use KANs as a nice opportunity to implement them from scratch in simple Python (no PyTorch / TensorFlow: just some good old numpy!).”
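To back up that KANs should train with plain AD, here is a minimal KAN-style layer sketch in Julia; the Gaussian RBF basis, field names, and shapes are my own simplifications (the KAN paper uses B-splines plus a SiLU residual term), so treat it as an illustration rather than a faithful port:

```julia
using Flux   # recent Flux, for @layer; gradients come from Zygote underneath

# One KAN-style layer: every edge (i, j) carries its own learnable univariate
# function φ_ij, here a linear combination of fixed Gaussian basis functions.
struct KANLayer{T}
    coeffs::Array{T,3}    # (out, in, nbasis) learnable basis coefficients
    centers::Vector{T}    # fixed basis centers on a grid
    width::T              # fixed basis width
end
Flux.@layer KANLayer trainable=(coeffs,)

function KANLayer(nin::Int, nout::Int; nbasis::Int=8, lo=-1f0, hi=1f0)
    centers = collect(Float32, range(lo, hi; length=nbasis))
    KANLayer(0.1f0 .* randn(Float32, nout, nin, nbasis),
             centers, Float32((hi - lo) / nbasis))
end

rbf(x, c, w) = exp(-((x - c) / w)^2)

# Forward pass for a single sample (no mutation, so Zygote can differentiate it).
function (l::KANLayer)(x::AbstractVector)
    nout, nin, nb = size(l.coeffs)
    B = rbf.(x, permutedims(l.centers), l.width)       # (nin, nbasis)
    return reshape(l.coeffs, nout, nin * nb) * vec(B)  # y_i = Σ_jk c_ijk B_jk
end

# E.g. the 17x1x14 shape from the quoted paper:
model = Chain(KANLayer(17, 1), KANLayer(1, 14))
x = rand(Float32, 17)
g = gradient(m -> sum(abs2, m(x)), model)   # plain AD, as @ForceBru suggested
```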
Amazing page:
Look at e.g. “Parallelism Concepts”.
And Julia is most likely behind on these parallelism strategies.
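For one concrete example of what that page covers, here is a toy Julia illustration of tensor (model) parallelism; the shard ranges and function name are mine, and real implementations shard across GPUs and combine partial results with all-reduce collectives rather than an in-process sum:

```julia
# Toy tensor-parallel matrix-vector product: each "worker" holds one column
# shard of W and the matching slice of x, computes a partial result, and an
# all-reduce (here just a sum in-process) combines them.
function tensor_parallel_matvec(W::AbstractMatrix, x::AbstractVector, nshards::Int)
    n = size(W, 2)
    ranges = [round(Int, n * (k - 1) / nshards) + 1 : round(Int, n * k / nshards)
              for k in 1:nshards]
    partials = [W[:, r] * x[r] for r in ranges]   # one partial product per shard
    return reduce(+, partials)                    # the "all-reduce"
end

W, x = randn(8, 1000), randn(1000)
@assert tensor_parallel_matvec(W, x, 4) ≈ W * x
```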
It’s not too important to do everything from scratch (see e.g. Jjama3.jl, @noob, @AntonOresten, which is impure, depending on Python/Rust code, though why not use the Rust directly? Its other dependency, BytePairEncoding.jl, is though a “Pure Julia implementation of the Byte Pair Encoding (BPE) method.”):
I absolutely agree we shouldn’t bother implementing tokenizers in Julia, rather reuse them, and even better get rid of them entirely; a toy sketch of the separate BPE training stage follows after the quote below. (I also see Karpathy is now at a new AI company, Eureka Labs, after leaving OpenAI, and Tesla before that):
There is a whole separate stage with its own training and inference, and additional libraries. It complicates the ingest of additional modalities. Tokenization also has many subtle sharp edges. Few examples: […]
Tokenization creates attack surfaces, e.g. SolidGoldMagikarp […]
The list goes on, TLDR everyone should hope that tokenization could be thrown away. Maybe even more importantly, we may find general-purpose strategies for multi-scale training in the process.
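Just to make the quote’s “whole separate stage with its own training” concrete, here is a toy BPE merge loop in Julia; the names are mine, and BytePairEncoding.jl or the HF tokenizers additionally work on bytes, pre-tokenize on whitespace, handle special tokens, and store merge ranks for encoding:

```julia
# Toy BPE "training": repeatedly merge the most frequent adjacent symbol pair.
function most_frequent_pair(tokens::Vector{String})
    counts = Dict{Tuple{String,String},Int}()
    for i in 1:length(tokens)-1
        p = (tokens[i], tokens[i+1])
        counts[p] = get(counts, p, 0) + 1
    end
    return argmax(counts)           # key (pair) with the highest count
end

function merge_pair(tokens::Vector{String}, pair)
    out, i = String[], 1
    while i <= length(tokens)
        if i < length(tokens) && (tokens[i], tokens[i+1]) == pair
            push!(out, tokens[i] * tokens[i+1]); i += 2
        else
            push!(out, tokens[i]); i += 1
        end
    end
    return out
end

function train_bpe(text::String, nmerges::Int)
    tokens = string.(collect(text))   # start from characters; real BPE uses bytes
    merges = Tuple{String,String}[]
    for _ in 1:nmerges
        length(tokens) < 2 && break
        p = most_frequent_pair(tokens)
        push!(merges, p)
        tokens = merge_pair(tokens, p)
    end
    return merges, tokens
end

merges, toks = train_bpe("low lower lowest low low", 6)
```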
Looking into “multi-scale training” I find a lot (most of it not directly on LLMs but on images or time series; not sure if the ideas translate to LLMs):
https://arxiv.org/html/2410.11674
See also Open-Sora, referenced there:
Something I came across but haven’t looked at closely enough to know if it’s relevant for us:
This seems important and only 9 pages:
https://arxiv.org/pdf/2407.00952
https://arxiv.org/pdf/2405.09394
“Experimental results demonstrate that SA-FedLoRA is an efficient FL, achieving superior performance to FedAvg and significantly reducing communication parameters by up to 93.62%”
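The direction of that saving is easy to sanity-check from the LoRA parameterization itself: clients exchange the low-rank factors instead of full weight updates. A back-of-the-envelope in Julia, using generic LoRA counting and a hypothetical layer size (the paper’s exact 93.62% depends on their full setup and scheduling, not just this per-layer ratio):

```julia
# Generic LoRA counting: a d×k update ΔW is factored as B*A with B of size d×r
# and A of size r×k, so a client ships r*(d+k) numbers instead of d*k.
lora_params(d, k, r) = r * (d + k)
full_params(d, k)    = d * k

d, k, r = 4096, 4096, 8          # hypothetical transformer-sized weight, rank 8
reduction = 1 - lora_params(d, k, r) / full_params(d, k)
println("communication reduction: ", round(100 * reduction; digits = 2), "%")
# prints ≈ 99.61% for this single layer at rank 8
```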
I did not expect to see “wireless” and “jamming resistant” in relation to LLMs: