A. First: very intriguing, and also has good background on traditional linear algebra and matmul:
B.
It’s intriguing that you can get a 10–20x speedup with randomization, but now on to what most people use:
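(For context on what I mean by randomization, here is a minimal sketch, assuming the speedup comes from a column/row-sampling scheme in the style of Drineas–Kannan–Mahoney; the function name and sampling rule are illustrative, not from any particular package.)

```julia
using LinearAlgebra, Random

# Approximate A*B by sampling s of the n outer products A[:,k]*B[k,:]'
# with probability proportional to their norms, rescaled to stay unbiased.
function sampled_matmul(A::AbstractMatrix, B::AbstractMatrix, s::Int)
    n = size(A, 2)
    @assert n == size(B, 1)
    p = [norm(A[:, k]) * norm(B[k, :]) for k in 1:n]
    p ./= sum(p)
    cp = cumsum(p)
    C = zeros(promote_type(eltype(A), eltype(B)), size(A, 1), size(B, 2))
    for _ in 1:s
        k = findfirst(cp .>= rand())              # draw index k with probability p[k]
        C .+= (A[:, k] * B[k, :]') ./ (s * p[k])  # rescaled outer product
    end
    return C
end

A, B = randn(500, 500), randn(500, 500)
C̃ = sampled_matmul(A, B, 100)       # approximate product from 100 sampled outer products
@show norm(C̃ - A*B) / norm(A*B)     # relative error shrinks as s grows
```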
Most use traditional BLAS, i.e. OpenBLAS (or MKL, BLIS.jl, etc.), which comes with Julia (and I want OpenBLAS out… of Julia; it’s a heavy dependency, and I’m thinking redundant given better options). Meaning on the CPU, you’re only limited by main memory. But some use the GPU, e.g. cuBLAS. I’m thinking the only reason is that the GPU is faster, but it (potentially) has less memory.
Does nobody do both? I.e. start matmul on the CPU, with some library that sends it to the GPU in chunks? For matrices that would fit on the GPU you would likely just start there. Or I believe the GPU has some virtual memory management by now, so it’s not limited to its own memory, and you could start there with matrices as large as you want?
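The chunking idea I have in mind looks roughly like this (a minimal sketch, assuming CUDA.jl; keeping A resident on the GPU and streaming B and the result through it in column blocks is my own illustrative choice, not taken from any particular library, and the block size is arbitrary):

```julia
using CUDA, LinearAlgebra

# Multiply A*B where only A (plus one block of B and C) needs to fit on the GPU.
function chunked_gpu_matmul(A::Matrix{Float32}, B::Matrix{Float32}; blockcols::Int = 4096)
    m, n = size(A); _, p = size(B)
    C = Matrix{Float32}(undef, m, p)
    dA = CuArray(A)                          # A stays on the GPU for all blocks
    for j in 1:blockcols:p
        cols = j:min(j + blockcols - 1, p)
        dB = CuArray(B[:, cols])             # ship one column block of B over the bus
        C[:, cols] = Array(dA * dB)          # multiply on the GPU (cuBLAS), copy the block back
    end
    return C
end
```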
Matmul is O(n^3) as implemented (better is possible, but not done in practice). But that’s counting arithmetic operations. I understand it’s actually O(n^2) in memory traffic, which dominates, assuming the data fits in (L3) cache? How does that work for the GPU? I think they have caches by now, and/or HBM memory.
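A back-of-the-envelope check of that flops-vs-traffic argument (the numbers are just illustrative):

```julia
# Multiplying two n×n Float64 matrices does ~2n^3 flops on only 3n^2 numbers.
n = 4_000
flops = 2n^3          # ≈ 1.3e11 arithmetic operations
words = 3n^2          # ≈ 4.8e7 numbers touched (A, B, C)
bytes = 8 * words     # ≈ 0.38 GB of data
# If that data fits in (or is blocked through) cache, traffic stays O(n^2) and
# the arithmetic dominates; a blocked kernel whose working set does not fit
# moves roughly 2n^3/sqrt(M) words instead, where M is the cache size in words.
println("arithmetic intensity ≈ ", flops / bytes, " flops per byte")
```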
Whatever you do, CPU only, GPU only, or some hybrid, I think it would also benefit from randomization and from smaller precision. How small, potentially? You can work with Float64 only (on CPU, or GPU, though less likely there), Float32 only, or mixed precision. Is that currently only done on the GPU? GPUs have [b]float16 (as do more recent CPUs), and (fast, by now standardized) Float8, and the latest Nvidia FP4. But that’s likely useless for most inputs, only for neural networks. Also outdated there…
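To make the mixed-precision point concrete, here is a minimal CPU sketch (my own illustrative code, not a library API): store the matrices in Float16 to halve memory traffic, but accumulate each dot product in Float32, which is roughly what GPU tensor cores do for fp16/bf16 inputs.

```julia
using LinearAlgebra

# Float16 storage, Float32 accumulation.
function matmul_fp16_accum32(A::Matrix{Float16}, B::Matrix{Float16})
    m, n = size(A); _, p = size(B)
    C = Matrix{Float32}(undef, m, p)
    for j in 1:p, i in 1:m
        acc = 0.0f0                       # Float32 accumulator
        @inbounds for k in 1:n
            acc += Float32(A[i, k]) * Float32(B[k, j])
        end
        C[i, j] = acc
    end
    return C
end

A = Float16.(randn(Float32, 256, 256)); B = Float16.(randn(Float32, 256, 256))
ref = Float64.(A) * Float64.(B)
@show norm(matmul_fp16_accum32(A, B) - ref) / norm(ref)   # limited mostly by fp16 input rounding
@show norm(Float64.(A * B) - ref) / norm(ref)             # pure Float16 accumulation drifts more
```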