BLAS on CPU, GPU, or both? And randomized linear algebra

A. First, this one is very intriguing, and also has good background on traditional linear algebra and matmul:

B.
It’s intriguing that you can get a 10-20x speedup with randomization; a minimal sketch of the core idea is below. But now on to what most people use.
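(Not the linked method, just the general flavor: a randomized range finder in the style of Halko, Martinsson & Tropp, where `rand_lowrank`, `Ω`, `k`, and `p` are my own names.)

```julia
using LinearAlgebra, Random

# Compress A to rank ~k with a Gaussian test matrix; once you have the
# factorization, products with A cost O(n^2 k) instead of O(n^3).
function rand_lowrank(A::AbstractMatrix, k::Integer; p::Integer = 10)
    Ω = randn(size(A, 2), k + p)   # Gaussian test matrix, p = oversampling
    Q = Matrix(qr(A * Ω).Q)        # orthonormal basis for the sampled range
    return Q, Q' * A               # A ≈ Q * (Q' * A)
end
```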

Most use traditional BLAS, i.e. OpenBLAS (or MKL, BLIS.jl, etc.); OpenBLAS is what ships with Julia (and I want OpenBLAS out… of Julia, it’s a heavy dependency, and I’m thinking redundant given better options). Meaning on the CPU, limited only by main memory. But some use the GPU, e.g. cuBLAS. I’m thinking the only reason is that the GPU is faster, but it (potentially) has less memory.
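For context, Julia already routes BLAS through libblastrampoline, so the backend can be inspected and swapped at runtime; a quick check (MKL.jl being one drop-in replacement):

```julia
using LinearAlgebra

BLAS.get_config()        # which BLAS/LAPACK libraries are loaded (OpenBLAS by default)
BLAS.get_num_threads()   # BLAS thread count, separate from Julia's own threads

# Loading MKL.jl re-routes BLAS/LAPACK calls through MKL at runtime:
# using MKL
# BLAS.get_config()      # now reports MKL instead of OpenBLAS
```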

Does no library do both? I.e. start matmul on the CPU, with some library that sends work to the GPU in chunks? For matrices that would fit on the GPU you would likely just start there. Or I believe GPUs have some virtual memory management by now (unified memory), so they’re not limited to on-board memory, and you could start there with matrices as large as you want?
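I haven’t seen a library that streams like this automatically, but hand-rolling it is straightforward with CUDA.jl; a minimal sketch, assuming A fits in device memory while B and C may not (`chunked_gpu_mul!` and `blockcols` are my own names):

```julia
using LinearAlgebra, CUDA

# Stream C = A*B through the GPU in column blocks of B, so only A plus one
# block of B and C has to fit in device memory at a time.
function chunked_gpu_mul!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T};
                          blockcols::Int = 4096) where {T}
    dA = CuArray(A)                   # upload A once
    for j in 1:blockcols:size(B, 2)
        cols = j:min(j + blockcols - 1, size(B, 2))
        dB = CuArray(B[:, cols])      # upload one block of B
        C[:, cols] = Array(dA * dB)   # cuBLAS GEMM, then download the block
    end
    return C
end
```

(Recent CUDA.jl also exposes unified memory, I believe via `cu(A; unified=true)`, which lets the driver page data in and out instead of you chunking by hand.)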

Matmul is O(n^3), at least as implemented (asymptotically better is possible, e.g. Strassen, but not used in practice). But that’s counting arithmetic operations. I understand it’s actually O(n^2) in memory-traffic operations, which dominate, assuming the data fits in (L3) cache? How does that work on the GPU? I think they have caches by now, and/or HBM memory.
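To make that counting concrete (square n×n matrices, a fast memory of M words; this is roughly the classical Hong–Kung result for blocked matmul):

```latex
\text{flops} = 2n^3,
\qquad
\text{memory traffic} =
\begin{cases}
3n^2 \text{ words} & \text{if all three matrices fit in cache},\\
\Theta\!\left(n^3/\sqrt{M}\right) \text{ words} & \text{otherwise (optimally blocked)}.
\end{cases}
```

The same arithmetic applies on a GPU: HBM plays the role of main memory, and shared memory/registers play the role of cache, which is why GPU GEMM kernels tile so aggressively.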

Whatever you do, CPU only, GPU only, or some hybrid, I think it would also benefit from randomization and smaller precision. How small, potentially? You can work with Float64 only (on the CPU, or the GPU, though it’s less common there), Float32 only, or mixed precision; is that currently only done on the GPU? GPUs have [b]float16 (as do more recent CPUs), and (fast, by now standardized) Float8, and the latest Nvidia FP4. But that’s likely useless for most inputs, only for neural networks. And possibly already outdated even there…
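On the CPU it’s at least easy to measure what you give up; a minimal Float32-vs-Float64 comparison (n is arbitrary):

```julia
using LinearAlgebra

# Do the O(n^3) work in Float32 (half the memory traffic of Float64),
# then compare against a Float64 reference to see the accuracy cost.
n = 2_000
A, B = rand(n, n), rand(n, n)                # Float64 inputs
C64 = A * B                                  # reference GEMM in Float64
C32 = Float32.(A) * Float32.(B)              # GEMM in Float32
@show norm(C64 - Float64.(C32)) / norm(C64)  # roughly 1e-6 at this size
```

And mixed precision isn’t GPU-only: the classic CPU recipe is a low-precision factorization followed by iterative refinement in Float64 (LAPACK’s dsgesv does exactly this for linear solves).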
