Realistically, how close is Gaius.jl to becoming a full replacement for BLAS in Julia?

I read the above and was amazed. I always thought that BLAS, MKL, etc. were so complex and so well optimised that they were pretty much irreplaceable.

Even if Gaius is slower in some cases, how close are we to doing away with BLAS altogether, even as an experiment?

E.g. can I compile Julia from scratch, remove all the BLAS bits, and call Gaius instead?

It might be slow, but I want to know how close we are to that. Is that even achievable in the foreseeable future?
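Concretely, this is the sort of experiment I have in mind (a sketch assuming the `Gaius.mul!` entry point shown in the Gaius.jl README, and skipping the Julia rebuild, since Gaius can be called directly):

```julia
# Sketch: route a matrix multiply through Gaius instead of the
# BLAS-backed default -- assumes Gaius.jl is installed (] add Gaius).
using Gaius
using LinearAlgebra

A, B = rand(512, 512), rand(512, 512)
C = similar(A)

Gaius.mul!(C, A, B)   # pure-Julia, multithreaded multiply
C ≈ A * B             # compare against the default BLAS-backed *
```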


It’s still missing a bunch of kernels, but it’s quite usable today for the kernels it has. https://github.com/YingboMa/MaBLAS.jl and https://github.com/chriselrod/PaddedMatrices.jl are two other examples as well. It’ll take quite a while to turn these into a full BLAS, mostly because no one is working on that full time. That said, not all of the traditional BLAS kernels need to be implemented for a Julia BLAS: things like BLAS1 are just broadcasts, so it’s somewhat better not to implement them at all (that way they can fuse with surrounding operations). You can probably cut the required surface down to just a few core functions, of which matrix multiplication is the key one, and that’s already done.
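To make the BLAS1 point concrete, here’s a minimal sketch using only standard LinearAlgebra (nothing Gaius-specific assumed):

```julia
using LinearAlgebra

a = 2.0
x, y = rand(1_000), rand(1_000)

# BLAS level-1 axpy: y <- a*x + y, via the wrapped BLAS routine
BLAS.axpy!(a, x, y)

# The same operation as a plain broadcast -- no BLAS kernel involved:
y .= a .* x .+ y

# Unlike a fixed axpy kernel, the broadcast fuses with neighbouring
# elementwise work into a single loop with no temporaries:
z = @. a * x + sin(y)
```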


You might want to check out [ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays for some truly impressive benchmark results and discussion of PaddedMatrices.jl.


Regarding Gaius.jl, I wrote a bit about this here: [ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

TLDR: I’m not working on Gaius right now and I have no real plans to. It was never a serious project, more of an exploration of what was possible at a high level without nitty-gritty knowledge of BLAS kernels.


To be honest, I have no real interest in writing the sort of super-specialized low-level code that would be required to do this right. My intention with Gaius was to see how well you can do with the high-level tools available to me at the time: LoopVectorization.jl and Julia’s composable multi-threading, which allowed for a multi-threaded recursive divide-and-conquer strategy.
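Roughly, the strategy looked like this (a toy sketch, not Gaius’ actual code; the real base case is a LoopVectorization kernel rather than a plain loop, and it splits along all three dimensions):

```julia
using Base.Threads: @spawn

# Toy divide-and-conquer matmul in the spirit described above.
function dcmul!(C, A, B; basesize = 64)
    m = size(A, 1)
    if m <= basesize
        # Base-case kernel: a plain triple loop (Gaius uses @avx here).
        @inbounds for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
            C[i, j] += A[i, k] * B[k, j]
        end
    else
        h = m ÷ 2
        # The two row-blocks of C are independent, so @spawn one of them;
        # Julia's composable task scheduler handles the nested recursion.
        t = @spawn dcmul!(view(C, 1:h, :), view(A, 1:h, :), B; basesize = basesize)
        dcmul!(view(C, h+1:m, :), view(A, h+1:m, :), B; basesize = basesize)
        wait(t)
    end
    return C
end

A, B = rand(256, 256), rand(256, 256)
C = zeros(256, 256)       # must start at zero: the kernel accumulates
dcmul!(C, A, B) ≈ A * B   # true
```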

Projects like PaddedMatrices.jl and MaBLAS.jl are much more promising avenues, with a greater likelihood of materializing into something that could actually be used as a real BLAS library, but those are not the sorts of projects I see myself being well enough equipped or motivated to help out with. I just don’t know enough about computer hardware or fancy BLAS iteration schemes to contribute to them.

My real hope is that someone devises an index-notation-based multidimensional tensor contraction library where BLAS just falls out as a special case, rather than building the tensor contractions up out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.
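The kind of notation I mean already exists in, e.g., TensorOperations.jl, though as far as I know it currently lowers contractions to BLAS calls rather than the other way around:

```julia
using TensorOperations   # provides the @tensor index-notation macro

A, B = rand(4, 5), rand(5, 6)

# Matrix multiplication is just the two-index special case:
@tensor C[i, j] := A[i, k] * B[k, j]
C ≈ A * B   # true

# The same notation covers general contractions, e.g. tensor-vector:
T, v = rand(4, 5, 6), rand(6)
@tensor M[i, k] := T[i, k, l] * v[l]
```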
