Regarding Gaius.jl, I wrote a bit about this here: [ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays - #28 by Mason
TLDR: I’m not working on Gaius right now and I have no real plans to resume. It was never a serious project; it was more an exploration of what was possible at a high level, without nitty-gritty knowledge of BLAS kernels.
To be honest, I have no real interest in writing the sort of super-specialized, low-level code that would be required to do this right. My intention with Gaius was to see how well you can do with the high-level tools available to me at the time: LoopVectorization.jl and Julia’s composable multi-threading, which together allowed for a multi-threaded recursive divide-and-conquer strategy (sketched below).
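To make that concrete, here’s a minimal sketch of the general idea: recursively halve the output matrix, `@spawn` one half onto another task, and fall back to a plain LoopVectorization loop nest once the blocks are small. The function names and the `cutoff` parameter are illustrative, not Gaius’s actual API, and unlike Gaius this sketch doesn’t split the reduction dimension:

```julia
using LoopVectorization           # provides @turbo (called @avx in older versions)
using Base.Threads: @spawn

# Serial base case: the standard LoopVectorization matmul loop nest.
function matmul_kernel!(C, A, B)
    @turbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

# Recursive divide and conquer over the output: halve the larger of the
# m and n dimensions, run one half on a spawned task, recurse on the
# other half on the current task, then wait.
function matmul!(C, A, B; cutoff = 64)
    m, n = size(C)
    if max(m, n) <= cutoff
        return matmul_kernel!(C, A, B)
    elseif m >= n
        h = m ÷ 2
        t = @spawn matmul!(view(C, 1:h, :), view(A, 1:h, :), B; cutoff = cutoff)
        matmul!(view(C, h+1:m, :), view(A, h+1:m, :), B; cutoff = cutoff)
        wait(t)
    else
        h = n ÷ 2
        t = @spawn matmul!(view(C, :, 1:h), A, view(B, :, 1:h); cutoff = cutoff)
        matmul!(view(C, :, h+1:n), A, view(B, :, h+1:n); cutoff = cutoff)
        wait(t)
    end
    return C
end
```

Run Julia with multiple threads (e.g. `julia -t auto`) for the `@spawn` calls to actually execute in parallel; the nice thing about composable multi-threading is that the recursion needs no explicit scheduling logic at all.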
Projects like PaddedMatrices.jl and MaBLAS.jl are much more promising avenues, with a greater likelihood of materializing into something that could actually be used as a real BLAS library, but these are not the sorts of projects I can see myself being well enough equipped or motivated to help out with. I just don’t know enough about computer hardware or fancy BLAS iteration schemes to be able to contribute.
My real hope is that someone can devise an index-notation-based multidimensional tensor contraction library where BLAS just comes out as a special case, rather than building tensor contractions up out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.
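For a taste of the notation I mean, Tullio.jl already points in this direction: it compiles index expressions down to loop nests (using LoopVectorization kernels when it’s loaded) rather than reshaping and dispatching to BLAS, though as far as I know it doesn’t match BLAS performance at large sizes. The arrays and index names below are just illustrative:

```julia
using Tullio, LoopVectorization  # Tullio emits @turbo loop nests when LoopVectorization is loaded

A = rand(100, 50); B = rand(50, 80)

# Matrix multiplication is just one contraction among many in this notation:
@tullio C[i, j] := A[i, k] * B[k, j]   # C = A * B, summing over the repeated index k

# A genuinely multidimensional contraction uses the exact same notation,
# with no mapping onto matrix-matrix kernels behind the scenes:
T = rand(100, 50, 80)
@tullio v[j] := T[i, k, j] * A[i, k]   # sums over both i and k
```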