@inbounds: is the compiler now so smart that this is no longer necessary?

Yes, I think we can write a BLAS in Julia.
I’ve been playing around with it for a little while. You can check out some work here:

Major issues:

  1. It only works for matrices whose dimensions are a multiple of the kernel size right now. I haven’t filled in the missing edges yet (see the sketch after this list).
  2. It only works for MMatrices from StaticArrays. Once I finish the rest, I’ll add a non-generated version that works for regular old arrays (which will probably have just a tiny bit more overhead from calculating some things at runtime).
  3. I haven’t played with kernels involving matrix transposes yet.
  4. I haven’t implemented a recursive algorithm yet. I think if I load submatrices recursively, I might be able to scale better. Current scaling is a lot worse than O(n^3) on my Intel processor with AVX-512, but only a tiny bit worse on Ryzen. The AVX-512 chip has 4x the per-core throughput of Ryzen, so its ratio of compute throughput to memory speed is a lot worse.
  5. No threading!
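
For issue 1, here is a minimal sketch of one way the missing edges could eventually be filled in. This is not the jBLAS code; kernel!, MR, and NR are hypothetical stand-ins for the register-blocked kernel and its tile sizes. The idea is to run the fast kernel over the tiled interior and fall back to a plain triple loop for the leftover rows and columns.

function matmul_with_edges!(C, A, B, kernel!, MR, NR)
    M, N = size(C); K = size(A, 2)
    Mfull = MR * (M ÷ MR)        # interior rows covered by full kernel tiles
    Nfull = NR * (N ÷ NR)        # interior columns covered by full kernel tiles
    for j in 1:NR:Nfull, i in 1:MR:Mfull
        kernel!(C, A, B, i, j)   # fast MR×NR register-blocked kernel (hypothetical)
    end
    # naive cleanup of the right/bottom edges
    @inbounds for j in 1:N, i in 1:M
        (i <= Mfull && j <= Nfull) && continue
        s = zero(eltype(C))
        for k in 1:K
            s += A[i, k] * B[k, j]
        end
        C[i, j] = s
    end
    return C
end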

That said, it looks good! Here it is versus OpenBLAS with AVX-512 on modest-sized matrices:

julia> using OhMyREPL, jBLAS, BenchmarkTools, LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> A128_128 = mrandn(128,128); # WAY faster than @MMatrix randn(M, N) for M,N > TINY_NUMBER

julia> B128_126 = mrandn(128,126);

julia> C128_126 = mrandn(128,126);

julia> jmul!(C128_126, A128_128, B128_126) # Confirming the answer is correct
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
  -7.01447    -2.98439    -5.74117    -9.92775   -1.63882   -17.4866    …    2.50444   -15.7145     -2.52175    -9.72732     8.23362    -0.0480611
 -13.3085      8.11033    11.9918     -8.50466  -16.7253     10.7143       -18.5143     -2.83639    -7.44774    20.3162     21.9342     11.5031   
  -7.28199    -4.51168    -4.77706     1.99367   11.8847     -2.24961        6.93923    19.1472     -2.97989    11.0127     -9.27231   -12.1461   
   2.60148    -8.7429    -13.6254     -4.60391   -4.28139     9.45661       -8.99027    -8.84954    17.4719      5.47737    -0.729445   -0.889929 
   7.92216    16.7981    -25.3232      7.79537   -5.79957    15.1526         1.85937    -5.81069     6.26211     6.36528   -10.2083    -13.9899   
   0.979052   -3.23154    22.9183      1.58329   17.2893     -4.77779   …   -1.71329    -0.345357   10.1351     11.725      17.9994      4.13137  
 -12.4654     -0.792605   -6.28939    11.8703     5.06421   -13.3586        17.5039      4.64551     0.710493  -13.2521      5.19574     4.90359  
   8.91004    13.639       3.06454     6.00014  -12.2289     20.0813        -2.97452    -0.23808     4.63956    -2.57015     1.46739   -23.3573   
 -16.5464     -2.53216    -1.79439     8.16519   -1.01636    -7.23788        5.60839    11.4347      3.56887     2.96409    -9.67572    -7.79263  
  16.0071     43.5603      4.85504    23.9241   -18.9683     10.9092        20.7221    -22.2934     -6.03801     9.84488    -8.57055   -22.9826   
  -5.39735    -6.35583     5.50225    -7.39742   -4.2174     -3.1449    …    9.08321    17.6362    -24.2872    -12.1334     -9.25208    18.5631   
  -0.732309   -1.07189     1.65641    14.5927    -6.79831    15.3076        21.3605     -1.77275     7.6672    -16.556     -31.9151     -6.92279  
  12.066     -17.7769     -4.3234     26.788     16.565      -6.90403      -12.7472     20.2385     -8.46922     5.89168     1.0592      7.93109  
  11.4157     -5.31638    -0.804631   -5.265     -7.54981    12.7636         5.82927    -2.67509    13.0376     11.4717     -7.15938     4.45602  
 -12.1544      7.99333    13.4734    -18.5341    -3.25063     4.44092       16.3537      8.65397     6.89471    -5.06701     0.826184   -2.90417  
   ⋮                                                          ⋮         ⋱    ⋮                                                           ⋮        
   3.99748   -11.4429      2.66474   -24.4344    -2.51918    -1.79083        1.51211    15.0584    -23.3553      0.185678   13.1895     21.9612   
   6.95632    18.9656    -28.5299     12.5857    17.2015      8.48111       -7.30871    -0.775281   10.3668      2.97811     0.366781  -15.2664   
   0.95553   -11.2707      1.44373    10.3198   -15.4154     -4.585     …   17.7993     23.9214      3.86748     8.52783    17.1222      6.19255  
   7.9361      4.20605     9.85275   -11.5198     2.57681    -5.64149       -4.27823    -8.12214    -8.17309    -6.81521    -8.3202     -8.26316  
  -6.06597     3.94461    -9.33761     3.13537    8.77722     1.36968       13.7093     11.4851      1.92219    -3.03784   -16.6991     -4.27387  
  -5.22723   -11.1999     -3.65735    -0.39027   14.0611     -5.49591        8.30515     2.11561    -0.085943    2.98077    -9.54455    -1.59257  
 -17.5957     14.5879     -3.11054    -3.93381   -3.86949    15.4347        17.1275     16.5219     22.5019      5.06741   -12.1815      5.4818   
  14.2494      3.93928     0.326678   -1.06051    5.44846    23.2466    …   -9.01874    -3.71429    15.6083     16.0184     -4.47404    -3.55087  
  -1.22372     7.79477    21.0122     20.1065    12.6066     -4.94802        5.17584    10.2214    -16.1313      7.29539     4.35832     6.49918  
   8.16857     2.32976    -8.22107     8.23127  -10.5885     -1.98484        0.685584   -8.93469    13.0756     -5.6306      2.49534    -8.93337  
   9.36717   -10.5582    -12.3869      2.48226    0.942078  -14.5936        -4.79657   -15.6375     -1.2756      0.011122   22.9251     18.1269   
   0.171453   15.7406     -3.94504     6.73699  -27.3994     19.8036        19.7026     -5.57216    21.47       10.456      -7.53643     6.64545  
  -1.82776    -3.34233    -8.33704   -11.4722    -1.40182     0.569771  …    4.55985    13.976      -4.01863    -4.04004    -6.5058      1.75928  
 -13.3649     -0.879979    8.87948   -16.163      0.670733    0.160372      10.6509     14.8881      8.81195    -9.0987      1.53641    25.0194   
  -1.51149    12.2888     13.5032      4.0348   -10.9137     -2.1944        11.5437    -13.8045     15.9694     -5.84135   -14.469      -6.024    

julia> mul!(C128_126, A128_128, B128_126)
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
  -7.01447    -2.98439    -5.74117    -9.92775   -1.63882   -17.4866    …    2.50444   -15.7145     -2.52175    -9.72732     8.23362    -0.0480611
 -13.3085      8.11033    11.9918     -8.50466  -16.7253     10.7143       -18.5143     -2.83639    -7.44774    20.3162     21.9342     11.5031   
  -7.28199    -4.51168    -4.77706     1.99367   11.8847     -2.24961        6.93923    19.1472     -2.97989    11.0127     -9.27231   -12.1461   
   2.60148    -8.7429    -13.6254     -4.60391   -4.28139     9.45661       -8.99027    -8.84954    17.4719      5.47737    -0.729445   -0.889929 
   7.92216    16.7981    -25.3232      7.79537   -5.79957    15.1526         1.85937    -5.81069     6.26211     6.36528   -10.2083    -13.9899   
   0.979052   -3.23154    22.9183      1.58329   17.2893     -4.77779   …   -1.71329    -0.345357   10.1351     11.725      17.9994      4.13137  
 -12.4654     -0.792605   -6.28939    11.8703     5.06421   -13.3586        17.5039      4.64551     0.710493  -13.2521      5.19574     4.90359  
   8.91004    13.639       3.06454     6.00014  -12.2289     20.0813        -2.97452    -0.23808     4.63956    -2.57015     1.46739   -23.3573   
 -16.5464     -2.53216    -1.79439     8.16519   -1.01636    -7.23788        5.60839    11.4347      3.56887     2.96409    -9.67572    -7.79263  
  16.0071     43.5603      4.85504    23.9241   -18.9683     10.9092        20.7221    -22.2934     -6.03801     9.84488    -8.57055   -22.9826   
  -5.39735    -6.35583     5.50225    -7.39742   -4.2174     -3.1449    …    9.08321    17.6362    -24.2872    -12.1334     -9.25208    18.5631   
  -0.732309   -1.07189     1.65641    14.5927    -6.79831    15.3076        21.3605     -1.77275     7.6672    -16.556     -31.9151     -6.92279  
  12.066     -17.7769     -4.3234     26.788     16.565      -6.90403      -12.7472     20.2385     -8.46922     5.89168     1.0592      7.93109  
  11.4157     -5.31638    -0.804631   -5.265     -7.54981    12.7636         5.82927    -2.67509    13.0376     11.4717     -7.15938     4.45602  
 -12.1544      7.99333    13.4734    -18.5341    -3.25063     4.44092       16.3537      8.65397     6.89471    -5.06701     0.826184   -2.90417  
   ⋮                                                          ⋮         ⋱    ⋮                                                           ⋮        
   3.99748   -11.4429      2.66474   -24.4344    -2.51918    -1.79083        1.51211    15.0584    -23.3553      0.185678   13.1895     21.9612   
   6.95632    18.9656    -28.5299     12.5857    17.2015      8.48111       -7.30871    -0.775281   10.3668      2.97811     0.366781  -15.2664   
   0.95553   -11.2707      1.44373    10.3198   -15.4154     -4.585     …   17.7993     23.9214      3.86748     8.52783    17.1222      6.19255  
   7.9361      4.20605     9.85275   -11.5198     2.57681    -5.64149       -4.27823    -8.12214    -8.17309    -6.81521    -8.3202     -8.26316  
  -6.06597     3.94461    -9.33761     3.13537    8.77722     1.36968       13.7093     11.4851      1.92219    -3.03784   -16.6991     -4.27387  
  -5.22723   -11.1999     -3.65735    -0.39027   14.0611     -5.49591        8.30515     2.11561    -0.085943    2.98077    -9.54455    -1.59257  
 -17.5957     14.5879     -3.11054    -3.93381   -3.86949    15.4347        17.1275     16.5219     22.5019      5.06741   -12.1815      5.4818   
  14.2494      3.93928     0.326678   -1.06051    5.44846    23.2466    …   -9.01874    -3.71429    15.6083     16.0184     -4.47404    -3.55087  
  -1.22372     7.79477    21.0122     20.1065    12.6066     -4.94802        5.17584    10.2214    -16.1313      7.29539     4.35832     6.49918  
   8.16857     2.32976    -8.22107     8.23127  -10.5885     -1.98484        0.685584   -8.93469    13.0756     -5.6306      2.49534    -8.93337  
   9.36717   -10.5582    -12.3869      2.48226    0.942078  -14.5936        -4.79657   -15.6375     -1.2756      0.011122   22.9251     18.1269   
   0.171453   15.7406     -3.94504     6.73699  -27.3994     19.8036        19.7026     -5.57216    21.47       10.456      -7.53643     6.64545  
  -1.82776    -3.34233    -8.33704   -11.4722    -1.40182     0.569771  …    4.55985    13.976      -4.01863    -4.04004    -6.5058      1.75928  
 -13.3649     -0.879979    8.87948   -16.163      0.670733    0.160372      10.6509     14.8881      8.81195    -9.0987      1.53641    25.0194   
  -1.51149    12.2888     13.5032      4.0348   -10.9137     -2.1944        11.5437    -13.8045     15.9694     -5.84135   -14.469      -6.024    

The benchmark:

julia> @benchmark jmul!($C128_126, $A128_128, $B128_126) #Pure Julia
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.205 μs (0.00% GC)
  median time:      40.398 μs (0.00% GC)
  mean time:        41.286 μs (0.00% GC)
  maximum time:     89.556 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C128_126, $A128_128, $B128_126) #OpenBLAS
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.334 μs (0.00% GC)
  median time:      100.649 μs (0.00% GC)
  mean time:        102.273 μs (0.00% GC)
  maximum time:     174.168 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Looking at a very small size, multiplying 16x32 * 32x14 matrices takes about 140 ns. (I said 65 ns earlier, but that was because of a bug.)
Perfect O(n^3) scaling would thus give us

julia> (128 / 16) * (128 / 32) * (128 / 14) * 140 / 1e3
40.96

microseconds, so more or less perfect scaling.
MKL had a minimum of 36.549 μs, so not only is it a bit faster, but maybe my kernel is a little suboptimal?
That time would imply roughly 125 ns per kernel-sized block, so MKL must be doing something a little different.
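
For reference, that 125 ns figure is just MKL’s minimum time divided by the number of 16x32 * 32x14 kernel-sized blocks:

julia> 36.549e3 / ((128 / 16) * (128 / 32) * (128 / 14)) # ns per kernel-sized block, ≈ 124.9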

Bigger matrix setup (and answer confirmation):

julia> A800_900 = mrandn(16*50, 900);

julia> B900_840 = mrandn(900,14*60);

julia> C800_840 = mrandn(16*50,14*60);

julia> jmul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
  38.0743    -37.3638     -23.3069     43.6012   -14.9798   -22.6108   …    6.68674   31.0339    22.0678    -25.335    -12.3134    34.2845    
  -0.771707  -24.4486     -46.6449     37.0407    24.6849   -22.056         4.19191    6.73854  -18.4685     90.6574    34.1527   -41.9439    
  -4.7973    -56.2395      -0.369894   21.4934   -30.8941   -20.8357       21.4338    11.461    -26.1999     -8.51833   22.5552   -52.3642    
  21.6411     -4.72307     47.6128     16.3918    29.8575   -22.1772        6.87733    3.47535  -41.9409     -3.14536   39.851      6.89992   
  16.9652     -0.0380062   90.3321    -48.3799    55.2641     5.03981      -8.4407     9.44483  -10.5455     11.4134    22.1736    28.6386    
   1.10867    32.7442      -9.571      -4.61297   26.8078    35.2818   …  -25.8284    -1.502      0.323376   50.3406    -9.49194  -12.6995    
  53.7033     -2.69979     29.6596     21.8752   -42.6504   -56.8507      -44.3737   -18.3475    -9.93175   -26.6298    -7.92967   33.2311    
 -31.6189     14.6708       7.66491    34.2681    22.136    -34.5229       20.1327     1.23482   11.1025     16.8833   -20.2968    51.6909    
 -15.412     -11.2317      23.9317     24.7645   -29.6448   -13.797         3.52927   -8.74718    4.65412    42.6386    -6.51217    0.00954396
   3.85964   -28.4584     -33.407     -20.6061   -10.6759    -0.94623       4.65821    7.54115  -36.1106    -11.7496    11.2209    -1.79995   
   2.96696    46.217       -1.40911    75.0949    27.5513    -3.66848  …  -36.9966   -45.3326    16.136     -21.0145     5.47982  -11.8452    
  41.6961     25.8745      29.335       4.15438  -45.8787   -44.9912       67.5224   -24.6367    56.7467     10.4153   -10.4572   -18.5692    
  21.3235    -30.0029     -31.4225    -38.153    -29.2       37.1059       46.3853     6.42264    1.21527   -61.9367     1.3304    18.854     
  -3.50268    12.5908      47.5534      4.62536  -62.8162    -5.87803      24.3344   -64.3619   -54.6748    -18.0393     1.03131   -7.18129   
   3.10441    17.5996     -10.1619      6.87622   -1.52725  -25.9976      -17.2381    44.9151   -30.6112     36.949     25.7487    -4.64311   
   ⋮                                                          ⋮        ⋱               ⋮                                                      
  -0.25848   -26.5611     -21.7325     28.1408    -1.25443   25.9933   …   15.8921    42.1656   -34.5233    -56.8323   -24.6462    47.475     
   2.97608   -20.6029      38.6915      6.64966   45.6482    19.4757       20.9733   -24.903     21.1787    -30.8994   -16.2787   -40.509     
   4.02883   -34.8826      -3.34805   -38.1582    40.5341    64.5054       50.0794   -34.1333    -2.95293    18.2688   -11.8533   -38.5013    
  12.4209     -3.19724    -33.9069     36.8744   -28.0702    53.2995       13.0724    53.7317    14.6668     49.303     10.8247    51.3291    
 -18.5631    -24.0589      -6.52054    34.7071    22.1703   -18.5829       -3.24598   10.1572   -38.5507    -34.5536   -59.9946   -20.7777    
 -25.0644     18.3918     -36.5186     10.7124    82.628    -44.118    …  -39.3714     1.20289  -21.4417     22.7627     3.55199  -36.0679    
  15.1521    -24.1379     -18.4929    -29.8979    26.7423    24.24        -18.1427     9.34076   81.6854      1.93507  -68.5226   -30.7109    
  12.0685      2.69166     15.5109    -10.6751    13.3148    13.6762      -22.8099    34.7844     6.12369   -56.9632    25.2193    72.3292    
 -27.2328    -45.9552      30.7724     37.1681     6.97637    7.16855      -7.0608   -31.0684   -20.0999     -6.20643   27.8918   -41.7873    
  24.3027     69.0994       1.83236   -10.791      1.16588   15.7833       14.2337   -37.2954    12.9956     65.0574    23.6132    65.8518    
 -22.6034     -3.46183     21.5065    -73.0987    41.7081   -50.3692   …   42.7244    14.4162   -55.0306     26.7348    94.2211   -24.3135    
  30.195       8.71624    -11.5629    -17.1744    24.1736   -77.5655       69.8622     1.56408   27.9262    -12.5686    64.3182   -13.0658    
   4.81069   -40.5611     -28.4828      6.60299   59.8823    16.0372        8.16483   53.9061    -6.52424     3.28285  -35.407    -56.1281    
 -46.878       8.07851     23.708     -18.5416    44.5693    -7.28137      -5.50011  -15.2032    17.4343     39.6803    17.2223    -5.02152   
  11.9364      8.46358    -33.1727    -18.1967   -36.1323     1.66343       6.10104   28.2232    14.3468     58.7046    -7.21755  -75.5267    

julia> mul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
  38.0743    -37.3638     -23.3069     43.6012   -14.9798   -22.6108   …    6.68674   31.0339    22.0678    -25.335    -12.3134    34.2845    
  -0.771707  -24.4486     -46.6449     37.0407    24.6849   -22.056         4.19191    6.73854  -18.4685     90.6574    34.1527   -41.9439    
  -4.7973    -56.2395      -0.369894   21.4934   -30.8941   -20.8357       21.4338    11.461    -26.1999     -8.51833   22.5552   -52.3642    
  21.6411     -4.72307     47.6128     16.3918    29.8575   -22.1772        6.87733    3.47535  -41.9409     -3.14536   39.851      6.89992   
  16.9652     -0.0380062   90.3321    -48.3799    55.2641     5.03981      -8.4407     9.44483  -10.5455     11.4134    22.1736    28.6386    
   1.10867    32.7442      -9.571      -4.61297   26.8078    35.2818   …  -25.8284    -1.502      0.323376   50.3406    -9.49194  -12.6995    
  53.7033     -2.69979     29.6596     21.8752   -42.6504   -56.8507      -44.3737   -18.3475    -9.93175   -26.6298    -7.92967   33.2311    
 -31.6189     14.6708       7.66491    34.2681    22.136    -34.5229       20.1327     1.23482   11.1025     16.8833   -20.2968    51.6909    
 -15.412     -11.2317      23.9317     24.7645   -29.6448   -13.797         3.52927   -8.74718    4.65412    42.6386    -6.51217    0.00954396
   3.85964   -28.4584     -33.407     -20.6061   -10.6759    -0.94623       4.65821    7.54115  -36.1106    -11.7496    11.2209    -1.79995   
   2.96696    46.217       -1.40911    75.0949    27.5513    -3.66848  …  -36.9966   -45.3326    16.136     -21.0145     5.47982  -11.8452    
  41.6961     25.8745      29.335       4.15438  -45.8787   -44.9912       67.5224   -24.6367    56.7467     10.4153   -10.4572   -18.5692    
  21.3235    -30.0029     -31.4225    -38.153    -29.2       37.1059       46.3853     6.42264    1.21527   -61.9367     1.3304    18.854     
  -3.50268    12.5908      47.5534      4.62536  -62.8162    -5.87803      24.3344   -64.3619   -54.6748    -18.0393     1.03131   -7.18129   
   3.10441    17.5996     -10.1619      6.87622   -1.52725  -25.9976      -17.2381    44.9151   -30.6112     36.949     25.7487    -4.64311   
   ⋮                                                          ⋮        ⋱               ⋮                                                      
  -0.25848   -26.5611     -21.7325     28.1408    -1.25443   25.9933   …   15.8921    42.1656   -34.5233    -56.8323   -24.6462    47.475     
   2.97608   -20.6029      38.6915      6.64966   45.6482    19.4757       20.9733   -24.903     21.1787    -30.8994   -16.2787   -40.509     
   4.02883   -34.8826      -3.34805   -38.1582    40.5341    64.5054       50.0794   -34.1333    -2.95293    18.2688   -11.8533   -38.5013    
  12.4209     -3.19724    -33.9069     36.8744   -28.0702    53.2995       13.0724    53.7317    14.6668     49.303     10.8247    51.3291    
 -18.5631    -24.0589      -6.52054    34.7071    22.1703   -18.5829       -3.24598   10.1572   -38.5507    -34.5536   -59.9946   -20.7777    
 -25.0644     18.3918     -36.5186     10.7124    82.628    -44.118    …  -39.3714     1.20289  -21.4417     22.7627     3.55199  -36.0679    
  15.1521    -24.1379     -18.4929    -29.8979    26.7423    24.24        -18.1427     9.34076   81.6854      1.93507  -68.5226   -30.7109    
  12.0685      2.69166     15.5109    -10.6751    13.3148    13.6762      -22.8099    34.7844     6.12369   -56.9632    25.2193    72.3292    
 -27.2328    -45.9552      30.7724     37.1681     6.97637    7.16855      -7.0608   -31.0684   -20.0999     -6.20643   27.8918   -41.7873    
  24.3027     69.0994       1.83236   -10.791      1.16588   15.7833       14.2337   -37.2954    12.9956     65.0574    23.6132    65.8518    
 -22.6034     -3.46183     21.5065    -73.0987    41.7081   -50.3692   …   42.7244    14.4162   -55.0306     26.7348    94.2211   -24.3135    
  30.195       8.71624    -11.5629    -17.1744    24.1736   -77.5655       69.8622     1.56408   27.9262    -12.5686    64.3182   -13.0658    
   4.81069   -40.5611     -28.4828      6.60299   59.8823    16.0372        8.16483   53.9061    -6.52424     3.28285  -35.407    -56.1281    
 -46.878       8.07851     23.708     -18.5416    44.5693    -7.28137      -5.50011  -15.2032    17.4343     39.6803    17.2223    -5.02152   
  11.9364      8.46358    -33.1727    -18.1967   -36.1323     1.66343       6.10104   28.2232    14.3468     58.7046    -7.21755  -75.5267    

Benchmark:

julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.793 ms (0.00% GC)
  median time:      17.102 ms (0.00% GC)
  mean time:        17.165 ms (0.00% GC)
  maximum time:     19.265 ms (0.00% GC)
  --------------
  samples:          292
  evals/sample:     1

julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.235 ms (0.00% GC)
  median time:      23.628 ms (0.00% GC)
  mean time:        23.698 ms (0.00% GC)
  maximum time:     25.558 ms (0.00% GC)
  --------------
  samples:          211
  evals/sample:     1

We’re still faster than OpenBLAS!
Perfect scaling would give us:

julia> (800 / 16) * (900 / 32) * (840 / 14) * 140 / 1e6
11.8125

Just over 11.8 milliseconds, so at roughly 17 ms we’re now about 40–45% slower than ideal scaling.
MKL does it in 11.5 ms, so it still beats even the scaled-up kernel estimate, but only slightly.
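
Another way to see the same slowdown is the effective time per kernel-sized block at this size, compared with the ~140 ns measured for the kernel in isolation:

julia> 16.793e6 / ((800 / 16) * (900 / 32) * (840 / 14)) # jmul!: ≈ 199 ns per kernel-sized block

julia> 11.5e6 / ((800 / 16) * (900 / 32) * (840 / 14))   # MKL: ≈ 136 ns per kernel-sized block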

EDIT:
Ryzen results:

julia> @benchmark jmul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     137.429 μs (0.00% GC)
  median time:      138.641 μs (0.00% GC)
  mean time:        139.844 μs (0.00% GC)
  maximum time:     262.424 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     158.137 μs (0.00% GC)
  median time:      163.022 μs (0.00% GC)
  mean time:        163.917 μs (0.00% GC)
  maximum time:     215.655 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.583 ms (0.00% GC)
  median time:      44.016 ms (0.00% GC)
  mean time:        44.166 ms (0.00% GC)
  maximum time:     46.266 ms (0.00% GC)
  --------------
  samples:          114
  evals/sample:     1

julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.944 ms (0.00% GC)
  median time:      42.194 ms (0.00% GC)
  mean time:        42.289 ms (0.00% GC)
  maximum time:     43.389 ms (0.00% GC)
  --------------
  samples:          119
  evals/sample:     1

A little slower than OpenBLAS on Ryzen at the larger size, but pretty close.
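
For a rough sense of absolute throughput (counting 2*M*K*N floating-point operations per multiply), the minimum jmul! times above work out to approximately:

julia> flops = 2 * 800 * 900 * 840
1209600000

julia> flops / 16.793e-3 / 1e9   # AVX-512 machine: ≈ 72 GFLOPS

julia> flops / 43.583e-3 / 1e9   # Ryzen: ≈ 27.8 GFLOPS

That is about a 2.6x gap between the two machines, well short of the 4x per-core peak mentioned above, which fits the idea that the AVX-512 machine is more memory limited at this size.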

So:

So given the above, I am led to believe that a pure-Julia BLAS-like implementation might be realistic for things like matrix-matrix multiplication.

Definitely! It’s already highly competitive with OpenBLAS on the two architectures I tested: significantly faster on one, and barely slower on the other. I think the memory access can be optimized much better than it is now (i.e., clever recursive loading of submatrices). However, I am worried that the optimal memory-prefetching behavior may be very architecture dependent.
Threading would also probably be difficult? I have no idea.
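
As a rough illustration of the blocking idea, here is a standard cache-blocking sketch (not the jBLAS code, and the block sizes MC/KC/NC are guesses): copy panels of A and B into contiguous buffers sized for the cache, then run the inner kernel over the packed panels. The outer jc loop is also one plausible place to add threading, with per-thread buffers.

function blocked_mul!(C, A, B; MC = 128, KC = 256, NC = 512)
    M, N = size(C); K = size(A, 2)
    fill!(C, zero(eltype(C)))
    Abuf = Matrix{eltype(A)}(undef, MC, KC)   # packed panel of A
    Bbuf = Matrix{eltype(B)}(undef, KC, NC)   # packed panel of B
    for jc in 1:NC:N, kc in 1:KC:K
        nb, kb = min(NC, N - jc + 1), min(KC, K - kc + 1)
        copyto!(view(Bbuf, 1:kb, 1:nb), view(B, kc:kc+kb-1, jc:jc+nb-1))
        for ic in 1:MC:M
            mb = min(MC, M - ic + 1)
            copyto!(view(Abuf, 1:mb, 1:kb), view(A, ic:ic+mb-1, kc:kc+kb-1))
            # a register-blocked kernel would go here; a plain loop stands in for it
            @inbounds for j in 1:nb, k in 1:kb, i in 1:mb
                C[ic + i - 1, jc + j - 1] += Abuf[i, k] * Bbuf[k, j]
            end
        end
    end
    return C
end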

The bottom curve is for the matrix multiplication carried out for MMatrices (up to size 10). EDIT: one of the matrices (gradN) is an MMatrix always.

Given that, you could definitely improve my function a lot. For one, API-wise, you could get the number of columns from gradN instead of passing it as a Val. For another, we could split the outer loop into a variable number of pieces based on Kedim, with L changing for each piece as appropriate.
Or, as a lazy hack, just have it call the other function when Kedim is below some threshold (a sketch of that dispatch follows).
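
In code, the lazy hack could look something like the following. Everything here is hypothetical: mymul_small! and mymul! stand for the two functions being discussed, Kedim is assumed to be recoverable from gradN’s size, and the threshold of 8 is a guess that would need benchmarking.

function mymul_dispatch!(out, gradN, args...)
    Kedim = size(gradN, 2)   # read the size from gradN instead of passing a Val
    if Kedim < 8             # hypothetical threshold
        return mymul_small!(out, gradN, args...)
    else
        return mymul!(out, gradN, args...)
    end
end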
