Yes, I think we can write a BLAS in Julia.
I’ve been playing around with it for a little while. You can check out some work here:
Major issues:
- It only works for matrices whose dimensions are a multiple of the kernel size right now; I haven't filled in the missing edges yet. (A minimal sketch of the tiling idea follows this list.)
- It only works for MMatrices from StaticArrays. Once I finish the rest, I'll have a non-generated version that works for regular old arrays (and has probably just a tiny bit more overhead, because of calculating some things at runtime).
- I haven't played with kernels involving matrix transposes yet.
- I haven't implemented a recursive algorithm yet. I think that if I load submatrices recursively, I might be able to scale better. Current scaling is a lot worse than O(n^3) on my Intel processor with AVX-512, but only a tiny bit worse on Ryzen. The AVX-512 chip has 4x the per-core throughput of Ryzen, so the ratio of throughput to memory speed is a lot worse on Intel.
- No threading!
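To make the kernel-size limitation concrete, here is a minimal sketch of the tiling idea (an illustration, not the actual jBLAS code). C is updated in MR×NR tiles; in the real generated kernel each tile is fully unrolled so its accumulator stays in registers, and the K dimension is blocked as well (32 below), both of which this simple loop nest skips.

# Minimal tiling sketch; MR = 16 and NR = 14 match the kernel shape discussed below.
function blocked_mul!(C, A, B; MR = 16, NR = 14)
    M, N = size(C)
    K = size(A, 2)
    # The "missing edges" restriction: only exact multiples of the tile sizes.
    M % MR == 0 && N % NR == 0 || throw(DimensionMismatch("edges not handled"))
    fill!(C, zero(eltype(C)))
    @inbounds for jb in 0:NR:N-NR, ib in 0:MR:M-MR
        # One MR×NR "kernel" invocation, accumulating over the full K range.
        for k in 1:K, j in 1:NR
            b = B[k, jb + j]
            @simd for i in 1:MR
                C[ib + i, jb + j] += A[ib + i, k] * b
            end
        end
    end
    return C
end

For example, blocked_mul!(zeros(128, 126), randn(128, 128), randn(128, 126)) agrees with mul! on the sizes used below, since 128 = 8·16 and 126 = 9·14.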
That said, it looks good! Versus OpenBLAS with AVX-512, on modest-sized matrices:
julia> using OhMyREPL, jBLAS, BenchmarkTools, LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> A128_128 = mrandn(128,128); # WAY faster than @MMatrix randn(M, N) for M,N > TINY_NUMBER
julia> B128_126 = mrandn(128,126);
julia> C128_126 = mrandn(128,126);
julia> jmul!(C128_126, A128_128, B128_126) # Confirming the answer is correct
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
-7.01447 -2.98439 -5.74117 -9.92775 -1.63882 -17.4866 … 2.50444 -15.7145 -2.52175 -9.72732 8.23362 -0.0480611
-13.3085 8.11033 11.9918 -8.50466 -16.7253 10.7143 -18.5143 -2.83639 -7.44774 20.3162 21.9342 11.5031
-7.28199 -4.51168 -4.77706 1.99367 11.8847 -2.24961 6.93923 19.1472 -2.97989 11.0127 -9.27231 -12.1461
2.60148 -8.7429 -13.6254 -4.60391 -4.28139 9.45661 -8.99027 -8.84954 17.4719 5.47737 -0.729445 -0.889929
7.92216 16.7981 -25.3232 7.79537 -5.79957 15.1526 1.85937 -5.81069 6.26211 6.36528 -10.2083 -13.9899
0.979052 -3.23154 22.9183 1.58329 17.2893 -4.77779 … -1.71329 -0.345357 10.1351 11.725 17.9994 4.13137
-12.4654 -0.792605 -6.28939 11.8703 5.06421 -13.3586 17.5039 4.64551 0.710493 -13.2521 5.19574 4.90359
8.91004 13.639 3.06454 6.00014 -12.2289 20.0813 -2.97452 -0.23808 4.63956 -2.57015 1.46739 -23.3573
-16.5464 -2.53216 -1.79439 8.16519 -1.01636 -7.23788 5.60839 11.4347 3.56887 2.96409 -9.67572 -7.79263
16.0071 43.5603 4.85504 23.9241 -18.9683 10.9092 20.7221 -22.2934 -6.03801 9.84488 -8.57055 -22.9826
-5.39735 -6.35583 5.50225 -7.39742 -4.2174 -3.1449 … 9.08321 17.6362 -24.2872 -12.1334 -9.25208 18.5631
-0.732309 -1.07189 1.65641 14.5927 -6.79831 15.3076 21.3605 -1.77275 7.6672 -16.556 -31.9151 -6.92279
12.066 -17.7769 -4.3234 26.788 16.565 -6.90403 -12.7472 20.2385 -8.46922 5.89168 1.0592 7.93109
11.4157 -5.31638 -0.804631 -5.265 -7.54981 12.7636 5.82927 -2.67509 13.0376 11.4717 -7.15938 4.45602
-12.1544 7.99333 13.4734 -18.5341 -3.25063 4.44092 16.3537 8.65397 6.89471 -5.06701 0.826184 -2.90417
⋮ ⋮ ⋱ ⋮ ⋮
3.99748 -11.4429 2.66474 -24.4344 -2.51918 -1.79083 1.51211 15.0584 -23.3553 0.185678 13.1895 21.9612
6.95632 18.9656 -28.5299 12.5857 17.2015 8.48111 -7.30871 -0.775281 10.3668 2.97811 0.366781 -15.2664
0.95553 -11.2707 1.44373 10.3198 -15.4154 -4.585 … 17.7993 23.9214 3.86748 8.52783 17.1222 6.19255
7.9361 4.20605 9.85275 -11.5198 2.57681 -5.64149 -4.27823 -8.12214 -8.17309 -6.81521 -8.3202 -8.26316
-6.06597 3.94461 -9.33761 3.13537 8.77722 1.36968 13.7093 11.4851 1.92219 -3.03784 -16.6991 -4.27387
-5.22723 -11.1999 -3.65735 -0.39027 14.0611 -5.49591 8.30515 2.11561 -0.085943 2.98077 -9.54455 -1.59257
-17.5957 14.5879 -3.11054 -3.93381 -3.86949 15.4347 17.1275 16.5219 22.5019 5.06741 -12.1815 5.4818
14.2494 3.93928 0.326678 -1.06051 5.44846 23.2466 … -9.01874 -3.71429 15.6083 16.0184 -4.47404 -3.55087
-1.22372 7.79477 21.0122 20.1065 12.6066 -4.94802 5.17584 10.2214 -16.1313 7.29539 4.35832 6.49918
8.16857 2.32976 -8.22107 8.23127 -10.5885 -1.98484 0.685584 -8.93469 13.0756 -5.6306 2.49534 -8.93337
9.36717 -10.5582 -12.3869 2.48226 0.942078 -14.5936 -4.79657 -15.6375 -1.2756 0.011122 22.9251 18.1269
0.171453 15.7406 -3.94504 6.73699 -27.3994 19.8036 19.7026 -5.57216 21.47 10.456 -7.53643 6.64545
-1.82776 -3.34233 -8.33704 -11.4722 -1.40182 0.569771 … 4.55985 13.976 -4.01863 -4.04004 -6.5058 1.75928
-13.3649 -0.879979 8.87948 -16.163 0.670733 0.160372 10.6509 14.8881 8.81195 -9.0987 1.53641 25.0194
-1.51149 12.2888 13.5032 4.0348 -10.9137 -2.1944 11.5437 -13.8045 15.9694 -5.84135 -14.469 -6.024
julia> mul!(C128_126, A128_128, B128_126)
(output identical to the jmul! result above, confirming the answers agree)
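An aside on mrandn: I haven't shown its definition, but it's roughly equivalent to the sketch below. The assumption is that it fills an MMatrix in place at runtime, instead of letting @MMatrix randn(M, N) splice M*N element expressions into the generated code, which is what makes the macro so slow to compile at these sizes.

using StaticArrays, Random

# Assumed sketch, not the actual jBLAS definition: allocate uninitialized,
# then fill in place. Not type-stable in M and N, but fine for benchmark setup.
mrandn(M, N) = randn!(MMatrix{M,N,Float64}(undef))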
The benchmark:
julia> @benchmark jmul!($C128_126, $A128_128, $B128_126) #Pure Julia
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 40.205 μs (0.00% GC)
median time: 40.398 μs (0.00% GC)
mean time: 41.286 μs (0.00% GC)
maximum time: 89.556 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C128_126, $A128_128, $B128_126) #OpenBLAS
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 100.334 μs (0.00% GC)
median time: 100.649 μs (0.00% GC)
mean time: 102.273 μs (0.00% GC)
maximum time: 174.168 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Looking at a very small size: multiplying 16×32 by 32×14 matrices takes about 140 ns. (I said 65 ns earlier, but that was because of a bug.)
Perfect O(n^3) scaling would thus give us
julia> (128 / 16) * (128 / 32) * (128 / 14) * 140 / 1e3
40.96
microseconds, so more or less perfect scaling.
MKL had a minimum of 36.549 μs, so not only is it a bit faster, but maybe my kernel is a little suboptimal? That time would imply roughly 125 ns per kernel call, so they must be doing something a little different.
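To put rough numbers on that (my own back-of-the-envelope arithmetic, assuming MKL amortizes to the same 16×32 by 32×14 kernel shape):

2 * 16 * 32 * 14 / 140e-9 / 1e9                  # my kernel at 140 ns: ≈ 102 GFLOPS per core
36.549e3 * (16 / 128) * (32 / 128) * (14 / 128)  # MKL's implied time per kernel call: ≈ 125 ns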
Bigger matrix setup (and answer confirmation):
julia> A800_900 = mrandn(16*50, 900);
julia> B900_840 = mrandn(900,14*60);
julia> C800_840 = mrandn(16*50,14*60);
julia> jmul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
38.0743 -37.3638 -23.3069 43.6012 -14.9798 -22.6108 … 6.68674 31.0339 22.0678 -25.335 -12.3134 34.2845
-0.771707 -24.4486 -46.6449 37.0407 24.6849 -22.056 4.19191 6.73854 -18.4685 90.6574 34.1527 -41.9439
-4.7973 -56.2395 -0.369894 21.4934 -30.8941 -20.8357 21.4338 11.461 -26.1999 -8.51833 22.5552 -52.3642
21.6411 -4.72307 47.6128 16.3918 29.8575 -22.1772 6.87733 3.47535 -41.9409 -3.14536 39.851 6.89992
16.9652 -0.0380062 90.3321 -48.3799 55.2641 5.03981 -8.4407 9.44483 -10.5455 11.4134 22.1736 28.6386
1.10867 32.7442 -9.571 -4.61297 26.8078 35.2818 … -25.8284 -1.502 0.323376 50.3406 -9.49194 -12.6995
53.7033 -2.69979 29.6596 21.8752 -42.6504 -56.8507 -44.3737 -18.3475 -9.93175 -26.6298 -7.92967 33.2311
-31.6189 14.6708 7.66491 34.2681 22.136 -34.5229 20.1327 1.23482 11.1025 16.8833 -20.2968 51.6909
-15.412 -11.2317 23.9317 24.7645 -29.6448 -13.797 3.52927 -8.74718 4.65412 42.6386 -6.51217 0.00954396
3.85964 -28.4584 -33.407 -20.6061 -10.6759 -0.94623 4.65821 7.54115 -36.1106 -11.7496 11.2209 -1.79995
2.96696 46.217 -1.40911 75.0949 27.5513 -3.66848 … -36.9966 -45.3326 16.136 -21.0145 5.47982 -11.8452
41.6961 25.8745 29.335 4.15438 -45.8787 -44.9912 67.5224 -24.6367 56.7467 10.4153 -10.4572 -18.5692
21.3235 -30.0029 -31.4225 -38.153 -29.2 37.1059 46.3853 6.42264 1.21527 -61.9367 1.3304 18.854
-3.50268 12.5908 47.5534 4.62536 -62.8162 -5.87803 24.3344 -64.3619 -54.6748 -18.0393 1.03131 -7.18129
3.10441 17.5996 -10.1619 6.87622 -1.52725 -25.9976 -17.2381 44.9151 -30.6112 36.949 25.7487 -4.64311
⋮ ⋮ ⋱ ⋮
-0.25848 -26.5611 -21.7325 28.1408 -1.25443 25.9933 … 15.8921 42.1656 -34.5233 -56.8323 -24.6462 47.475
2.97608 -20.6029 38.6915 6.64966 45.6482 19.4757 20.9733 -24.903 21.1787 -30.8994 -16.2787 -40.509
4.02883 -34.8826 -3.34805 -38.1582 40.5341 64.5054 50.0794 -34.1333 -2.95293 18.2688 -11.8533 -38.5013
12.4209 -3.19724 -33.9069 36.8744 -28.0702 53.2995 13.0724 53.7317 14.6668 49.303 10.8247 51.3291
-18.5631 -24.0589 -6.52054 34.7071 22.1703 -18.5829 -3.24598 10.1572 -38.5507 -34.5536 -59.9946 -20.7777
-25.0644 18.3918 -36.5186 10.7124 82.628 -44.118 … -39.3714 1.20289 -21.4417 22.7627 3.55199 -36.0679
15.1521 -24.1379 -18.4929 -29.8979 26.7423 24.24 -18.1427 9.34076 81.6854 1.93507 -68.5226 -30.7109
12.0685 2.69166 15.5109 -10.6751 13.3148 13.6762 -22.8099 34.7844 6.12369 -56.9632 25.2193 72.3292
-27.2328 -45.9552 30.7724 37.1681 6.97637 7.16855 -7.0608 -31.0684 -20.0999 -6.20643 27.8918 -41.7873
24.3027 69.0994 1.83236 -10.791 1.16588 15.7833 14.2337 -37.2954 12.9956 65.0574 23.6132 65.8518
-22.6034 -3.46183 21.5065 -73.0987 41.7081 -50.3692 … 42.7244 14.4162 -55.0306 26.7348 94.2211 -24.3135
30.195 8.71624 -11.5629 -17.1744 24.1736 -77.5655 69.8622 1.56408 27.9262 -12.5686 64.3182 -13.0658
4.81069 -40.5611 -28.4828 6.60299 59.8823 16.0372 8.16483 53.9061 -6.52424 3.28285 -35.407 -56.1281
-46.878 8.07851 23.708 -18.5416 44.5693 -7.28137 -5.50011 -15.2032 17.4343 39.6803 17.2223 -5.02152
11.9364 8.46358 -33.1727 -18.1967 -36.1323 1.66343 6.10104 28.2232 14.3468 58.7046 -7.21755 -75.5267
julia> mul!(C800_840, A800_900, B900_840)
(output identical to the jmul! result above, confirming the answers agree)
Benchmark:
julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.793 ms (0.00% GC)
median time: 17.102 ms (0.00% GC)
mean time: 17.165 ms (0.00% GC)
maximum time: 19.265 ms (0.00% GC)
--------------
samples: 292
evals/sample: 1
julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 23.235 ms (0.00% GC)
median time: 23.628 ms (0.00% GC)
mean time: 23.698 ms (0.00% GC)
maximum time: 25.558 ms (0.00% GC)
--------------
samples: 211
evals/sample: 1
We’re still faster than OpenBLAS!
Perfect scaling would give us:
julia> (800 / 16) * (900 / 32) * (840 / 14) * 140 / 1e6
11.8125
Just over 11.8 milliseconds, so we're now roughly 40% slower than perfect scaling of the kernel.
MKL does it in 11.5 ms, so it still beats even perfect scaling of my kernel, but only slightly.
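In throughput terms, using the minimum times above (my arithmetic, not benchmark output):

2 * 800 * 900 * 840 / 16.793e-3 / 1e9   # jmul!:    ≈ 72 GFLOPS
2 * 800 * 900 * 840 / 23.235e-3 / 1e9   # OpenBLAS: ≈ 52 GFLOPS
2 * 800 * 900 * 840 / 11.5e-3 / 1e9     # MKL:      ≈ 105 GFLOPS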
EDIT:
Ryzen results:
julia> @benchmark jmul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 137.429 μs (0.00% GC)
median time: 138.641 μs (0.00% GC)
mean time: 139.844 μs (0.00% GC)
maximum time: 262.424 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 158.137 μs (0.00% GC)
median time: 163.022 μs (0.00% GC)
mean time: 163.917 μs (0.00% GC)
maximum time: 215.655 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.583 ms (0.00% GC)
median time: 44.016 ms (0.00% GC)
mean time: 44.166 ms (0.00% GC)
maximum time: 46.266 ms (0.00% GC)
--------------
samples: 114
evals/sample: 1
julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 41.944 ms (0.00% GC)
median time: 42.194 ms (0.00% GC)
mean time: 42.289 ms (0.00% GC)
maximum time: 43.389 ms (0.00% GC)
--------------
samples: 119
evals/sample: 1
A little slower than OpenBLAS on Ryzen at the larger size, but pretty close (and still faster at the smaller size).
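As ratios of the minimum times (my arithmetic):

137.429 / 158.137   # 128-sized problem: jmul! ≈ 13% faster than OpenBLAS on Ryzen
43.583 / 41.944     # 800×900×840 problem: jmul! ≈ 4% slower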
So:
> Given the above, I am led to believe that a pure-Julia BLAS-like implementation might be realistic for things like matrix-matrix multiplication.
Definitely! It's already highly competitive with OpenBLAS on the two architectures I tested: significantly faster on one, and barely slower on the other. I think the memory access can be optimized much better than it is now (i.e., clever recursive loading of submatrices; a sketch of the idea follows below). However, I am worried that the optimal memory pre-fetching behavior may be very architecture-dependent.
Threading would also probably be difficult? I have no idea.
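To illustrate the recursive idea (a cache-oblivious divide-and-conquer sketch, not the planned jBLAS implementation): halve the largest of the three dimensions until the subproblem is small, then hand off to a base-case kernel. base_mul! and the cutoff of 64 are placeholders.

using LinearAlgebra

# Base case: C .+= A*B via 5-arg mul!. The recursion accumulates into C,
# so zero C before the top-level call to get a plain product.
base_mul!(C, A, B) = mul!(C, A, B, true, true)

function rec_mul!(C, A, B; cutoff = 64)
    M, N = size(C)
    K = size(A, 2)
    if max(M, N, K) <= cutoff                 # small enough for the kernel
        base_mul!(C, A, B)
    elseif M >= N && M >= K                   # halve the rows of C and A
        h = M ÷ 2
        rec_mul!(view(C, 1:h, :), view(A, 1:h, :), B; cutoff = cutoff)
        rec_mul!(view(C, h+1:M, :), view(A, h+1:M, :), B; cutoff = cutoff)
    elseif N >= K                             # halve the columns of C and B
        h = N ÷ 2
        rec_mul!(view(C, :, 1:h), A, view(B, :, 1:h); cutoff = cutoff)
        rec_mul!(view(C, :, h+1:N), A, view(B, :, h+1:N); cutoff = cutoff)
    else                                      # halve the shared K dimension
        h = K ÷ 2
        rec_mul!(C, view(A, :, 1:h), view(B, 1:h, :); cutoff = cutoff)
        rec_mul!(C, view(A, :, h+1:K), view(B, h+1:K, :); cutoff = cutoff)
    end
    return C
end

The appeal is that the working set keeps shrinking until it fits in cache without hard-coding any cache sizes; whether hardware prefetchers like the resulting access pattern is the architecture-dependent part I'm worried about.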
The bottom curve is for the matrix multiplication carried out on MMatrices (up to size 10). EDIT: one of the matrices (gradN) is always an MMatrix.
Given that, you could definitely improve my function a lot. For one, API-wise: get the number of columns from gradN's type instead of passing it as a Val. For another, we could split the outer loop into a variable number of pieces based on Kedim, with L changing for each piece as appropriate. Or, as a lazy hack, just have it call the other function whenever Kedim is below some threshold.
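Roughly what I mean, as a sketch with placeholder names (ncols, small_mul!, big_mul!, dispatch_mul!, and the threshold of 8 are all stand-ins, not actual API):

using StaticArrays, LinearAlgebra

# (1) Recover the column count from the static type instead of a Val argument.
ncols(::StaticMatrix{M,N}) where {M,N} = N

# (2) The lazy hack: two placeholder methods standing in for the real ones,
# and a wrapper that picks between them at runtime.
small_mul!(C, A, B) = mul!(C, A, B)  # stand-in for the specialized small-size method
big_mul!(C, A, B)   = mul!(C, A, B)  # stand-in for the generated/blocked method

function dispatch_mul!(C, A, gradN, Kedim; threshold = 8)  # cutoff would need tuning
    Kedim < threshold ? small_mul!(C, A, gradN) : big_mul!(C, A, gradN)
end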