We can write an optimized BLAS library in pure Julia (please skip OP and jump to post 4)

There’s a readme with a lot of output dumped on it here:
https://github.com/chriselrod/TriangularMatrices.jl

But the highlights are that I found a fairly optimal “compute kernel” (at least for my architecture on Julia 0.7; performance is poor on Julia 0.6):

julia> using StaticArrays, TriangularMatrices, BenchmarkTools

julia> a8_16_t = srandmat(8,16); # static matrix type from TriangularMatrices

julia> x16_4_t = srandmat(16,4);

julia> a8_16_s = SMatrix{8,16}(a8_16_t.data); #convert to a StaticMatrix

julia> x16_4_s = SMatrix{16,4}(x16_4_t.data);

julia> @btime $a8_16_t * $x16_4_t
  32.495 ns (0 allocations: 0 bytes)
8×4 TriangularMatrices.StaticRecursiveMatrix{Float64,8,4,32}:
 10.1893    -5.8437    2.80989    0.279144 
 -5.93568    9.09442  -3.71548   -2.54874  
  1.17599    1.50367  -3.56322    7.24623  
 -9.47634    2.39815  -2.40341   -1.34133  
  3.54455    3.98109   8.13257   -1.42904  
  0.505972  -4.51573  -2.23145   -1.92717  
  2.26573    2.61974  -0.296927  -0.0811224
  3.19529    1.09088   1.15119   -3.62086  

julia> @btime $a8_16_s * $x16_4_s
  72.788 ns (0 allocations: 0 bytes)
8×4 SArray{Tuple{8,4},Float64,2,32}:
 10.1893    -5.8437    2.80989    0.279144 
 -5.93568    9.09442  -3.71548   -2.54874  
  1.17599    1.50367  -3.56322    7.24623  
 -9.47634    2.39815  -2.40341   -1.34133  
  3.54455    3.98109   8.13257   -1.42904  
  0.505972  -4.51573  -2.23145   -1.92717  
  2.26573    2.61974  -0.296927  -0.0811224
  3.19529    1.09088   1.15119   -3.62086  

The short of how this works, given registers of size 4 (e.g., AVX), is that for an 8xN matrix A and an NxP matrix X, the compiler uses a vbroadcastsd instruction to fill a register with the first value of X, fills two registers with the first 8 values of A, and multiplies them to create output d_1 (the initial first 4 elements of the output) and d_2 (the initial next 4 elements).
It then broadcasts the next element of X, loads the next 8 elements of A, and two fmas update d_1 and d_2.

To get this to happen on Julia 0.7, I just create a generated function that looks similar to this:

            D[1] = X[1] * A[1]
            D[2] = X[1] * A[2]
            D[3] = X[1] * A[3]
            D[4] = X[1] * A[4]
            D[5] = X[1] * A[5]
            D[6] = X[1] * A[6]
            D[7] = X[1] * A[7]
            D[8] = X[1] * A[8]
            D[1] += X[2] * A[9]
            D[2] += X[2] * A[10]
            D[3] += X[2] * A[11]
            D[4] += X[2] * A[12]
            D[5] += X[2] * A[13]
            D[6] += X[2] * A[14]
            D[7] += X[2] * A[15]
            D[8] += X[2] * A[16]
            D[1] += X[3] * A[17]

and Julia takes care of the rest. :grinning:
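
For the curious, a generated function that emits such an unrolled body can be sketched roughly like this (a minimal illustration handling only the first output column, not the actual TriangularMatrices code; a full kernel generator appears later in this thread):

using Base.Cartesian: @nexprs

# Hypothetical sketch: unroll the first output column of an 8×N times N×P product.
@generated function first_col_kernel!(D, A, X, ::Val{N}) where {N}
    quote
        @inbounds begin
            @nexprs 8 i -> D[i] = X[1] * A[i]          # initialize the 8 accumulators
            @nexprs $(N - 1) n -> begin                # remaining N-1 updates, fully unrolled
                @nexprs 8 i -> D[i] = muladd(X[n + 1], A[8*n + i], D[i])
            end
        end
        D
    end
end
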
My tests were done with:

julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)

Performance will almost certainly vary with other architectures.

To try and scale this up to larger matrices, I went with a recursive memory layout.
These are the linear indices of a 50x50 matrix:

julia> rm = RecursiveMatrix{Float64,50,50,50^2}();

julia> function set_to_ind!(A)
           @inbounds for i = 1:length(A)
               A[i] = i
           end
           A
       end
set_to_ind! (generic function with 1 method)

julia> set_to_ind!(rm)
50×50 TriangularMatrices.RecursiveMatrix{Float64,50,50,2500}:
   1.0    9.0   17.0   25.0   33.0   41.0   49.0   57.0  129.0  137.0  145.0  153.0  161.0  169.0  …  1377.0  1385.0  1585.0  1593.0  1601.0  1609.0  1617.0  1625.0  1633.0  1641.0  1649.0  1657.0
   2.0   10.0   18.0   26.0   34.0   42.0   50.0   58.0  130.0  138.0  146.0  154.0  162.0  170.0     1378.0  1386.0  1586.0  1594.0  1602.0  1610.0  1618.0  1626.0  1634.0  1642.0  1650.0  1658.0
   3.0   11.0   19.0   27.0   35.0   43.0   51.0   59.0  131.0  139.0  147.0  155.0  163.0  171.0     1379.0  1387.0  1587.0  1595.0  1603.0  1611.0  1619.0  1627.0  1635.0  1643.0  1651.0  1659.0
   4.0   12.0   20.0   28.0   36.0   44.0   52.0   60.0  132.0  140.0  148.0  156.0  164.0  172.0     1380.0  1388.0  1588.0  1596.0  1604.0  1612.0  1620.0  1628.0  1636.0  1644.0  1652.0  1660.0
   5.0   13.0   21.0   29.0   37.0   45.0   53.0   61.0  133.0  141.0  149.0  157.0  165.0  173.0     1381.0  1389.0  1589.0  1597.0  1605.0  1613.0  1621.0  1629.0  1637.0  1645.0  1653.0  1661.0
   6.0   14.0   22.0   30.0   38.0   46.0   54.0   62.0  134.0  142.0  150.0  158.0  166.0  174.0  …  1382.0  1390.0  1590.0  1598.0  1606.0  1614.0  1622.0  1630.0  1638.0  1646.0  1654.0  1662.0
   7.0   15.0   23.0   31.0   39.0   47.0   55.0   63.0  135.0  143.0  151.0  159.0  167.0  175.0     1383.0  1391.0  1591.0  1599.0  1607.0  1615.0  1623.0  1631.0  1639.0  1647.0  1655.0  1663.0
   8.0   16.0   24.0   32.0   40.0   48.0   56.0   64.0  136.0  144.0  152.0  160.0  168.0  176.0     1384.0  1392.0  1592.0  1600.0  1608.0  1616.0  1624.0  1632.0  1640.0  1648.0  1656.0  1664.0
  65.0   73.0   81.0   89.0   97.0  105.0  113.0  121.0  193.0  201.0  209.0  217.0  225.0  233.0     1441.0  1449.0  1665.0  1673.0  1681.0  1689.0  1697.0  1705.0  1713.0  1721.0  1729.0  1737.0
  66.0   74.0   82.0   90.0   98.0  106.0  114.0  122.0  194.0  202.0  210.0  218.0  226.0  234.0     1442.0  1450.0  1666.0  1674.0  1682.0  1690.0  1698.0  1706.0  1714.0  1722.0  1730.0  1738.0
  67.0   75.0   83.0   91.0   99.0  107.0  115.0  123.0  195.0  203.0  211.0  219.0  227.0  235.0  …  1443.0  1451.0  1667.0  1675.0  1683.0  1691.0  1699.0  1707.0  1715.0  1723.0  1731.0  1739.0
  68.0   76.0   84.0   92.0  100.0  108.0  116.0  124.0  196.0  204.0  212.0  220.0  228.0  236.0     1444.0  1452.0  1668.0  1676.0  1684.0  1692.0  1700.0  1708.0  1716.0  1724.0  1732.0  1740.0
  69.0   77.0   85.0   93.0  101.0  109.0  117.0  125.0  197.0  205.0  213.0  221.0  229.0  237.0     1445.0  1453.0  1669.0  1677.0  1685.0  1693.0  1701.0  1709.0  1717.0  1725.0  1733.0  1741.0
  70.0   78.0   86.0   94.0  102.0  110.0  118.0  126.0  198.0  206.0  214.0  222.0  230.0  238.0     1446.0  1454.0  1670.0  1678.0  1686.0  1694.0  1702.0  1710.0  1718.0  1726.0  1734.0  1742.0
  71.0   79.0   87.0   95.0  103.0  111.0  119.0  127.0  199.0  207.0  215.0  223.0  231.0  239.0     1447.0  1455.0  1671.0  1679.0  1687.0  1695.0  1703.0  1711.0  1719.0  1727.0  1735.0  1743.0
  72.0   80.0   88.0   96.0  104.0  112.0  120.0  128.0  200.0  208.0  216.0  224.0  232.0  240.0  …  1448.0  1456.0  1672.0  1680.0  1688.0  1696.0  1704.0  1712.0  1720.0  1728.0  1736.0  1744.0
 257.0  265.0  273.0  281.0  289.0  297.0  305.0  313.0  321.0  329.0  337.0  345.0  353.0  361.0     1569.0  1577.0  1745.0  1753.0  1761.0  1769.0  1777.0  1785.0  1793.0  1801.0  1809.0  1817.0
 258.0  266.0  274.0  282.0  290.0  298.0  306.0  314.0  322.0  330.0  338.0  346.0  354.0  362.0     1570.0  1578.0  1746.0  1754.0  1762.0  1770.0  1778.0  1786.0  1794.0  1802.0  1810.0  1818.0
 259.0  267.0  275.0  283.0  291.0  299.0  307.0  315.0  323.0  331.0  339.0  347.0  355.0  363.0     1571.0  1579.0  1747.0  1755.0  1763.0  1771.0  1779.0  1787.0  1795.0  1803.0  1811.0  1819.0
 260.0  268.0  276.0  284.0  292.0  300.0  308.0  316.0  324.0  332.0  340.0  348.0  356.0  364.0     1572.0  1580.0  1748.0  1756.0  1764.0  1772.0  1780.0  1788.0  1796.0  1804.0  1812.0  1820.0
 261.0  269.0  277.0  285.0  293.0  301.0  309.0  317.0  325.0  333.0  341.0  349.0  357.0  365.0  …  1573.0  1581.0  1749.0  1757.0  1765.0  1773.0  1781.0  1789.0  1797.0  1805.0  1813.0  1821.0
   ⋮                                  ⋮                                  ⋮                         ⋱                     ⋮                                       ⋮                                  
 582.0  590.0  598.0  606.0  614.0  622.0  630.0  638.0  710.0  718.0  726.0  734.0  742.0  750.0     2006.0  2014.0  2246.0  2254.0  2262.0  2270.0  2278.0  2286.0  2294.0  2302.0  2310.0  2318.0
 583.0  591.0  599.0  607.0  615.0  623.0  631.0  639.0  711.0  719.0  727.0  735.0  743.0  751.0  …  2007.0  2015.0  2247.0  2255.0  2263.0  2271.0  2279.0  2287.0  2295.0  2303.0  2311.0  2319.0
 584.0  592.0  600.0  608.0  616.0  624.0  632.0  640.0  712.0  720.0  728.0  736.0  744.0  752.0     2008.0  2016.0  2248.0  2256.0  2264.0  2272.0  2280.0  2288.0  2296.0  2304.0  2312.0  2320.0
 641.0  649.0  657.0  665.0  673.0  681.0  689.0  697.0  769.0  777.0  785.0  793.0  801.0  809.0     2065.0  2073.0  2321.0  2329.0  2337.0  2345.0  2353.0  2361.0  2369.0  2377.0  2385.0  2393.0
 642.0  650.0  658.0  666.0  674.0  682.0  690.0  698.0  770.0  778.0  786.0  794.0  802.0  810.0     2066.0  2074.0  2322.0  2330.0  2338.0  2346.0  2354.0  2362.0  2370.0  2378.0  2386.0  2394.0
 643.0  651.0  659.0  667.0  675.0  683.0  691.0  699.0  771.0  779.0  787.0  795.0  803.0  811.0     2067.0  2075.0  2323.0  2331.0  2339.0  2347.0  2355.0  2363.0  2371.0  2379.0  2387.0  2395.0
 644.0  652.0  660.0  668.0  676.0  684.0  692.0  700.0  772.0  780.0  788.0  796.0  804.0  812.0  …  2068.0  2076.0  2324.0  2332.0  2340.0  2348.0  2356.0  2364.0  2372.0  2380.0  2388.0  2396.0
 645.0  653.0  661.0  669.0  677.0  685.0  693.0  701.0  773.0  781.0  789.0  797.0  805.0  813.0     2069.0  2077.0  2325.0  2333.0  2341.0  2349.0  2357.0  2365.0  2373.0  2381.0  2389.0  2397.0
 646.0  654.0  662.0  670.0  678.0  686.0  694.0  702.0  774.0  782.0  790.0  798.0  806.0  814.0     2070.0  2078.0  2326.0  2334.0  2342.0  2350.0  2358.0  2366.0  2374.0  2382.0  2390.0  2398.0
 647.0  655.0  663.0  671.0  679.0  687.0  695.0  703.0  775.0  783.0  791.0  799.0  807.0  815.0     2071.0  2079.0  2327.0  2335.0  2343.0  2351.0  2359.0  2367.0  2375.0  2383.0  2391.0  2399.0
 648.0  656.0  664.0  672.0  680.0  688.0  696.0  704.0  776.0  784.0  792.0  800.0  808.0  816.0     2072.0  2080.0  2328.0  2336.0  2344.0  2352.0  2360.0  2368.0  2376.0  2384.0  2392.0  2400.0
 833.0  841.0  849.0  857.0  865.0  873.0  881.0  889.0  913.0  921.0  929.0  937.0  945.0  953.0  …  2209.0  2217.0  2401.0  2409.0  2417.0  2425.0  2433.0  2441.0  2449.0  2457.0  2481.0  2489.0
 834.0  842.0  850.0  858.0  866.0  874.0  882.0  890.0  914.0  922.0  930.0  938.0  946.0  954.0     2210.0  2218.0  2402.0  2410.0  2418.0  2426.0  2434.0  2442.0  2450.0  2458.0  2482.0  2490.0
 835.0  843.0  851.0  859.0  867.0  875.0  883.0  891.0  915.0  923.0  931.0  939.0  947.0  955.0     2211.0  2219.0  2403.0  2411.0  2419.0  2427.0  2435.0  2443.0  2451.0  2459.0  2483.0  2491.0
 836.0  844.0  852.0  860.0  868.0  876.0  884.0  892.0  916.0  924.0  932.0  940.0  948.0  956.0     2212.0  2220.0  2404.0  2412.0  2420.0  2428.0  2436.0  2444.0  2452.0  2460.0  2484.0  2492.0
 837.0  845.0  853.0  861.0  869.0  877.0  885.0  893.0  917.0  925.0  933.0  941.0  949.0  957.0     2213.0  2221.0  2405.0  2413.0  2421.0  2429.0  2437.0  2445.0  2453.0  2461.0  2485.0  2493.0
 838.0  846.0  854.0  862.0  870.0  878.0  886.0  894.0  918.0  926.0  934.0  942.0  950.0  958.0  …  2214.0  2222.0  2406.0  2414.0  2422.0  2430.0  2438.0  2446.0  2454.0  2462.0  2486.0  2494.0
 839.0  847.0  855.0  863.0  871.0  879.0  887.0  895.0  919.0  927.0  935.0  943.0  951.0  959.0     2215.0  2223.0  2407.0  2415.0  2423.0  2431.0  2439.0  2447.0  2455.0  2463.0  2487.0  2495.0
 840.0  848.0  856.0  864.0  872.0  880.0  888.0  896.0  920.0  928.0  936.0  944.0  952.0  960.0     2216.0  2224.0  2408.0  2416.0  2424.0  2432.0  2440.0  2448.0  2456.0  2464.0  2488.0  2496.0
 897.0  899.0  901.0  903.0  905.0  907.0  909.0  911.0  977.0  979.0  981.0  983.0  985.0  987.0     2237.0  2239.0  2465.0  2467.0  2469.0  2471.0  2473.0  2475.0  2477.0  2479.0  2497.0  2499.0
 898.0  900.0  902.0  904.0  906.0  908.0  910.0  912.0  978.0  980.0  982.0  984.0  986.0  988.0     2238.0  2240.0  2466.0  2468.0  2470.0  2472.0  2474.0  2476.0  2478.0  2480.0  2498.0  2500.0

This makes it easier to load blocks of the matrix and apply our kernel.
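
To illustrate the general idea (this is a simplified two-level layout, not the actual recursive scheme the package uses), a blocked layout that stores each 8×8 block contiguously, with the blocks themselves ordered column-major, could map Cartesian to linear indices like this:

# Simplified sketch of a blocked linear index; assumes nrows is padded to a multiple of 8.
function blocked_index(i, j, nrows)
    bi, ri = divrem(i - 1, 8)              # block row, row within the block
    bj, rj = divrem(j - 1, 8)              # block column, column within the block
    blocks_per_col = cld(nrows, 8)
    block = bi + bj * blocks_per_col       # blocks stored column-major
    return block * 64 + rj * 8 + ri + 1    # 64 elements per 8×8 block, 1-based
end

The payoff is the same as with the recursive layout: a kernel can stream through a block with unit stride.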

Finally, a dump of random benchmarks:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.034 μs (0.00% GC)
  median time:      1.067 μs (0.00% GC)
  mean time:        1.072 μs (0.00% GC)
  maximum time:     3.270 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.843 μs (0.00% GC)
  median time:      1.890 μs (0.00% GC)
  mean time:        1.898 μs (0.00% GC)
  maximum time:     3.617 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.315 μs (0.00% GC)
  median time:      2.393 μs (0.00% GC)
  mean time:        2.402 μs (0.00% GC)
  maximum time:     6.209 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.388 μs (0.00% GC)
  median time:      4.493 μs (0.00% GC)
  mean time:        4.511 μs (0.00% GC)
  maximum time:     7.435 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.236 μs (0.00% GC)
  median time:      19.647 μs (0.00% GC)
  mean time:        19.791 μs (0.00% GC)
  maximum time:     36.639 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.466 μs (0.00% GC)
  median time:      25.017 μs (0.00% GC)
  mean time:        25.129 μs (0.00% GC)
  maximum time:     46.377 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.815 μs (0.00% GC)
  median time:      35.888 μs (0.00% GC)
  mean time:        35.997 μs (0.00% GC)
  maximum time:     66.545 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.643 μs (0.00% GC)
  median time:      34.225 μs (0.00% GC)
  mean time:        34.370 μs (0.00% GC)
  maximum time:     59.883 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.282 μs (0.00% GC)
  median time:      28.073 μs (0.00% GC)
  mean time:        28.121 μs (0.00% GC)
  maximum time:     48.521 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.858 μs (0.00% GC)
  median time:      31.550 μs (0.00% GC)
  mean time:        31.676 μs (0.00% GC)
  maximum time:     56.095 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     153.478 μs (0.00% GC)
  median time:      157.587 μs (0.00% GC)
  mean time:        158.263 μs (0.00% GC)
  maximum time:     180.910 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     159.730 μs (0.00% GC)
  median time:      162.826 μs (0.00% GC)
  mean time:        163.159 μs (0.00% GC)
  maximum time:     219.483 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.241 ms (0.00% GC)
  median time:      1.273 ms (0.00% GC)
  mean time:        1.276 ms (0.00% GC)
  maximum time:     1.406 ms (0.00% GC)
  --------------
  samples:          3916
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.180 ms (0.00% GC)
  median time:      1.208 ms (0.00% GC)
  mean time:        1.212 ms (0.00% GC)
  maximum time:     1.376 ms (0.00% GC)
  --------------
  samples:          4123
  evals/sample:     1

Disclaimer: the library needs a LOT of polish.
There are definite issues at the moment, and because the recursive layout doesn't mix too well with column- or row-major matrices, it would need enough functionality that for most purposes someone could work almost entirely with these types.
Tests segfault eventually. Printing "large" matrices doesn't work, nor does converting them into regular matrices; the compiler seems to hang while compiling the getindex function.
Performance drops when matrix sizes aren't divisible by 8.
No support yet for multiplication with matrices where sqrt(2)*min(size(A)...) > max(size(A)...).

I am taking a break from this, possibly through early August, because I'm running into deadlines (and maybe I'll write a blog post). FYI, gfortran could not compile the kernel well – its version took in the area of 120 ns! :wink: (Flang, a Fortran front end to LLVM, could.) Otherwise, it could be worth statically compiling a few kernels.

Thought some members would be interested in seeing this, though, because Julia was able to keep pace with an optimized BLAS library for quite a range of sizes. I think with some work, it should be possible to get a pure-Julia BLAS well supported.

25 Likes

That looks exciting! See also this earlier thread.

3 Likes

There’s definitely a lot of performance left on the table. I’ll comb through Goto’s paper, and the links in that thread when I turn back to this.

For now, tests on Haswell vs OpenBLAS and MKL:

julia> versioninfo()
Julia Version 0.7.0-alpha.126
Commit 0e4493d* (2018-06-14 13:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

The gist of the results is that OpenBLAS did a little better relative to my recursive implementation on Haswell, and MKL crushed it – evidence there's a lot of room for performance optimization, even before multithreading.

First, OpenBLAS:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     709.936 ns (0.00% GC)
  median time:      710.721 ns (0.00% GC)
  mean time:        713.279 ns (0.00% GC)
  maximum time:     1.110 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     140

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.697 μs (0.00% GC)
  median time:      1.720 μs (0.00% GC)
  mean time:        1.721 μs (0.00% GC)
  maximum time:     3.519 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.819 μs (0.00% GC)
  median time:      1.823 μs (0.00% GC)
  mean time:        1.825 μs (0.00% GC)
  maximum time:     3.026 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.579 μs (0.00% GC)
  median time:      3.609 μs (0.00% GC)
  mean time:        3.612 μs (0.00% GC)
  maximum time:     7.539 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.067 μs (0.00% GC)
  median time:      15.128 μs (0.00% GC)
  mean time:        15.145 μs (0.00% GC)
  maximum time:     31.756 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.137 μs (0.00% GC)
  median time:      19.261 μs (0.00% GC)
  mean time:        19.278 μs (0.00% GC)
  maximum time:     46.949 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.554 μs (0.00% GC)
  median time:      31.715 μs (0.00% GC)
  mean time:        31.747 μs (0.00% GC)
  maximum time:     71.165 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.082 μs (0.00% GC)
  median time:      29.231 μs (0.00% GC)
  mean time:        29.360 μs (0.00% GC)
  maximum time:     67.414 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.027 μs (0.00% GC)
  median time:      22.117 μs (0.00% GC)
  mean time:        22.138 μs (0.00% GC)
  maximum time:     45.415 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.559 μs (0.00% GC)
  median time:      24.680 μs (0.00% GC)
  mean time:        24.702 μs (0.00% GC)
  maximum time:     52.463 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.301 μs (0.00% GC)
  median time:      123.649 μs (0.00% GC)
  mean time:        124.203 μs (0.00% GC)
  maximum time:     195.992 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     117.952 μs (0.00% GC)
  median time:      118.316 μs (0.00% GC)
  mean time:        118.858 μs (0.00% GC)
  maximum time:     215.866 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.004 ms (0.00% GC)
  median time:      1.005 ms (0.00% GC)
  mean time:        1.007 ms (0.00% GC)
  maximum time:     1.209 ms (0.00% GC)
  --------------
  samples:          4960
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     793.676 μs (0.00% GC)
  median time:      794.787 μs (0.00% GC)
  mean time:        799.578 μs (0.00% GC)
  maximum time:     1.524 ms (0.00% GC)
  --------------
  samples:          6246
  evals/sample:     1

and now, MKL:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     716.145 ns (0.00% GC)
  median time:      717.210 ns (0.00% GC)
  mean time:        717.799 ns (0.00% GC)
  maximum time:     1.063 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     138

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     775.528 ns (0.00% GC)
  median time:      778.704 ns (0.00% GC)
  mean time:        781.224 ns (0.00% GC)
  maximum time:     1.283 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     108

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.808 μs (0.00% GC)
  median time:      1.818 μs (0.00% GC)
  mean time:        1.826 μs (0.00% GC)
  maximum time:     4.082 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.620 μs (0.00% GC)
  median time:      1.635 μs (0.00% GC)
  mean time:        1.640 μs (0.00% GC)
  maximum time:     4.065 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.166 μs (0.00% GC)
  median time:      15.234 μs (0.00% GC)
  mean time:        15.285 μs (0.00% GC)
  maximum time:     47.069 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.037 μs (0.00% GC)
  median time:      11.142 μs (0.00% GC)
  mean time:        11.183 μs (0.00% GC)
  maximum time:     38.379 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.784 μs (0.00% GC)
  median time:      31.974 μs (0.00% GC)
  mean time:        32.073 μs (0.00% GC)
  maximum time:     70.863 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.998 μs (0.00% GC)
  median time:      18.165 μs (0.00% GC)
  mean time:        18.340 μs (0.00% GC)
  maximum time:     55.012 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.862 μs (0.00% GC)
  median time:      21.953 μs (0.00% GC)
  mean time:        22.026 μs (0.00% GC)
  maximum time:     44.255 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.762 μs (0.00% GC)
  median time:      16.854 μs (0.00% GC)
  mean time:        16.898 μs (0.00% GC)
  maximum time:     52.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.192 μs (0.00% GC)
  median time:      123.682 μs (0.00% GC)
  mean time:        123.911 μs (0.00% GC)
  maximum time:     170.757 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     88.973 μs (0.00% GC)
  median time:      89.356 μs (0.00% GC)
  mean time:        89.715 μs (0.00% GC)
  maximum time:     157.206 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     999.054 μs (0.00% GC)
  median time:      1.000 ms (0.00% GC)
  mean time:        1.003 ms (0.00% GC)
  maximum time:     1.109 ms (0.00% GC)
  --------------
  samples:          4981
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     674.136 μs (0.00% GC)
  median time:      678.050 μs (0.00% GC)
  mean time:        680.462 μs (0.00% GC)
  maximum time:     948.319 μs (0.00% GC)
  --------------
  samples:          7339
  evals/sample:     1

This is the same old story of OpenBLAS vs MKL. They both do well when matrices are big, but MKL does better for small matrices (and factorizations).

2 Likes

Serious updates here!

I've abandoned the recursive memory layout and am now using:

jBLAS.jl
SIMDArrays.jl

They require Julia 1.0 or 0.7.

If anyone wants to try them, you may have to ]dev SIMDArrays, and then, after activating SIMDArrays, dev or add jBLAS from GitHub. I'm not sure it will point to the correct place as is.

Following a suggestion by @RoyiAvital in Strange performance of a loop - #36 by RoyiAvital, I am padding the number of elements in the first axis to a multiple of the SIMD vector length.
I use CpuId.jl for info on SIMD vector length, cache sizes, etc. However, I haven’t actually tested it on other architectures yet to make sure performance is as it should be.
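
To make the padding concrete, here is a sketch of the computation (not package code; the 8 assumes 512-bit registers and Float64, where the real value comes from CpuId.jl):

simd_len = 8                                  # REGISTER_SIZE ÷ sizeof(Float64) on AVX-512
padded_rows(M) = cld(M, simd_len) * simd_len  # round the row count up to a multiple of simd_len
padded_rows(5)                                # == 8, so a 5×5 matrix gets 3 zero-padded rows per column
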
Before moving on to benchmarks vs StaticArrays' MMatrices and Intel MKL, I want to point out that this issue: Julia one third slower than ccall-ing `@code_native` assembly compiled with gcc? is seemingly solved by declaring the matrices const and not interpolating them into the benchmarked expression. When I do that, the Julia functions become as fast as the equivalent ccall. So, on that note, a few helper functions:

using SIMDArrays, StaticArrays, BenchmarkTools, LinearAlgebra
function copyloop!(x, y) # Broadcasts slow to compile at the moment; will fix when I implement broadcasting!
    @boundscheck size(x) == size(y) || throw(BoundsError())
    @inbounds for i ∈ eachindex(x,y)
        x[i] = y[i]
    end
end
function Base.convert(::Type{MMatrix}, A::SIMDArrays.SizedSIMDMatrix{M,N,T}) where {M,N,T}
    out = MMatrix{M,N,T}(undef)
    copyloop!(out, A)
    out
end
function create_constants(M,N,P)
    dsym, asym, xsym = Symbol(:D_, M,:_,P), Symbol(:A_, M,:_,N), Symbol(:X_, N,:_,P)
    @eval const $dsym, $asym, $xsym = SIMDArrays.randsimd($M,$P), SIMDArrays.randsimd($M,$N), SIMDArrays.randsimd($N,$P)
    @eval const $(Symbol(:m,dsym)), $(Symbol(:m,asym)), $(Symbol(:m,xsym)) = convert(MMatrix, $dsym), convert(MMatrix, $asym), convert(MMatrix, $xsym)
    nothing
end

Notationally, let D = A * X, where D is M x P, A is M x N, and X is N x P.
Now on to the benchmarks (on a computer with AVX-512):

julia> create_constants(5,5,5)

julia> @btime mul!(mD_5_5, mA_5_5, mX_5_5) #multiplying StaticArrays MMatrices
  20.477 ns (0 allocations: 0 bytes)
5×5 MArray{Tuple{5,5},Float64,2,25}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> @btime mA_5_5 * mX_5_5 #multiplying StaticArrays MMatrices creating stack allocated StaticMatrix
  23.261 ns (0 allocations: 0 bytes)
5×5 SArray{Tuple{5,5},Float64,2,25}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> @btime mul!(D_5_5, A_5_5, X_5_5); D_5_5 #multiplying SIMDArrays
  6.582 ns (0 allocations: 0 bytes)
5×5 SIMDArrays.SizedSIMDArray{Tuple{5,5},Float64,2,8,40}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> D_5_5.data
(1.66744138262064, 1.740048608459989, 1.415625303408632, 1.7961078018359855, 1.5975750881580377, 0.0, 0.0, 0.0, 1.356668103422378, 1.5939210629707454, 1.769720982419278, 1.4072025619380528, 1.8710277860227025, 0.0, 0.0, 0.0, 1.3478852456667783, 1.2191309243468347, 1.9058890396186112, 1.091636798493075, 2.001191775015244, 0.0, 0.0, 0.0, 0.9331304568989827, 0.9069123309829086, 0.8891401784342474, 0.9764296003231432, 0.9147076423296147, 0.0, 0.0, 0.0, 1.130651372750223, 0.814634811020856, 1.4055450158840463, 1.0388080685580772, 1.4003242549924564, 0.0, 0.0, 0.0)

In the last line, I print the tuple of elements inside a 5 x 5 SIMDMatrix, so you can see that each column is padded with 3 0s. I need to initialize the padded memory, otherwise subnormals will make multiplication very slow. I didn’t try set_zero_subnormals(true), because I didn’t want to have to change the global state.
Thanks to this padding, performance remains stable as we increase M. MMatrices do a little better once we reach 8 x 5, probably because it is easier to vectorize:

julia> create_constants(8,5,5)
WARNING: redefining constant X_5_5
WARNING: redefining constant mX_5_5

julia> @btime mul!(mD_8_5, mA_8_5, mX_5_5); #multiplying StaticArrays MMatrices
  17.613 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_8_5, A_8_5, X_5_5); #multiplying SIMDArrays
  6.587 ns (0 allocations: 0 bytes)

Make the matrices a little larger, and StaticArrays begins to struggle:

julia> create_constants(5,8,8)

julia> @btime mul!(mD_5_8, mA_5_8, mX_8_8); #multiplying StaticArrays MMatrices
  59.367 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_5_8, A_5_8, X_8_8); #multiplying SIMDArrays
  14.363 ns (0 allocations: 0 bytes)

Now at the boundary of where MMatrices fall back to a BLAS call:

julia> BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> create_constants(14,13,14)

julia> @btime mul!(mD_14_14, mA_14_13, mX_13_14); #multiplying StaticArrays MMatrices
  441.764 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_14_14, A_14_13, X_13_14); #multiplying SIMDArrays
  55.935 ns (0 allocations: 0 bytes)

julia> create_constants(14,14,14)
WARNING: redefining constant D_14_14
WARNING: redefining constant mD_14_14

julia> @btime mul!(mD_14_14, mA_14_14, mX_14_14); #multiplying with MKL
  227.986 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_14_14, A_14_14, X_14_14); #multiplying SIMDArrays
  59.304 ns (0 allocations: 0 bytes)

Now, trying a few different sizes:

julia> create_constants(24,24,24);

julia> @btime mul!(mD_24_24, mA_24_24, mX_24_24);#multiplying with MKL
  404.270 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_24_24, A_24_24, X_24_24); #multiplying SIMDArrays
  247.178 ns (0 allocations: 0 bytes)

julia> create_constants(49,49,49);

julia> @btime mul!(mD_49_49, mA_49_49, mX_49_49);#multiplying with MKL
  2.858 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_49_49, A_49_49, X_49_49); #multiplying SIMDArrays
  2.348 μs (0 allocations: 0 bytes)

julia> create_constants(128,128,128);

julia> @btime mul!(mD_128_128, mA_128_128, mX_128_128);#multiplying with MKL
  36.124 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_128_128, A_128_128, X_128_128); #multiplying SIMDArrays
  35.616 μs (0 allocations: 0 bytes)

julia> create_constants(200,200,200);

julia> @btime mul!(mD_200_200, mA_200_200, mX_200_200);#multiplying with MKL
  161.158 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_200_200, A_200_200, X_200_200); #multiplying SIMDArrays
  147.202 μs (0 allocations: 0 bytes)

To put these times in perspective, this is what I get from OpenBLAS:

julia> using LinearAlgebra, BenchmarkTools

julia> const D, A, X = randn(200,200), randn(200,200), randn(200,200);

julia> BLAS.set_num_threads(1); BLAS.vendor()
:openblas64

julia> @benchmark mul!(D, A, X)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     335.363 μs (0.00% GC)
  median time:      339.197 μs (0.00% GC)
  mean time:        345.639 μs (0.00% GC)
  maximum time:     488.796 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Beyond this size, this computer starts running out of L2 cache, and I haven’t tested that nearly as much. It’s likely I’ll need to implement prefetching.

julia> using jBLAS: blocking_structure, CACHE_SIZE, REGISTER_SIZE

julia> CACHE_SIZE .÷ sizeof(Float64)
(4096, 131072, 1802240)

julia> 3 * 204 * 200
122400

It uses the cache sizes to determine the blocking structure. For example, letting M, N, P = 749, 777, 832, we get:

julia> simd_len = REGISTER_SIZE ÷ sizeof(Float64)
8

julia> blocking_structure(cld(749,simd_len)*simd_len,777,832,Float64) # cld(x,vl)*vl rounds up to nearest vl
(((16, 129, 14), (256, 129, 252), (752, 777, 796)), 3)

This describes the blocking structure. The first triple gives the shape of the compute kernel, 16 x 129 x 14; this is small enough to fit in the L1 cache. The next triple describes the size of the blocks for the L2 cache, which it will iterate over, replacing only one chunk at a time. That is, it goes in (1,1),(2,1),(3,1),(3,2),(1,2),(2,2)... order, not (1,1),(2,1),(3,1),(1,2),(2,2),(3,2)..., to help reduce how much swapping of memory we have to do between the L1 and L2 caches.
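
As a quick sanity check of that claim (my own arithmetic, reading the first triple as the M×N×P shape of the kernel-level blocks of A, X, and D):

16*129 + 129*14 + 16*14   # = 2064 + 1806 + 224 = 4094 doubles ≤ the 4096-double L1 size shown above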

Anyway, I think we can get a lot of benefits from a Julia BLAS library.
As I hopefully demonstrated here, it offers the possibility of being dramatically faster than the current fast choices (StaticArrays, OpenBLAS) across small and, so far, medium sizes.
I think it also has the potential to adapt to new architectures quickly, so long as they're supported by CpuId.jl. It should also ease things like:

  1. Supporting different types. How fast could complex matrix multiplication be when using StructsOfArrays.jl? Note that things aren't fully generic: a matrix whose elements are stored real1, imaginary1, real2, imaginary2... still isn't an efficient layout for vectorization, but tricks like StructsOfArrays solve that.
  2. We can easily implement a diversity of kernels, like (A - B) * (X + Y), which could let us try an efficient go at the Strassen algorithm.
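
As a toy illustration of that last point, one level of Strassen looks like this when written with ordinary multiplies; fused (A ± B) * (X ± Y) kernels would compute the seven products without separate passes over the sums and differences (a sketch only, assuming square matrices with even dimension, not code from the packages above):

function strassen_step(A, X)
    n = size(A, 1) ÷ 2
    A11, A21, A12, A22 = A[1:n, 1:n], A[n+1:end, 1:n], A[1:n, n+1:end], A[n+1:end, n+1:end]
    X11, X21, X12, X22 = X[1:n, 1:n], X[n+1:end, 1:n], X[1:n, n+1:end], X[n+1:end, n+1:end]
    M1 = (A11 + A22) * (X11 + X22)   # each of these seven products is a candidate for a fused kernel
    M2 = (A21 + A22) * X11
    M3 = A11 * (X12 - X22)
    M4 = A22 * (X21 - X11)
    M5 = (A11 + A12) * X22
    M6 = (A21 - A11) * (X11 + X12)
    M7 = (A12 - A22) * (X21 + X22)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    [C11 C12; C21 C22]
end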

While the matrices are all currently typed based on size, all these matrix operations are quick to compile. I haven’t benchmarked the functions that figure out the blocking structure, but that sort of stuff ought to be optimized if it’s all done at runtime.

39 Likes

This is beautiful!
The ease with which one can take advantage of the CPU with Julia is one of its greatest strengths.

I really want to see a comparison to MKL Low Overhead mode.
I bet that you’d beat them as well.

1 Like

Multiple dispatch and metaprogramming even help when you go very low level. For example, here is what an 8x6 kernel for AVX2 looks like in C:

#include<immintrin.h>

void kernel8xNx6(double* D, double* A, double* X, long N){
    __m256d D11, D12, D21, D22, D31, D32, D41, D42, D51, D52, D61, D62;
    __m256d A1, A2, X1, X2, X3, X4, X5, X6;
    D11 = _mm256_loadu_pd(D);
    D12 = _mm256_loadu_pd(D + 4);
    D21 = _mm256_loadu_pd(D + 8);
    D22 = _mm256_loadu_pd(D + 12);
    D31 = _mm256_loadu_pd(D + 16);
    D32 = _mm256_loadu_pd(D + 20);
    D41 = _mm256_loadu_pd(D + 24);
    D42 = _mm256_loadu_pd(D + 28);
    D51 = _mm256_loadu_pd(D + 32);
    D52 = _mm256_loadu_pd(D + 36);
    D61 = _mm256_loadu_pd(D + 40);
    D62 = _mm256_loadu_pd(D + 44);
    for(long i = 0; i < N; i++){
        A1 = _mm256_loadu_pd(A + 8*i);
        A2 = _mm256_loadu_pd(A + 8*i + 4);
        X1 = _mm256_broadcast_sd(X+i);
        D11 = _mm256_fmadd_pd(A1, X1, D11);
        D12 = _mm256_fmadd_pd(A2, X1, D12);
        X2 = _mm256_broadcast_sd(X+(N+i));
        D21 = _mm256_fmadd_pd(A1, X2, D21);
        D22 = _mm256_fmadd_pd(A2, X2, D22);
        X3 = _mm256_broadcast_sd(X+(2*N+i));
        D31 = _mm256_fmadd_pd(A1, X3, D31);
        D32 = _mm256_fmadd_pd(A2, X3, D32);
        X4 = _mm256_broadcast_sd(X+(3*N+i));
        D41 = _mm256_fmadd_pd(A1, X4, D41);
        D42 = _mm256_fmadd_pd(A2, X4, D42);
        X5 = _mm256_broadcast_sd(X+(4*N+i));
        D51 = _mm256_fmadd_pd(A1, X5, D51);
        D52 = _mm256_fmadd_pd(A2, X5, D52);
        X6 = _mm256_broadcast_sd(X+(5*N+i));
        D61 = _mm256_fmadd_pd(A1, X6, D61);
        D62 = _mm256_fmadd_pd(A2, X6, D62);
    }
    _mm256_storeu_pd(D, D11);
    _mm256_storeu_pd(D + 4, D12);
    _mm256_storeu_pd(D + 8, D21);
    _mm256_storeu_pd(D + 12, D22);
    _mm256_storeu_pd(D + 16, D31);
    _mm256_storeu_pd(D + 20, D32);
    _mm256_storeu_pd(D + 24, D41);
    _mm256_storeu_pd(D + 28, D42);
    _mm256_storeu_pd(D + 32, D51);
    _mm256_storeu_pd(D + 36, D52);
    _mm256_storeu_pd(D + 40, D61);
    _mm256_storeu_pd(D + 44, D62);
}

You can compile that with
gcc -O2 -march=haswell -shared -fPIC loopkernel.c -o libcloopkernel.so
I tested that on Ryzen, and -march=native -mprefer-vector-width=256 was slow and the assembly looked like garbage. Maybe I should file an issue with gcc. -march=haswell worked as intended; I'd probably stick with that on recent computers.
Not something I’ve had to worry about in Julia!
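
For anyone who wants to reproduce the comparison, a hypothetical way to call that compiled kernel from Julia looks like this (the layout has to match the C code: D is 8×6, A is 8×N, X is N×6, all column-major and contiguous, and the library path assumes the .so sits in the current directory):

N = 40
D = zeros(8, 6); A = rand(8, N); X = rand(N, 6)
ccall((:kernel8xNx6, "./libcloopkernel.so"), Cvoid,
      (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong),
      D, A, X, N)
D ≈ A * X   # should be true, since the kernel accumulates A*X on top of the initial D (zeros here)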

In C, you'd have to choose what's most appropriate at run time, and have all the combinations of sizes available (8xNx6 is an optimal base, but for the edges – e.g., if D and X in D = A * X each have 200 columns – you'd also need an 8xNx2 kernel, because 200 % 6 = 2).
So, between all the SIMD vector lengths you support, that ends up being a heck of a lot of kernels to generate – changing the number of repetitions of function calls, the function calls themselves, and the types.
Maybe you could use a build system to generate them at compile time just for the target architecture.

In Julia, type inference means no need for declarations, multiple dispatch means no need to change the functions we’re calling, and @generated functions mean we just need to write one thing that will compile into whatever kernels happen to be needed, the first time they are needed.

This is the Julia kernel I am using:

function kernel_quote(Mₖ,Pₖ,stride_AD,stride_X,N,T)
    T_size = sizeof(T)
    AD_stride = stride_AD * T_size
    X_stride = stride_X * T_size
    L = REGISTER_SIZE ÷ T_size
    Q, r = divrem(Mₖ, L) #Assuming Mₖ is a multiple of L
    if Q > 0
        r == 0 || throw("Number of rows $Mₖ not a multiple of register size: $REGISTER_SIZE.")
    else
        L = r
        Q = 1
    end
    V = Vec{L,T}
    quote
        @nexprs $Pₖ p -> @nexprs $Q q -> Dx_p_q = vload($V, pD + $REGISTER_SIZE*(q-1) + $AD_stride*(p-1))
        for n ∈ 0:$(N-1)
            @nexprs $Pₖ p -> begin
                vX = $V(unsafe_load(pX + n*$T_size + (p-1)*$X_stride))
                @nexprs $Q q -> begin
                    vA_q = vload($V, pA + n*$AD_stride + $REGISTER_SIZE*(q-1))
                    Dx_p_q = fma(vA_q, vX, Dx_p_q)
                end
            end
        end
        @nexprs $Pₖ p -> @nexprs $Q q -> vstore(Dx_p_q, pD + $REGISTER_SIZE*(q-1) + $AD_stride*(p-1))
        nothing
    end
end
@generated function kernel!(pD::Ptr{T}, pA::Ptr{T}, pX::Ptr{T}, ::Kernel{Mₖ,Pₖ,stride_AD,stride_X,N}) where {Mₖ,Pₖ,stride_AD,stride_X,N,T}
    kernel_quote(Mₖ,Pₖ,stride_AD,stride_X,N,T)
end

and it follows the appropriate pattern. Then it’s just a matter of defining the correct constants (variables written in all caps; done while building jBLAS), and then calling with the correct size.
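
A hypothetical call site (sizes and strides here are chosen purely for illustration, following the Kernel{Mₖ,Pₖ,stride_AD,stride_X,N} parameter order in the signature above) would look something like:

# Multiply an 8×40 block of A by a 40×6 block of X into D; leading dimensions are 8 (A, D) and 40 (X).
D = zeros(8, 6); A = rand(8, 40); X = rand(40, 6)
GC.@preserve D A X begin
    kernel!(pointer(D), pointer(A), pointer(X), Kernel{8,6,8,40,40}())
end
D ≈ A * X   # true when the kernel ran correctly, since D started at zero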

Dependencies are SIMD.jl and Base.Cartesian. The type Kernel has no fields. REGISTER_SIZE is a global constant, that gets defined while building jBLAS.jl. That, plus automatically compiling for each individual computer are two of the big wins.
The above kernel is extremely flexible. I could experiment with making things like stride not known while compiling the kernel. No idea if that makes things harder on automatic prefetching or not.

As for MKL Low Overhead Mode: being parameterized by matrix size cuts down on runtime overhead (which is what already let this be faster than StaticArrays for very small matrices). Things like the sizes of the kernels to call, and in what pattern, are handled at compile time.
That's something MKL can't do. If this were to evolve into a registered library - while compilation is satisfyingly fast at the moment - I'd still benchmark that, and then try a version that works for matrices whose size isn't known at compile time.

FWIW, the webinar on MKL_INLINE is from MKL 11.2 and several years old, and MKL_INLINE has since been renamed MKL_DIRECT_CALL… so performance has likely improved. But at that time, for 8x8 matrices, it delivered between 50% and 100% more GFLOPS.

I think if a real effort went into making a Julia BLAS library, with multithreading, possibly Strassen's algorithm, etc., it could be extremely fast across sizes. I also think matrix operations on composite types are something Julia will perform especially well in, thanks to StructsOfArrays and the like allowing easy SIMD across object instances.

15 Likes

Did you manage to compile Julia with MKL and the Sequential and Direct Call on? How?
Could you share some numbers?

Your work is really inspiring.
I really hope it will bring speed improvements to many use cases.

6 Likes

Potentially of interest to parties here: I shared a pure-Julia, BLIS-style gemm demo on slack earlier this summer, which is now on github: GitHub - Sacha0/GemmDemo.jl: A pure-Julia, BLIS-style dgemm demo.. Best! S

10 Likes

Cool! I see you’re also listed as a BLIS retreat participant, so I should see you there Monday and Tuesday. Yingbo Ma and a few others will be there, too.

http://www.cs.utexas.edu/users/flame/BLISRetreat2018/participants.html

1 Like

Fantastic job!
I think this effort is likely to drive the HPC community to consider Julia more seriously (it should be presented at SC!).

Could this be mixed with your accurate arithmetic efforts (Julia equivalent of Python's "fsum" for floating point summation - #54 by ffevotte) to obtain fast and accurate gemms?

8 Likes

I would definitely be interested in accurate QR & Cholesky decompositions, and triangular equation solvers.

I don’t know much about it, but @ffevotte mentioned xblas.
When I see BLAS on netlib though, I worry about how well optimized it is.

Because the hardest part about BLAS is not getting memory throttled* (the kernels are easy, especially in Julia), I think optimized accurate BLAS implementations will be simpler.

*I've not done much there – your recent code is much faster than mine for 4096 x 4096 matrices, despite all the memory allocations. When I get around to this again, I'll:

  • Actually use packing.
  • Rethink the blocking pattern. Right now it does a bunch of fancy stuff related to the sizes of different caches, without actually seeing any improvement – a failed experiment.
  • Experiment more with prefetching.

If you’d like to experiment and want to get rid of memory allocations, you could try UnsafeArrays like @baggepinnen suggested, or define your own pointer-based array types.
I did the latter in PaddedMatrices, here for fixed sizes (ideal for kernels), and here with dynamic sizes.
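
For reference, a pointer-backed matrix type can be as small as this (a bare-bones sketch, not PaddedMatrices' actual types; no bounds checks, and the caller must keep the underlying memory alive):

struct PtrMatrix{T} <: AbstractMatrix{T}
    ptr::Ptr{T}
    nrow::Int
    ncol::Int
end
Base.size(A::PtrMatrix) = (A.nrow, A.ncol)
Base.getindex(A::PtrMatrix, i::Int, j::Int) = unsafe_load(A.ptr, (j - 1) * A.nrow + i)
Base.setindex!(A::PtrMatrix{T}, v, i::Int, j::Int) where {T} =
    unsafe_store!(A.ptr, convert(T, v), (j - 1) * A.nrow + i)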

I’m fairly happy with my kernel generating code, but I would dub the rest of it a failed experiment / not worth looking at too closely. I think @YingboMa’s JuliaBLAS is better beyond very modest sizes. :slight_smile:

7 Likes

Hi Chris,
Thank you for the links. I did not succeed in using UnsafeArrays; the API seems to differ slightly from standard views, and a direct replacement led to an error. I will try PaddedMatrices. JuliaBLAS is indeed impressive!

1 Like

I have only really looked at the DDOT implementation in XBLAS, but it was not really optimized: no vectorization, no parallelization, no fancy stuff. I would say that the real interest lies in the algorithms used (which are easy to understand given how they are coded), not really in the implementation itself.

3 Likes

Most of the BLAS optimization effort is in BLAS-3 (matrix–matrix) operations, which have the greatest potential speedup because they do O(n^3) work for O(n^2) data. BLAS-1 operations like ddot, in contrast, benefit much less from “fancy stuff” because they are ultimately memory-bound.
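
To put rough numbers on that (my own back-of-the-envelope calculation, not from the thread):

n = 256
2n^3 / (3n^2)   # gemm: ≈ 171 flops per matrix element touched
2n / (2n)       # ddot: 1 flop per element, so memory bandwidth dominates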

4 Likes

Yes.

However, XBLAS puts the focus on “extra precision”, internally using David Bailey’s double-double precision. This means that the arithmetic intensity of BLAS-1 operations increases, to the point where a naive implementation becomes CPU-bound when it shouldn’t. As far as I can tell, this is the case with the ddot implementation in XBLAS. Which is why I’m saying that, as far as ddot_x (XBLAS’ extra-precision ddot) is concerned, the implementation should be considered a proof-of-concept showcasing adequate algorithms, rather than something optimized.

5 Likes

Hello, I was brought here by the Nim v1.0 post. As someone who also wrote a BLAS from scratch in Nim (relevant details here: Version 1.0 released of Nim Programming Language - #47 by mratsim), here are some notes:

  • packing is a very important performance enabler; according to libxsmm, for inputs over 128x128, without it you can't feed the SIMD units fast enough because the data will not be contiguous.
  • prefetching helped me close the last 15% of the performance gap with OpenBLAS/MKL

Now, on automatic cache-size computation: it was also a failed experiment for me, but:

  • the Goto and BLIS papers recommend using half of the L1 cache for a panel of one packed matrix and half of the L2 cache for another, so that the other half can be used for data movement; details here (see also the sketch after this list).
  • On a typical PC, you share the core with plenty of applications that compete with you for cache
  • Hyperthreading will mess up your algorithm if 2 sibling threads try to load different panels: they will fill the whole L1/L2, and that data will have to be discarded on the next data movement.
  • The BLAS implementations are all about optimizing memory-bandwidth so that it can keep up with CPU compute, but hyperthreading will double the memory bandwidth requirements to feed both HT cores.
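
A rough tile-size heuristic in the spirit of that half-the-cache rule might look like the following (my own simplification, not code from any of the packages discussed; the L1/L2 sizes, mr, and nr are illustrative defaults):

function tile_sizes(::Type{T}; L1 = 32 * 1024, L2 = 256 * 1024, mr = 8, nr = 6) where {T}
    kc = (L1 ÷ 2) ÷ (nr * sizeof(T))   # panel depth so a kc×nr panel of X fills ≤ half of L1
    mc = (L2 ÷ 2) ÷ (kc * sizeof(T))   # rows of A so an mc×kc block fills ≤ half of L2
    mc = max(mr, (mc ÷ mr) * mr)       # round down to a multiple of the kernel's row count
    (mc = mc, kc = kc, nr = nr)
end

tile_sizes(Float64)   # (mc = 48, kc = 341, nr = 6) with these defaults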

In conclusion, you might see vastly different results on a server with hyperthreading disabled.
Note that I’m not saying hyperthreading is bad, but GEMM is an algorithm where there are known ways to fully occupy the CPU pipeline so you don’t need hardware concurrency.

28 Likes

@mratsim do you mind if I study your code and replicate what you have done there? I don't know how the licensing works. Can I license my code under the MIT license?

Yes of course.

My code is licensed under Apache, which is basically MIT + an explicit patent grant should I hold any patent related to the code I wrote (not that I have any, and software patents are invalid in Europe anyway).

Furthermore, your work will be a derivative work written from scratch, so you are free to use any license you want when doing derivative work from Apache-licensed code.

The Apache License is permissive; unlike the so-called copyleft licenses, it does not require a derivative work of the software, or modifications to the original, to be distributed using the same license. It still requires application of the same license to all unmodified parts.

As you will modify everything to rewrite in Julia, you can apply any license you want to the new code, including commercial licenses if you so choose.

If you are still in doubt:

I hereby declare that I am OK with @yingboma doing derivative work under the MIT license of my matrix multiplication code in the Laser project at the following link: https://github.com/numforge/laser/tree/e660eeeb723426e80a7b1187864323d85527d18c/laser/primitives/matrix_multiplication .


Now, for study, you can have a look at my research here: https://github.com/numforge/laser/blob/master/research/matrix_multiplication_optimisation_resources.md

A good first step is probably to follow this tutorial in Julia: http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

Then the BLIS paper is excellent (it's the one both the tutorial and I follow), and I used the Goto paper to choose my tile sizes.

Note that the BLIS folks have very interesting things planned for the future, based on a discussion about a BLIS backend for PyTorch for better AMD support.

24 Likes