We can write an optimized BLAS library in pure Julia


#1

There’s a README with a lot of benchmark output dumped in it here:

But the highlights are that I found a fairly optimal “compute kernel” (at least for my architecture on Julia 0.7; performance is poor on Julia 0.6):

julia> using StaticArrays, TriangularMatrices, BenchmarkTools

julia> a8_16_t = srandmat(8,16); # static matrix type from TriangularMatrices

julia> x16_4_t = srandmat(16,4);

julia> a8_16_s = SMatrix{8,16}(a8_16_t.data); #convert to a StaticMatrix

julia> x16_4_s = SMatrix{16,4}(x16_4_t.data);

julia> @btime $a8_16_t * $x16_4_t
  32.495 ns (0 allocations: 0 bytes)
8×4 TriangularMatrices.StaticRecursiveMatrix{Float64,8,4,32}:
 10.1893    -5.8437    2.80989    0.279144 
 -5.93568    9.09442  -3.71548   -2.54874  
  1.17599    1.50367  -3.56322    7.24623  
 -9.47634    2.39815  -2.40341   -1.34133  
  3.54455    3.98109   8.13257   -1.42904  
  0.505972  -4.51573  -2.23145   -1.92717  
  2.26573    2.61974  -0.296927  -0.0811224
  3.19529    1.09088   1.15119   -3.62086  

julia> @btime $a8_16_s * $x16_4_s
  72.788 ns (0 allocations: 0 bytes)
8×4 SArray{Tuple{8,4},Float64,2,32}:
 10.1893    -5.8437    2.80989    0.279144 
 -5.93568    9.09442  -3.71548   -2.54874  
  1.17599    1.50367  -3.56322    7.24623  
 -9.47634    2.39815  -2.40341   -1.34133  
  3.54455    3.98109   8.13257   -1.42904  
  0.505972  -4.51573  -2.23145   -1.92717  
  2.26573    2.61974  -0.296927  -0.0811224
  3.19529    1.09088   1.15119   -3.62086  

The short of how this works, given registers of size 4 (e.g., AVX): for an 8xN matrix A and an NxP matrix X, the compiler uses a vbroadcastsd instruction to fill a register with the first value of X, fills two registers with the first 8 values of A, and multiplies them to produce d_1 (the initial first 4 elements of the output) and d_2 (the initial next 4 elements).
It then broadcasts the next element of X, loads the next 8 elements of A, and updates d_1 and d_2 with two fmas.

To get this to happen on Julia 0.7, I just create a generated function that looks similar to this:

            D[1] = X[1] * A[1]
            D[2] = X[1] * A[2]
            D[3] = X[1] * A[3]
            D[4] = X[1] * A[4]
            D[5] = X[1] * A[5]
            D[6] = X[1] * A[6]
            D[7] = X[1] * A[7]
            D[8] = X[1] * A[8]
            D[1] += X[2] * A[9]
            D[2] += X[2] * A[10]
            D[3] += X[2] * A[11]
            D[4] += X[2] * A[12]
            D[5] += X[2] * A[13]
            D[6] += X[2] * A[14]
            D[7] += X[2] * A[15]
            D[8] += X[2] * A[16]
            D[1] += X[3] * A[17]

and Julia takes care of the rest. :grinning:
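For anyone curious what such a generated function might look like, here is a minimal hypothetical sketch (this is not the actual TriangularMatrices code; `kernel_mul` and its tuple-based storage are assumptions for illustration). It emits exactly the unrolled broadcast-and-fma pattern above for an MxN block of A times an NxP block of X, both stored column-major in tuples:

```julia
# Hypothetical sketch, not the TriangularMatrices implementation:
# a @generated function emitting the unrolled kernel shown above.
@generated function kernel_mul(A::NTuple{LA,Float64}, X::NTuple{LX,Float64},
                               ::Val{M}, ::Val{N}, ::Val{P}) where {LA,LX,M,N,P}
    q = Expr(:block)
    d = [Symbol(:d_, m, :_, p) for m in 1:M, p in 1:P]
    for p in 1:P
        for m in 1:M  # initialize column p with the first broadcast-multiply
            push!(q.args, :($(d[m, p]) = X[$(1 + (p - 1) * N)] * A[$m]))
        end
        for n in 2:N, m in 1:M  # remaining terms become fused multiply-adds
            push!(q.args, :($(d[m, p]) = muladd(X[$(n + (p - 1) * N)],
                                                A[$(m + (n - 1) * M)],
                                                $(d[m, p]))))
        end
    end
    push!(q.args, :(tuple($(vec(d)...))))  # return D column-major
    q
end
```

Because every index is a compile-time constant, LLVM sees straight-line code over fixed offsets, which is what lets it pick the vbroadcastsd/vfmadd sequence.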
My tests were done with:

julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)

Performance will almost certainly vary with other architectures.

To try to scale this up to larger matrices, I went with a recursive memory layout.
These are the linear indices of a 50x50 matrix:

julia> rm = RecursiveMatrix{Float64,50,50,50^2}();

julia> function set_to_ind!(A)
           @inbounds for i = 1:length(A)
               A[i] = i
           end
           A
       end
set_to_ind! (generic function with 1 method)

julia> set_to_ind!(rm)
50×50 TriangularMatrices.RecursiveMatrix{Float64,50,50,2500}:
   1.0    9.0   17.0   25.0   33.0   41.0   49.0   57.0  129.0  137.0  145.0  153.0  161.0  169.0  …  1377.0  1385.0  1585.0  1593.0  1601.0  1609.0  1617.0  1625.0  1633.0  1641.0  1649.0  1657.0
   2.0   10.0   18.0   26.0   34.0   42.0   50.0   58.0  130.0  138.0  146.0  154.0  162.0  170.0     1378.0  1386.0  1586.0  1594.0  1602.0  1610.0  1618.0  1626.0  1634.0  1642.0  1650.0  1658.0
   3.0   11.0   19.0   27.0   35.0   43.0   51.0   59.0  131.0  139.0  147.0  155.0  163.0  171.0     1379.0  1387.0  1587.0  1595.0  1603.0  1611.0  1619.0  1627.0  1635.0  1643.0  1651.0  1659.0
   4.0   12.0   20.0   28.0   36.0   44.0   52.0   60.0  132.0  140.0  148.0  156.0  164.0  172.0     1380.0  1388.0  1588.0  1596.0  1604.0  1612.0  1620.0  1628.0  1636.0  1644.0  1652.0  1660.0
   5.0   13.0   21.0   29.0   37.0   45.0   53.0   61.0  133.0  141.0  149.0  157.0  165.0  173.0     1381.0  1389.0  1589.0  1597.0  1605.0  1613.0  1621.0  1629.0  1637.0  1645.0  1653.0  1661.0
   6.0   14.0   22.0   30.0   38.0   46.0   54.0   62.0  134.0  142.0  150.0  158.0  166.0  174.0  …  1382.0  1390.0  1590.0  1598.0  1606.0  1614.0  1622.0  1630.0  1638.0  1646.0  1654.0  1662.0
   7.0   15.0   23.0   31.0   39.0   47.0   55.0   63.0  135.0  143.0  151.0  159.0  167.0  175.0     1383.0  1391.0  1591.0  1599.0  1607.0  1615.0  1623.0  1631.0  1639.0  1647.0  1655.0  1663.0
   8.0   16.0   24.0   32.0   40.0   48.0   56.0   64.0  136.0  144.0  152.0  160.0  168.0  176.0     1384.0  1392.0  1592.0  1600.0  1608.0  1616.0  1624.0  1632.0  1640.0  1648.0  1656.0  1664.0
  65.0   73.0   81.0   89.0   97.0  105.0  113.0  121.0  193.0  201.0  209.0  217.0  225.0  233.0     1441.0  1449.0  1665.0  1673.0  1681.0  1689.0  1697.0  1705.0  1713.0  1721.0  1729.0  1737.0
  66.0   74.0   82.0   90.0   98.0  106.0  114.0  122.0  194.0  202.0  210.0  218.0  226.0  234.0     1442.0  1450.0  1666.0  1674.0  1682.0  1690.0  1698.0  1706.0  1714.0  1722.0  1730.0  1738.0
  67.0   75.0   83.0   91.0   99.0  107.0  115.0  123.0  195.0  203.0  211.0  219.0  227.0  235.0  …  1443.0  1451.0  1667.0  1675.0  1683.0  1691.0  1699.0  1707.0  1715.0  1723.0  1731.0  1739.0
  68.0   76.0   84.0   92.0  100.0  108.0  116.0  124.0  196.0  204.0  212.0  220.0  228.0  236.0     1444.0  1452.0  1668.0  1676.0  1684.0  1692.0  1700.0  1708.0  1716.0  1724.0  1732.0  1740.0
  69.0   77.0   85.0   93.0  101.0  109.0  117.0  125.0  197.0  205.0  213.0  221.0  229.0  237.0     1445.0  1453.0  1669.0  1677.0  1685.0  1693.0  1701.0  1709.0  1717.0  1725.0  1733.0  1741.0
  70.0   78.0   86.0   94.0  102.0  110.0  118.0  126.0  198.0  206.0  214.0  222.0  230.0  238.0     1446.0  1454.0  1670.0  1678.0  1686.0  1694.0  1702.0  1710.0  1718.0  1726.0  1734.0  1742.0
  71.0   79.0   87.0   95.0  103.0  111.0  119.0  127.0  199.0  207.0  215.0  223.0  231.0  239.0     1447.0  1455.0  1671.0  1679.0  1687.0  1695.0  1703.0  1711.0  1719.0  1727.0  1735.0  1743.0
  72.0   80.0   88.0   96.0  104.0  112.0  120.0  128.0  200.0  208.0  216.0  224.0  232.0  240.0  …  1448.0  1456.0  1672.0  1680.0  1688.0  1696.0  1704.0  1712.0  1720.0  1728.0  1736.0  1744.0
 257.0  265.0  273.0  281.0  289.0  297.0  305.0  313.0  321.0  329.0  337.0  345.0  353.0  361.0     1569.0  1577.0  1745.0  1753.0  1761.0  1769.0  1777.0  1785.0  1793.0  1801.0  1809.0  1817.0
 258.0  266.0  274.0  282.0  290.0  298.0  306.0  314.0  322.0  330.0  338.0  346.0  354.0  362.0     1570.0  1578.0  1746.0  1754.0  1762.0  1770.0  1778.0  1786.0  1794.0  1802.0  1810.0  1818.0
 259.0  267.0  275.0  283.0  291.0  299.0  307.0  315.0  323.0  331.0  339.0  347.0  355.0  363.0     1571.0  1579.0  1747.0  1755.0  1763.0  1771.0  1779.0  1787.0  1795.0  1803.0  1811.0  1819.0
 260.0  268.0  276.0  284.0  292.0  300.0  308.0  316.0  324.0  332.0  340.0  348.0  356.0  364.0     1572.0  1580.0  1748.0  1756.0  1764.0  1772.0  1780.0  1788.0  1796.0  1804.0  1812.0  1820.0
 261.0  269.0  277.0  285.0  293.0  301.0  309.0  317.0  325.0  333.0  341.0  349.0  357.0  365.0  …  1573.0  1581.0  1749.0  1757.0  1765.0  1773.0  1781.0  1789.0  1797.0  1805.0  1813.0  1821.0
   ⋮                                  ⋮                                  ⋮                         ⋱                     ⋮                                       ⋮                                  
 582.0  590.0  598.0  606.0  614.0  622.0  630.0  638.0  710.0  718.0  726.0  734.0  742.0  750.0     2006.0  2014.0  2246.0  2254.0  2262.0  2270.0  2278.0  2286.0  2294.0  2302.0  2310.0  2318.0
 583.0  591.0  599.0  607.0  615.0  623.0  631.0  639.0  711.0  719.0  727.0  735.0  743.0  751.0  …  2007.0  2015.0  2247.0  2255.0  2263.0  2271.0  2279.0  2287.0  2295.0  2303.0  2311.0  2319.0
 584.0  592.0  600.0  608.0  616.0  624.0  632.0  640.0  712.0  720.0  728.0  736.0  744.0  752.0     2008.0  2016.0  2248.0  2256.0  2264.0  2272.0  2280.0  2288.0  2296.0  2304.0  2312.0  2320.0
 641.0  649.0  657.0  665.0  673.0  681.0  689.0  697.0  769.0  777.0  785.0  793.0  801.0  809.0     2065.0  2073.0  2321.0  2329.0  2337.0  2345.0  2353.0  2361.0  2369.0  2377.0  2385.0  2393.0
 642.0  650.0  658.0  666.0  674.0  682.0  690.0  698.0  770.0  778.0  786.0  794.0  802.0  810.0     2066.0  2074.0  2322.0  2330.0  2338.0  2346.0  2354.0  2362.0  2370.0  2378.0  2386.0  2394.0
 643.0  651.0  659.0  667.0  675.0  683.0  691.0  699.0  771.0  779.0  787.0  795.0  803.0  811.0     2067.0  2075.0  2323.0  2331.0  2339.0  2347.0  2355.0  2363.0  2371.0  2379.0  2387.0  2395.0
 644.0  652.0  660.0  668.0  676.0  684.0  692.0  700.0  772.0  780.0  788.0  796.0  804.0  812.0  …  2068.0  2076.0  2324.0  2332.0  2340.0  2348.0  2356.0  2364.0  2372.0  2380.0  2388.0  2396.0
 645.0  653.0  661.0  669.0  677.0  685.0  693.0  701.0  773.0  781.0  789.0  797.0  805.0  813.0     2069.0  2077.0  2325.0  2333.0  2341.0  2349.0  2357.0  2365.0  2373.0  2381.0  2389.0  2397.0
 646.0  654.0  662.0  670.0  678.0  686.0  694.0  702.0  774.0  782.0  790.0  798.0  806.0  814.0     2070.0  2078.0  2326.0  2334.0  2342.0  2350.0  2358.0  2366.0  2374.0  2382.0  2390.0  2398.0
 647.0  655.0  663.0  671.0  679.0  687.0  695.0  703.0  775.0  783.0  791.0  799.0  807.0  815.0     2071.0  2079.0  2327.0  2335.0  2343.0  2351.0  2359.0  2367.0  2375.0  2383.0  2391.0  2399.0
 648.0  656.0  664.0  672.0  680.0  688.0  696.0  704.0  776.0  784.0  792.0  800.0  808.0  816.0     2072.0  2080.0  2328.0  2336.0  2344.0  2352.0  2360.0  2368.0  2376.0  2384.0  2392.0  2400.0
 833.0  841.0  849.0  857.0  865.0  873.0  881.0  889.0  913.0  921.0  929.0  937.0  945.0  953.0  …  2209.0  2217.0  2401.0  2409.0  2417.0  2425.0  2433.0  2441.0  2449.0  2457.0  2481.0  2489.0
 834.0  842.0  850.0  858.0  866.0  874.0  882.0  890.0  914.0  922.0  930.0  938.0  946.0  954.0     2210.0  2218.0  2402.0  2410.0  2418.0  2426.0  2434.0  2442.0  2450.0  2458.0  2482.0  2490.0
 835.0  843.0  851.0  859.0  867.0  875.0  883.0  891.0  915.0  923.0  931.0  939.0  947.0  955.0     2211.0  2219.0  2403.0  2411.0  2419.0  2427.0  2435.0  2443.0  2451.0  2459.0  2483.0  2491.0
 836.0  844.0  852.0  860.0  868.0  876.0  884.0  892.0  916.0  924.0  932.0  940.0  948.0  956.0     2212.0  2220.0  2404.0  2412.0  2420.0  2428.0  2436.0  2444.0  2452.0  2460.0  2484.0  2492.0
 837.0  845.0  853.0  861.0  869.0  877.0  885.0  893.0  917.0  925.0  933.0  941.0  949.0  957.0     2213.0  2221.0  2405.0  2413.0  2421.0  2429.0  2437.0  2445.0  2453.0  2461.0  2485.0  2493.0
 838.0  846.0  854.0  862.0  870.0  878.0  886.0  894.0  918.0  926.0  934.0  942.0  950.0  958.0  …  2214.0  2222.0  2406.0  2414.0  2422.0  2430.0  2438.0  2446.0  2454.0  2462.0  2486.0  2494.0
 839.0  847.0  855.0  863.0  871.0  879.0  887.0  895.0  919.0  927.0  935.0  943.0  951.0  959.0     2215.0  2223.0  2407.0  2415.0  2423.0  2431.0  2439.0  2447.0  2455.0  2463.0  2487.0  2495.0
 840.0  848.0  856.0  864.0  872.0  880.0  888.0  896.0  920.0  928.0  936.0  944.0  952.0  960.0     2216.0  2224.0  2408.0  2416.0  2424.0  2432.0  2440.0  2448.0  2456.0  2464.0  2488.0  2496.0
 897.0  899.0  901.0  903.0  905.0  907.0  909.0  911.0  977.0  979.0  981.0  983.0  985.0  987.0     2237.0  2239.0  2465.0  2467.0  2469.0  2471.0  2473.0  2475.0  2477.0  2479.0  2497.0  2499.0
 898.0  900.0  902.0  904.0  906.0  908.0  910.0  912.0  978.0  980.0  982.0  984.0  986.0  988.0     2238.0  2240.0  2466.0  2468.0  2470.0  2472.0  2474.0  2476.0  2478.0  2480.0  2498.0  2500.0

This makes it easier to load blocks of the matrix and apply our kernel.
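For intuition, here is a hypothetical sketch of the simplest version of the idea (the real RecursiveMatrix layout recurses further, as the index dump above shows, and `tiled_index` is an illustrative name, not an actual function from the package): a block-major layout where each kernel-sized 8x8 tile is stored contiguously, column-major within each tile and column-major over tiles.

```julia
# Hypothetical sketch: block-major ("tiled") linear indexing with b×b tiles.
# Column-major within each tile; tiles themselves ordered column-major.
# A recursive layout nests this idea, but the payoff is the same: every
# kernel-sized block is a single contiguous chunk of memory.
function tiled_index(i::Int, j::Int, nrows::Int; b::Int = 8)
    ti, tj = (i - 1) ÷ b, (j - 1) ÷ b   # tile coordinates
    ii, jj = (i - 1) % b, (j - 1) % b   # position inside the tile
    tiles_per_col = cld(nrows, b)       # tiles in one tile-column
    tile = ti + tj * tiles_per_col      # column-major ordering over tiles
    tile * b * b + jj * b + ii + 1
end
```

With nrows = 16, element (1,2) maps to 9, (9,1) to 65, and (1,9) to 129, matching the pattern in the top-left of the 50x50 dump above, where the first 8x8 tile occupies indices 1:64.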

Finally, a dump of random benchmarks:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.034 μs (0.00% GC)
  median time:      1.067 μs (0.00% GC)
  mean time:        1.072 μs (0.00% GC)
  maximum time:     3.270 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.843 μs (0.00% GC)
  median time:      1.890 μs (0.00% GC)
  mean time:        1.898 μs (0.00% GC)
  maximum time:     3.617 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.315 μs (0.00% GC)
  median time:      2.393 μs (0.00% GC)
  mean time:        2.402 μs (0.00% GC)
  maximum time:     6.209 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.388 μs (0.00% GC)
  median time:      4.493 μs (0.00% GC)
  mean time:        4.511 μs (0.00% GC)
  maximum time:     7.435 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.236 μs (0.00% GC)
  median time:      19.647 μs (0.00% GC)
  mean time:        19.791 μs (0.00% GC)
  maximum time:     36.639 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.466 μs (0.00% GC)
  median time:      25.017 μs (0.00% GC)
  mean time:        25.129 μs (0.00% GC)
  maximum time:     46.377 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.815 μs (0.00% GC)
  median time:      35.888 μs (0.00% GC)
  mean time:        35.997 μs (0.00% GC)
  maximum time:     66.545 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.643 μs (0.00% GC)
  median time:      34.225 μs (0.00% GC)
  mean time:        34.370 μs (0.00% GC)
  maximum time:     59.883 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.282 μs (0.00% GC)
  median time:      28.073 μs (0.00% GC)
  mean time:        28.121 μs (0.00% GC)
  maximum time:     48.521 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.858 μs (0.00% GC)
  median time:      31.550 μs (0.00% GC)
  mean time:        31.676 μs (0.00% GC)
  maximum time:     56.095 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     153.478 μs (0.00% GC)
  median time:      157.587 μs (0.00% GC)
  mean time:        158.263 μs (0.00% GC)
  maximum time:     180.910 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     159.730 μs (0.00% GC)
  median time:      162.826 μs (0.00% GC)
  mean time:        163.159 μs (0.00% GC)
  maximum time:     219.483 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.241 ms (0.00% GC)
  median time:      1.273 ms (0.00% GC)
  mean time:        1.276 ms (0.00% GC)
  maximum time:     1.406 ms (0.00% GC)
  --------------
  samples:          3916
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.180 ms (0.00% GC)
  median time:      1.208 ms (0.00% GC)
  mean time:        1.212 ms (0.00% GC)
  maximum time:     1.376 ms (0.00% GC)
  --------------
  samples:          4123
  evals/sample:     1

Disclaimer: the library needs a LOT of polish.
There are definite issues at the moment, and because the recursive layout doesn’t mix well with column- or row-major matrices, it would need enough polish that for most purposes someone could work almost entirely with these matrices.
Tests eventually segfault. Printing “large” matrices doesn’t work, nor does converting them into regular matrices; the compiler seems to hang while compiling the getindex function.
Performance drops when matrix sizes aren’t divisible by 8.
There’s no support yet for multiplication with matrices where sqrt(2)*min(size(A)...) > max(size(A)...).

I am taking a break from this, possibly through early August, because I’m running into deadlines (and maybe I’ll write a blog post; FYI, gfortran could not compile the kernel well – its version took in the area of 120 ns! :wink: {although Flang, a Fortran front end to LLVM, could} – otherwise, it could be worth statically compiling a few kernels).

Thought some members would be interested in seeing this, though, because Julia was able to hold its own against an optimized BLAS library for quite some time. I think with some work, a well-supported pure-Julia BLAS should be possible.


#2

That looks exciting! See also this earlier thread.


#3

There’s definitely a lot of performance left on the table. I’ll comb through Goto’s paper, and the links in that thread when I turn back to this.

For now, tests on Haswell vs OpenBLAS and MKL:

julia> versioninfo()
Julia Version 0.7.0-alpha.126
Commit 0e4493d* (2018-06-14 13:46 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

The gist of the results is that OpenBLAS did a little better relative to my recursive implementation on Haswell, and MKL crushed it – evidence there’s a lot of room for performance optimization, even before multithreading.

First, OpenBLAS:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     709.936 ns (0.00% GC)
  median time:      710.721 ns (0.00% GC)
  mean time:        713.279 ns (0.00% GC)
  maximum time:     1.110 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     140

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.697 μs (0.00% GC)
  median time:      1.720 μs (0.00% GC)
  mean time:        1.721 μs (0.00% GC)
  maximum time:     3.519 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.819 μs (0.00% GC)
  median time:      1.823 μs (0.00% GC)
  mean time:        1.825 μs (0.00% GC)
  maximum time:     3.026 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.579 μs (0.00% GC)
  median time:      3.609 μs (0.00% GC)
  mean time:        3.612 μs (0.00% GC)
  maximum time:     7.539 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.067 μs (0.00% GC)
  median time:      15.128 μs (0.00% GC)
  mean time:        15.145 μs (0.00% GC)
  maximum time:     31.756 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.137 μs (0.00% GC)
  median time:      19.261 μs (0.00% GC)
  mean time:        19.278 μs (0.00% GC)
  maximum time:     46.949 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.554 μs (0.00% GC)
  median time:      31.715 μs (0.00% GC)
  mean time:        31.747 μs (0.00% GC)
  maximum time:     71.165 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.082 μs (0.00% GC)
  median time:      29.231 μs (0.00% GC)
  mean time:        29.360 μs (0.00% GC)
  maximum time:     67.414 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.027 μs (0.00% GC)
  median time:      22.117 μs (0.00% GC)
  mean time:        22.138 μs (0.00% GC)
  maximum time:     45.415 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.559 μs (0.00% GC)
  median time:      24.680 μs (0.00% GC)
  mean time:        24.702 μs (0.00% GC)
  maximum time:     52.463 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.301 μs (0.00% GC)
  median time:      123.649 μs (0.00% GC)
  mean time:        124.203 μs (0.00% GC)
  maximum time:     195.992 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     117.952 μs (0.00% GC)
  median time:      118.316 μs (0.00% GC)
  mean time:        118.858 μs (0.00% GC)
  maximum time:     215.866 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.004 ms (0.00% GC)
  median time:      1.005 ms (0.00% GC)
  mean time:        1.007 ms (0.00% GC)
  maximum time:     1.209 ms (0.00% GC)
  --------------
  samples:          4960
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     793.676 μs (0.00% GC)
  median time:      794.787 μs (0.00% GC)
  mean time:        799.578 μs (0.00% GC)
  maximum time:     1.524 ms (0.00% GC)
  --------------
  samples:          6246
  evals/sample:     1

and now, MKL:

julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);

julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);

julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     716.145 ns (0.00% GC)
  median time:      717.210 ns (0.00% GC)
  mean time:        717.799 ns (0.00% GC)
  maximum time:     1.063 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     138

julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     775.528 ns (0.00% GC)
  median time:      778.704 ns (0.00% GC)
  mean time:        781.224 ns (0.00% GC)
  maximum time:     1.283 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     108

julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);

julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);

julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.808 μs (0.00% GC)
  median time:      1.818 μs (0.00% GC)
  mean time:        1.826 μs (0.00% GC)
  maximum time:     4.082 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.620 μs (0.00% GC)
  median time:      1.635 μs (0.00% GC)
  mean time:        1.640 μs (0.00% GC)
  maximum time:     4.065 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);

julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);

julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.166 μs (0.00% GC)
  median time:      15.234 μs (0.00% GC)
  mean time:        15.285 μs (0.00% GC)
  maximum time:     47.069 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.037 μs (0.00% GC)
  median time:      11.142 μs (0.00% GC)
  mean time:        11.183 μs (0.00% GC)
  maximum time:     38.379 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71);  # performance currently drops a lot

julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8

julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.784 μs (0.00% GC)
  median time:      31.974 μs (0.00% GC)
  mean time:        32.073 μs (0.00% GC)
  maximum time:     70.863 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.998 μs (0.00% GC)
  median time:      18.165 μs (0.00% GC)
  mean time:        18.340 μs (0.00% GC)
  maximum time:     55.012 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);

julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);

julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.862 μs (0.00% GC)
  median time:      21.953 μs (0.00% GC)
  mean time:        22.026 μs (0.00% GC)
  maximum time:     44.255 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.762 μs (0.00% GC)
  median time:      16.854 μs (0.00% GC)
  mean time:        16.898 μs (0.00% GC)
  maximum time:     52.588 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);

julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);

julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     123.192 μs (0.00% GC)
  median time:      123.682 μs (0.00% GC)
  mean time:        123.911 μs (0.00% GC)
  maximum time:     170.757 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     88.973 μs (0.00% GC)
  median time:      89.356 μs (0.00% GC)
  mean time:        89.715 μs (0.00% GC)
  maximum time:     157.206 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...

julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);

julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     999.054 μs (0.00% GC)
  median time:      1.000 ms (0.00% GC)
  mean time:        1.003 ms (0.00% GC)
  maximum time:     1.109 ms (0.00% GC)
  --------------
  samples:          4981
  evals/sample:     1

julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     674.136 μs (0.00% GC)
  median time:      678.050 μs (0.00% GC)
  mean time:        680.462 μs (0.00% GC)
  maximum time:     948.319 μs (0.00% GC)
  --------------
  samples:          7339
  evals/sample:     1

This is the same old story of OpenBLAS vs MKL: both do well when matrices are big, but MKL does better for small matrices (and factorizations).