There’s a README with a lot of output dumped in it here:
https://github.com/chriselrod/TriangularMatrices.jl
But the highlights are as follows.
I found a fairly optimal “compute kernel” (at least for my architecture on Julia 0.7; performance is poor on Julia 0.6):
julia> using StaticArrays, TriangularMatrices, BenchmarkTools
julia> a8_16_t = srandmat(8,16); # static matrix type from TriangularMatrices
julia> x16_4_t = srandmat(16,4);
julia> a8_16_s = SMatrix{8,16}(a8_16_t.data); #convert to a StaticMatrix
julia> x16_4_s = SMatrix{16,4}(x16_4_t.data);
julia> @btime $a8_16_t * $x16_4_t
32.495 ns (0 allocations: 0 bytes)
8×4 TriangularMatrices.StaticRecursiveMatrix{Float64,8,4,32}:
10.1893 -5.8437 2.80989 0.279144
-5.93568 9.09442 -3.71548 -2.54874
1.17599 1.50367 -3.56322 7.24623
-9.47634 2.39815 -2.40341 -1.34133
3.54455 3.98109 8.13257 -1.42904
0.505972 -4.51573 -2.23145 -1.92717
2.26573 2.61974 -0.296927 -0.0811224
3.19529 1.09088 1.15119 -3.62086
julia> @btime $a8_16_s * $x16_4_s
72.788 ns (0 allocations: 0 bytes)
8×4 SArray{Tuple{8,4},Float64,2,32}:
10.1893 -5.8437 2.80989 0.279144
-5.93568 9.09442 -3.71548 -2.54874
1.17599 1.50367 -3.56322 7.24623
-9.47634 2.39815 -2.40341 -1.34133
3.54455 3.98109 8.13257 -1.42904
0.505972 -4.51573 -2.23145 -1.92717
2.26573 2.61974 -0.296927 -0.0811224
3.19529 1.09088 1.15119 -3.62086
The short of how this works, given registers that hold 4 doubles (e.g., AVX), is: for an 8×N matrix A and an N×P matrix X, the compiler uses a vbroadcastsd instruction to fill a register with the first value of X, fills two registers with the first 8 values of A, and multiplies them to produce d_1 (the initial first 4 elements of the output) and d_2 (the initial next 4 elements). It then broadcasts the next element of X, loads the next 8 elements of A, and two fmas update d_1 and d_2.
To get this to happen on Julia 0.7, I just create a generated function that emits code similar to this:
D[1] = X[1] * A[1]
D[2] = X[1] * A[2]
D[3] = X[1] * A[3]
D[4] = X[1] * A[4]
D[5] = X[1] * A[5]
D[6] = X[1] * A[6]
D[7] = X[1] * A[7]
D[8] = X[1] * A[8]
D[1] += X[2] * A[9]
D[2] += X[2] * A[10]
D[3] += X[2] * A[11]
D[4] += X[2] * A[12]
D[5] += X[2] * A[13]
D[6] += X[2] * A[14]
D[7] += X[2] * A[15]
D[8] += X[2] * A[16]
D[1] += X[3] * A[17]
and Julia takes care of the rest.
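For intuition, here is a minimal sketch, using plain StaticArrays types rather than the package’s own, of how such a fully unrolled kernel can be emitted with a generated function. The name unrolled_mul and everything inside it are illustrative assumptions, not the TriangularMatrices.jl implementation:

using StaticArrays

@generated function unrolled_mul(A::SMatrix{M,N,T}, X::SMatrix{N,P,T}) where {M,N,P,T}
    exprs = Expr[]
    for p in 1:P, n in 1:N, m in 1:M
        d = m + (p - 1) * M    # linear (column-major) index into the output D
        a = m + (n - 1) * M    # linear index into A
        x = n + (p - 1) * N    # linear index into X
        if n == 1
            push!(exprs, :(D[$d] = X[$x] * A[$a]))    # first term initializes
        else
            push!(exprs, :(D[$d] += X[$x] * A[$a]))   # later terms accumulate
        end
    end
    quote
        D = MMatrix{$M,$P,$T}(undef)
        @inbounds begin
            $(exprs...)
        end
        SMatrix{$M,$P,$T}(D)
    end
end

With the static matrices from the benchmark above, unrolled_mul(a8_16_s, x16_4_s) ≈ a8_16_s * x16_4_s, and the core of the generated body is exactly the D[...] expansion shown earlier; Julia and LLVM take care of vectorizing it into the broadcast/FMA pattern.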
My tests were done with:
julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, znver1)
Performance will almost certainly vary with other architectures.
To try to scale this up to larger matrices, I went with a recursive memory layout.
These are the linear indices of a 50×50 matrix:
julia> rm = RecursiveMatrix{Float64,50,50,50^2}();
julia> function set_to_ind!(A)
           @inbounds for i = 1:length(A)
               A[i] = i
           end
           A
       end
set_to_ind! (generic function with 1 method)
julia> set_to_ind!(rm)
50×50 TriangularMatrices.RecursiveMatrix{Float64,50,50,2500}:
1.0 9.0 17.0 25.0 33.0 41.0 49.0 57.0 129.0 137.0 145.0 153.0 161.0 169.0 … 1377.0 1385.0 1585.0 1593.0 1601.0 1609.0 1617.0 1625.0 1633.0 1641.0 1649.0 1657.0
2.0 10.0 18.0 26.0 34.0 42.0 50.0 58.0 130.0 138.0 146.0 154.0 162.0 170.0 1378.0 1386.0 1586.0 1594.0 1602.0 1610.0 1618.0 1626.0 1634.0 1642.0 1650.0 1658.0
3.0 11.0 19.0 27.0 35.0 43.0 51.0 59.0 131.0 139.0 147.0 155.0 163.0 171.0 1379.0 1387.0 1587.0 1595.0 1603.0 1611.0 1619.0 1627.0 1635.0 1643.0 1651.0 1659.0
4.0 12.0 20.0 28.0 36.0 44.0 52.0 60.0 132.0 140.0 148.0 156.0 164.0 172.0 1380.0 1388.0 1588.0 1596.0 1604.0 1612.0 1620.0 1628.0 1636.0 1644.0 1652.0 1660.0
5.0 13.0 21.0 29.0 37.0 45.0 53.0 61.0 133.0 141.0 149.0 157.0 165.0 173.0 1381.0 1389.0 1589.0 1597.0 1605.0 1613.0 1621.0 1629.0 1637.0 1645.0 1653.0 1661.0
6.0 14.0 22.0 30.0 38.0 46.0 54.0 62.0 134.0 142.0 150.0 158.0 166.0 174.0 … 1382.0 1390.0 1590.0 1598.0 1606.0 1614.0 1622.0 1630.0 1638.0 1646.0 1654.0 1662.0
7.0 15.0 23.0 31.0 39.0 47.0 55.0 63.0 135.0 143.0 151.0 159.0 167.0 175.0 1383.0 1391.0 1591.0 1599.0 1607.0 1615.0 1623.0 1631.0 1639.0 1647.0 1655.0 1663.0
8.0 16.0 24.0 32.0 40.0 48.0 56.0 64.0 136.0 144.0 152.0 160.0 168.0 176.0 1384.0 1392.0 1592.0 1600.0 1608.0 1616.0 1624.0 1632.0 1640.0 1648.0 1656.0 1664.0
65.0 73.0 81.0 89.0 97.0 105.0 113.0 121.0 193.0 201.0 209.0 217.0 225.0 233.0 1441.0 1449.0 1665.0 1673.0 1681.0 1689.0 1697.0 1705.0 1713.0 1721.0 1729.0 1737.0
66.0 74.0 82.0 90.0 98.0 106.0 114.0 122.0 194.0 202.0 210.0 218.0 226.0 234.0 1442.0 1450.0 1666.0 1674.0 1682.0 1690.0 1698.0 1706.0 1714.0 1722.0 1730.0 1738.0
67.0 75.0 83.0 91.0 99.0 107.0 115.0 123.0 195.0 203.0 211.0 219.0 227.0 235.0 … 1443.0 1451.0 1667.0 1675.0 1683.0 1691.0 1699.0 1707.0 1715.0 1723.0 1731.0 1739.0
68.0 76.0 84.0 92.0 100.0 108.0 116.0 124.0 196.0 204.0 212.0 220.0 228.0 236.0 1444.0 1452.0 1668.0 1676.0 1684.0 1692.0 1700.0 1708.0 1716.0 1724.0 1732.0 1740.0
69.0 77.0 85.0 93.0 101.0 109.0 117.0 125.0 197.0 205.0 213.0 221.0 229.0 237.0 1445.0 1453.0 1669.0 1677.0 1685.0 1693.0 1701.0 1709.0 1717.0 1725.0 1733.0 1741.0
70.0 78.0 86.0 94.0 102.0 110.0 118.0 126.0 198.0 206.0 214.0 222.0 230.0 238.0 1446.0 1454.0 1670.0 1678.0 1686.0 1694.0 1702.0 1710.0 1718.0 1726.0 1734.0 1742.0
71.0 79.0 87.0 95.0 103.0 111.0 119.0 127.0 199.0 207.0 215.0 223.0 231.0 239.0 1447.0 1455.0 1671.0 1679.0 1687.0 1695.0 1703.0 1711.0 1719.0 1727.0 1735.0 1743.0
72.0 80.0 88.0 96.0 104.0 112.0 120.0 128.0 200.0 208.0 216.0 224.0 232.0 240.0 … 1448.0 1456.0 1672.0 1680.0 1688.0 1696.0 1704.0 1712.0 1720.0 1728.0 1736.0 1744.0
257.0 265.0 273.0 281.0 289.0 297.0 305.0 313.0 321.0 329.0 337.0 345.0 353.0 361.0 1569.0 1577.0 1745.0 1753.0 1761.0 1769.0 1777.0 1785.0 1793.0 1801.0 1809.0 1817.0
258.0 266.0 274.0 282.0 290.0 298.0 306.0 314.0 322.0 330.0 338.0 346.0 354.0 362.0 1570.0 1578.0 1746.0 1754.0 1762.0 1770.0 1778.0 1786.0 1794.0 1802.0 1810.0 1818.0
259.0 267.0 275.0 283.0 291.0 299.0 307.0 315.0 323.0 331.0 339.0 347.0 355.0 363.0 1571.0 1579.0 1747.0 1755.0 1763.0 1771.0 1779.0 1787.0 1795.0 1803.0 1811.0 1819.0
260.0 268.0 276.0 284.0 292.0 300.0 308.0 316.0 324.0 332.0 340.0 348.0 356.0 364.0 1572.0 1580.0 1748.0 1756.0 1764.0 1772.0 1780.0 1788.0 1796.0 1804.0 1812.0 1820.0
261.0 269.0 277.0 285.0 293.0 301.0 309.0 317.0 325.0 333.0 341.0 349.0 357.0 365.0 … 1573.0 1581.0 1749.0 1757.0 1765.0 1773.0 1781.0 1789.0 1797.0 1805.0 1813.0 1821.0
⋮ ⋮ ⋮ ⋱ ⋮ ⋮
582.0 590.0 598.0 606.0 614.0 622.0 630.0 638.0 710.0 718.0 726.0 734.0 742.0 750.0 2006.0 2014.0 2246.0 2254.0 2262.0 2270.0 2278.0 2286.0 2294.0 2302.0 2310.0 2318.0
583.0 591.0 599.0 607.0 615.0 623.0 631.0 639.0 711.0 719.0 727.0 735.0 743.0 751.0 … 2007.0 2015.0 2247.0 2255.0 2263.0 2271.0 2279.0 2287.0 2295.0 2303.0 2311.0 2319.0
584.0 592.0 600.0 608.0 616.0 624.0 632.0 640.0 712.0 720.0 728.0 736.0 744.0 752.0 2008.0 2016.0 2248.0 2256.0 2264.0 2272.0 2280.0 2288.0 2296.0 2304.0 2312.0 2320.0
641.0 649.0 657.0 665.0 673.0 681.0 689.0 697.0 769.0 777.0 785.0 793.0 801.0 809.0 2065.0 2073.0 2321.0 2329.0 2337.0 2345.0 2353.0 2361.0 2369.0 2377.0 2385.0 2393.0
642.0 650.0 658.0 666.0 674.0 682.0 690.0 698.0 770.0 778.0 786.0 794.0 802.0 810.0 2066.0 2074.0 2322.0 2330.0 2338.0 2346.0 2354.0 2362.0 2370.0 2378.0 2386.0 2394.0
643.0 651.0 659.0 667.0 675.0 683.0 691.0 699.0 771.0 779.0 787.0 795.0 803.0 811.0 2067.0 2075.0 2323.0 2331.0 2339.0 2347.0 2355.0 2363.0 2371.0 2379.0 2387.0 2395.0
644.0 652.0 660.0 668.0 676.0 684.0 692.0 700.0 772.0 780.0 788.0 796.0 804.0 812.0 … 2068.0 2076.0 2324.0 2332.0 2340.0 2348.0 2356.0 2364.0 2372.0 2380.0 2388.0 2396.0
645.0 653.0 661.0 669.0 677.0 685.0 693.0 701.0 773.0 781.0 789.0 797.0 805.0 813.0 2069.0 2077.0 2325.0 2333.0 2341.0 2349.0 2357.0 2365.0 2373.0 2381.0 2389.0 2397.0
646.0 654.0 662.0 670.0 678.0 686.0 694.0 702.0 774.0 782.0 790.0 798.0 806.0 814.0 2070.0 2078.0 2326.0 2334.0 2342.0 2350.0 2358.0 2366.0 2374.0 2382.0 2390.0 2398.0
647.0 655.0 663.0 671.0 679.0 687.0 695.0 703.0 775.0 783.0 791.0 799.0 807.0 815.0 2071.0 2079.0 2327.0 2335.0 2343.0 2351.0 2359.0 2367.0 2375.0 2383.0 2391.0 2399.0
648.0 656.0 664.0 672.0 680.0 688.0 696.0 704.0 776.0 784.0 792.0 800.0 808.0 816.0 2072.0 2080.0 2328.0 2336.0 2344.0 2352.0 2360.0 2368.0 2376.0 2384.0 2392.0 2400.0
833.0 841.0 849.0 857.0 865.0 873.0 881.0 889.0 913.0 921.0 929.0 937.0 945.0 953.0 … 2209.0 2217.0 2401.0 2409.0 2417.0 2425.0 2433.0 2441.0 2449.0 2457.0 2481.0 2489.0
834.0 842.0 850.0 858.0 866.0 874.0 882.0 890.0 914.0 922.0 930.0 938.0 946.0 954.0 2210.0 2218.0 2402.0 2410.0 2418.0 2426.0 2434.0 2442.0 2450.0 2458.0 2482.0 2490.0
835.0 843.0 851.0 859.0 867.0 875.0 883.0 891.0 915.0 923.0 931.0 939.0 947.0 955.0 2211.0 2219.0 2403.0 2411.0 2419.0 2427.0 2435.0 2443.0 2451.0 2459.0 2483.0 2491.0
836.0 844.0 852.0 860.0 868.0 876.0 884.0 892.0 916.0 924.0 932.0 940.0 948.0 956.0 2212.0 2220.0 2404.0 2412.0 2420.0 2428.0 2436.0 2444.0 2452.0 2460.0 2484.0 2492.0
837.0 845.0 853.0 861.0 869.0 877.0 885.0 893.0 917.0 925.0 933.0 941.0 949.0 957.0 2213.0 2221.0 2405.0 2413.0 2421.0 2429.0 2437.0 2445.0 2453.0 2461.0 2485.0 2493.0
838.0 846.0 854.0 862.0 870.0 878.0 886.0 894.0 918.0 926.0 934.0 942.0 950.0 958.0 … 2214.0 2222.0 2406.0 2414.0 2422.0 2430.0 2438.0 2446.0 2454.0 2462.0 2486.0 2494.0
839.0 847.0 855.0 863.0 871.0 879.0 887.0 895.0 919.0 927.0 935.0 943.0 951.0 959.0 2215.0 2223.0 2407.0 2415.0 2423.0 2431.0 2439.0 2447.0 2455.0 2463.0 2487.0 2495.0
840.0 848.0 856.0 864.0 872.0 880.0 888.0 896.0 920.0 928.0 936.0 944.0 952.0 960.0 2216.0 2224.0 2408.0 2416.0 2424.0 2432.0 2440.0 2448.0 2456.0 2464.0 2488.0 2496.0
897.0 899.0 901.0 903.0 905.0 907.0 909.0 911.0 977.0 979.0 981.0 983.0 985.0 987.0 2237.0 2239.0 2465.0 2467.0 2469.0 2471.0 2473.0 2475.0 2477.0 2479.0 2497.0 2499.0
898.0 900.0 902.0 904.0 906.0 908.0 910.0 912.0 978.0 980.0 982.0 984.0 986.0 988.0 2238.0 2240.0 2466.0 2468.0 2470.0 2472.0 2474.0 2476.0 2478.0 2480.0 2498.0 2500.0
This makes it easier to load blocks of the matrix and apply our kernel.
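For intuition, here is a minimal, hypothetical sketch of one possible recursive layout: split the matrix into quadrants down to 8×8 column-major leaf blocks, storing each quadrant contiguously (top-left, then bottom-left, top-right, bottom-right). The function name recursive_index and the “n is 8·2^k” assumption are mine for illustration; this is not necessarily the exact layout TriangularMatrices.jl uses:

function recursive_index(i, j, n, offset = 0)
    # assumes n = 8 * 2^k; leaf blocks are stored dense, column-major
    n <= 8 && return offset + i + (j - 1) * n
    h = n ÷ 2
    if i <= h && j <= h                               # top-left quadrant
        recursive_index(i, j, h, offset)
    elseif i > h && j <= h                            # bottom-left quadrant
        recursive_index(i - h, j, h, offset + h * h)
    elseif i <= h && j > h                            # top-right quadrant
        recursive_index(i, j - h, h, offset + 2h * h)
    else                                              # bottom-right quadrant
        recursive_index(i - h, j - h, h, offset + 3h * h)
    end
end

Under this toy layout, recursive_index(1, 1, 64) == 1 and recursive_index(9, 1, 64) == 65: each 8×8 block occupies a contiguous run of 64 indices, so the kernel can stream through a block with unit-stride loads.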
Finally, a dump of random benchmarks:
julia> using TriangularMatrices, BenchmarkTools, Compat, Compat.LinearAlgebra
julia> BLAS.set_num_threads(1);
julia> a24 = randmat(24); x24 = randmat(24); d24 = randmat(24);
julia> ma24 = Matrix(a24); mx24 = Matrix(x24); md24 = Matrix(d24);
julia> @benchmark TriangularMatrices.mul!($d24, $a24, $x24)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.034 μs (0.00% GC)
median time: 1.067 μs (0.00% GC)
mean time: 1.072 μs (0.00% GC)
maximum time: 3.270 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark mul!($md24, $ma24, $mx24)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.843 μs (0.00% GC)
median time: 1.890 μs (0.00% GC)
mean time: 1.898 μs (0.00% GC)
maximum time: 3.617 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> a32 = randmat(32); x32 = randmat(32); d32 = randmat(32);
julia> ma32 = Matrix(a32); mx32 = Matrix(x32); md32 = Matrix(d32);
julia> @benchmark TriangularMatrices.mul!($d32, $a32, $x32)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.315 μs (0.00% GC)
median time: 2.393 μs (0.00% GC)
mean time: 2.402 μs (0.00% GC)
maximum time: 6.209 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark mul!($md32, $ma32, $mx32)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.388 μs (0.00% GC)
median time: 4.493 μs (0.00% GC)
mean time: 4.511 μs (0.00% GC)
maximum time: 7.435 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> a64 = randmat(64); x64 = randmat(64); d64 = randmat(64);
julia> ma64 = Matrix(a64); mx64 = Matrix(x64); md64 = Matrix(d64);
julia> @benchmark TriangularMatrices.mul!($d64, $a64, $x64)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 19.236 μs (0.00% GC)
median time: 19.647 μs (0.00% GC)
mean time: 19.791 μs (0.00% GC)
maximum time: 36.639 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($md64, $ma64, $mx64)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 24.466 μs (0.00% GC)
median time: 25.017 μs (0.00% GC)
mean time: 25.129 μs (0.00% GC)
maximum time: 46.377 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> a71 = randmat(71); x71 = randmat(71); d71 = randmat(71); # performance currently drops a lot
julia> ma71 = Matrix(a71); mx71 = Matrix(x71); md71 = Matrix(d71); # when sizes aren't divisible by 8
julia> @benchmark TriangularMatrices.mul!($d71, $a71, $x71)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.815 μs (0.00% GC)
median time: 35.888 μs (0.00% GC)
mean time: 35.997 μs (0.00% GC)
maximum time: 66.545 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($md71, $ma71, $mx71)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 33.643 μs (0.00% GC)
median time: 34.225 μs (0.00% GC)
mean time: 34.370 μs (0.00% GC)
maximum time: 59.883 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> a72 = randmat(72); x72 = randmat(72); d72 = randmat(72);
julia> ma72 = Matrix(a72); mx72 = Matrix(x72); md72 = Matrix(d72);
julia> @benchmark TriangularMatrices.mul!($d72, $a72, $x72)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 27.282 μs (0.00% GC)
median time: 28.073 μs (0.00% GC)
mean time: 28.121 μs (0.00% GC)
maximum time: 48.521 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($md72, $ma72, $mx72)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 30.858 μs (0.00% GC)
median time: 31.550 μs (0.00% GC)
mean time: 31.676 μs (0.00% GC)
maximum time: 56.095 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> a128 = randmat(128); x128 = randmat(128); d128 = randmat(128);
julia> ma128 = Matrix(a128); mx128 = Matrix(x128); md128 = Matrix(d128);
julia> @benchmark TriangularMatrices.mul!($d128, $a128, $x128)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 153.478 μs (0.00% GC)
median time: 157.587 μs (0.00% GC)
mean time: 158.263 μs (0.00% GC)
maximum time: 180.910 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($md128, $ma128, $mx128)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 159.730 μs (0.00% GC)
median time: 162.826 μs (0.00% GC)
mean time: 163.159 μs (0.00% GC)
maximum time: 219.483 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> a256 = randmat(256); x256 = randmat(256); d256 = randmat(256); # can't convert matrices this "large"...
julia> ma256, mx256, md256 = randn(256,256), randn(256,256), randn(256,256);
julia> @benchmark TriangularMatrices.mul!($d256, $a256, $x256)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.241 ms (0.00% GC)
median time: 1.273 ms (0.00% GC)
mean time: 1.276 ms (0.00% GC)
maximum time: 1.406 ms (0.00% GC)
--------------
samples: 3916
evals/sample: 1
julia> @benchmark mul!($md256, $ma256, $mx256)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.180 ms (0.00% GC)
median time: 1.208 ms (0.00% GC)
mean time: 1.212 ms (0.00% GC)
maximum time: 1.376 ms (0.00% GC)
--------------
samples: 4123
evals/sample: 1
Disclaimer: the library needs a LOT of polish.
There are definite issues at the moment, and because the recursive layout doesn’t mix well with column- or row-major matrices, the library would need enough functionality that, for most purposes, someone could work entirely with recursive matrices.
Tests segfault eventually. Printing “large” matrices doesn’t work, nor does converting them into regular matrices; the compiler seems to hang while compiling the getindex function.
Performance drops when matrix sizes aren’t divisible by 8.
No support yet for multiplication with matrices where sqrt(2)*min(size(A)...) > max(size(A)...).
I am taking a break from this, possibly through early August, because I’m running into deadlines (and maybe I’ll write a blog post). FYI, gfortran could not compile the kernel efficiently (its version took in the area of 120 ns!), although Flang, a Fortran front end to LLVM, could; otherwise, it could be worth statically compiling a few kernels.
I thought some members would be interested in seeing this, though, because Julia was able to hold its own against an optimized BLAS library for quite a while. I think that, with some work, it should be possible to get a pure-Julia BLAS well supported.