Multiply dense arrays

mcabbott · October 4, 2020, 6:38pm

My guess is that it multi-threads better: many of the individual slice multiplications aren’t big enough to benefit much from threading, but the total problem is. On my 6-core desktop (without AVX512) I see numbers similar to Chris’s (although with OpenBLAS), while on my 2-core laptop, sequential mul! is fastest from size 150 on.

Edit: I have now tried with MKL, and sequential mul! catches up. So I guess the difference comes from OpenBLAS threading poorly.
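To make the threading guess concrete, here is a rough sketch of the two strategies being contrasted. It is not the benchmark code from this thread: it assumes a plain batched product `C[:, :, k] = A[:, :, k] * B[:, :, k]`, and the function names and sizes are made up for illustration. The first function loops over slices sequentially and leaves any threading to BLAS inside each small `mul!`; the second threads over the slice index with Julia threads, with BLAS set to one thread so the two levels don't compete.

```julia
using LinearAlgebra

# Assumed problem (hypothetical): a batched product over the third index.

# Sequential loop over slices; BLAS decides how to thread each small product.
function mul_slices!(C, A, B)
    for k in axes(A, 3)
        @views mul!(C[:, :, k], A[:, :, k], B[:, :, k])
    end
    return C
end

# Julia threads over the slice index; each thread handles whole slices.
function mul_slices_threaded!(C, A, B)
    Threads.@threads for k in axes(A, 3)
        @views mul!(C[:, :, k], A[:, :, k], B[:, :, k])
    end
    return C
end

dim, nbatch = 50, 64              # illustrative sizes, not those of the benchmark
A = rand(dim, dim, nbatch)
B = rand(dim, dim, nbatch)

C_seq = mul_slices!(similar(A), A, B)

BLAS.set_num_threads(1)           # avoid oversubscription; stays set for the session
C_thr = mul_slices_threaded!(similar(A), A, B)

C_seq ≈ C_thr                     # both strategies should agree
```

Julia needs to be started with several threads (for example `julia -t 6`) for the second version to run in parallel; with one thread it still gives the same result. Which version wins will depend on the slice size, the core count and the BLAS library, as the timings in this post suggest.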

julia> BLAS.set_num_threads(18); BLAS.vendor()
:openblas64

dim = 50
  625.519 μs (74 allocations: 24.64 KiB)
  9.591 ms (4099 allocations: 39.36 MiB)
  2.991 ms (2 allocations: 19.64 KiB)
  2.997 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
  2.253 ms (74 allocations: 83.20 KiB)
  41.021 ms (4099 allocations: 156.43 MiB)
  9.755 ms (2 allocations: 78.20 KiB)
  9.754 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
  5.251 ms (75 allocations: 180.92 KiB)
  363.106 ms (4099 allocations: 351.71 MiB)
  71.963 ms (2 allocations: 175.89 KiB)
  71.158 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
  7.870 ms (76 allocations: 317.64 KiB)
  572.025 ms (4099 allocations: 624.95 MiB)
  72.760 ms (2 allocations: 312.58 KiB)
  72.461 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
  12.850 ms (76 allocations: 493.45 KiB)
  799.994 ms (4099 allocations: 976.41 MiB)
  80.468 ms (2 allocations: 488.39 KiB)
  82.289 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
  45.773 ms (76 allocations: 1.91 MiB)
  3.100 s (4099 allocations: 3.81 GiB)
  118.703 ms (2 allocations: 1.91 MiB)
  120.825 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
  119.873 ms (76 allocations: 4.30 MiB)
  5.366 s (4099 allocations: 8.58 GiB)
  212.210 ms (2 allocations: 4.29 MiB)
  217.976 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

In each block, the first timing is @tullio and the last two are mul! on slices.
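The benchmark code itself isn't shown in this post, so the following is only a sketch of what "mul! on slices" can look like and why the allocation counts can differ so much between approaches. It assumes a hypothetical problem, `out = Σ_k A[:, :, k] * B[:, :, k]`, and the function names are invented here. In Julia, `A[:, :, k]` copies the slice and `*` allocates a new matrix, whereas `@views` with the 5-argument `mul!` accumulates into an existing output without temporaries.

```julia
using LinearAlgebra

# Assumed problem (hypothetical): out = sum over k of A[:, :, k] * B[:, :, k].

# Allocating version: every iteration copies two slices and
# allocates both the product and the new sum.
function sum_products_alloc(A, B)
    out = zeros(eltype(A), size(A, 1), size(B, 2))
    for k in axes(A, 3)
        out += A[:, :, k] * B[:, :, k]
    end
    return out
end

# In-place version: views avoid the slice copies, and the 5-argument
# mul!(C, X, Y, α, β) computes C = X*Y*α + C*β directly into `out`.
function sum_products!(out, A, B)
    fill!(out, 0)
    for k in axes(A, 3)
        @views mul!(out, A[:, :, k], B[:, :, k], true, true)
    end
    return out
end

dim, nslices = 50, 32             # illustrative sizes only
A = rand(dim, dim, nslices)
B = rand(dim, dim, nslices)

out1 = sum_products_alloc(A, B)
out2 = sum_products!(zeros(dim, dim), A, B)

out1 ≈ out2                       # same result, far fewer allocations in the second
```

Wrapping each call in `@time` (or `@btime` from BenchmarkTools, if installed) should show the difference in allocations between the two versions.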

Julia 1.5.2, i7-8700, 6 threads.

The results above are with OpenBLAS; those below are with MKL.

julia> BLAS.set_num_threads(18); BLAS.vendor()
:mkl

dim = 50
  625.765 μs (75 allocations: 24.67 KiB)
  8.207 ms (4099 allocations: 39.36 MiB)
  2.257 ms (2 allocations: 19.64 KiB)
  2.245 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 100
  2.262 ms (74 allocations: 83.20 KiB)
  34.287 ms (4099 allocations: 156.43 MiB)
  4.756 ms (2 allocations: 78.20 KiB)
  4.816 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 150
  5.141 ms (76 allocations: 180.95 KiB)
  152.586 ms (4099 allocations: 351.71 MiB)
  5.740 ms (2 allocations: 175.89 KiB)
  5.727 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 200
  7.957 ms (76 allocations: 317.64 KiB)
  254.642 ms (4099 allocations: 624.95 MiB)
  8.018 ms (2 allocations: 312.58 KiB)
  8.038 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 250
  12.834 ms (76 allocations: 493.45 KiB)
  384.841 ms (4099 allocations: 976.41 MiB)
  10.891 ms (2 allocations: 488.39 KiB)
  10.981 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 500
  47.541 ms (76 allocations: 1.91 MiB)
  1.857 s (4099 allocations: 3.81 GiB)
  39.667 ms (2 allocations: 1.91 MiB)
  39.889 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 750
  120.173 ms (76 allocations: 4.30 MiB)
  3.254 s (4099 allocations: 8.58 GiB)
  71.169 ms (2 allocations: 4.29 MiB)
  70.832 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true

dim = 1000
  186.777 ms (77 allocations: 7.63 MiB)
  5.054 s (4099 allocations: 15.25 GiB)
  121.888 ms (2 allocations: 7.63 MiB)
  122.270 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
