My guess is that it multi-threads better: many of the individual slice multiplications aren't big enough to benefit much from threading, but the total problem is. On my 6-core desktop (without AVX512) I see similar numbers to Chris's (although with OpenBLAS), while on my 2-core laptop, sequential mul! is fastest from size 150 on.
Edit: I have now tried with MKL, and sequential mul! catches up. So I guess the gap comes down to OpenBLAS being bad at threads.
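For anyone following along, here is a rough sketch of what "sequential mul! on slices" could look like, next to a variant that threads over the slices instead of inside each multiplication. It is not the benchmark code from this thread: the batched product `C[:, :, k] = A[:, :, k] * B[:, :, k]`, the function names `slices_sequential!` and `slices_threaded!`, and the sizes are all assumptions made for illustration.

```julia
using LinearAlgebra

# Sequential over slices: one BLAS call per slice, so any multi-threading
# has to happen inside each (possibly small) multiplication.
function slices_sequential!(C, A, B)
    for k in axes(C, 3)
        @views mul!(C[:, :, k], A[:, :, k], B[:, :, k])
    end
    return C
end

# Threaded over slices: each Julia thread handles whole slices, so the
# parallelism comes from the total problem rather than from one small matmul.
function slices_threaded!(C, A, B)
    Threads.@threads for k in axes(C, 3)
        @views mul!(C[:, :, k], A[:, :, k], B[:, :, k])
    end
    return C
end

dim, nslices = 50, 64            # hypothetical sizes, not the ones benchmarked here
A = rand(dim, dim, nslices)
B = rand(dim, dim, nslices)
C1 = similar(A)
C2 = similar(A)

slices_sequential!(C1, A, B)
slices_threaded!(C2, A, B)
```

When trying the threaded variant, it is probably worth calling `BLAS.set_num_threads(1)` first, so that Julia's threads and the BLAS threads don't compete for the same cores. Julia itself needs to be started with more than one thread (for example via `JULIA_NUM_THREADS`) for the second function to run in parallel.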
julia> BLAS.set_num_threads(18); BLAS.vendor()
:openblas64
dim = 50
625.519 μs (74 allocations: 24.64 KiB)
9.591 ms (4099 allocations: 39.36 MiB)
2.991 ms (2 allocations: 19.64 KiB)
2.997 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 100
2.253 ms (74 allocations: 83.20 KiB)
41.021 ms (4099 allocations: 156.43 MiB)
9.755 ms (2 allocations: 78.20 KiB)
9.754 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 150
5.251 ms (75 allocations: 180.92 KiB)
363.106 ms (4099 allocations: 351.71 MiB)
71.963 ms (2 allocations: 175.89 KiB)
71.158 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 200
7.870 ms (76 allocations: 317.64 KiB)
572.025 ms (4099 allocations: 624.95 MiB)
72.760 ms (2 allocations: 312.58 KiB)
72.461 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 250
12.850 ms (76 allocations: 493.45 KiB)
799.994 ms (4099 allocations: 976.41 MiB)
80.468 ms (2 allocations: 488.39 KiB)
82.289 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 500
45.773 ms (76 allocations: 1.91 MiB)
3.100 s (4099 allocations: 3.81 GiB)
118.703 ms (2 allocations: 1.91 MiB)
120.825 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 750
119.873 ms (76 allocations: 4.30 MiB)
5.366 s (4099 allocations: 8.58 GiB)
212.210 ms (2 allocations: 4.29 MiB)
217.976 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
In each block, the first timing is @tullio and the last two are mul! on slices.
Julia 1.5.2, i7-8700, 6 threads.
The results above use OpenBLAS; those below use MKL.
julia> BLAS.set_num_threads(18); BLAS.vendor()
:mkl
dim = 50
625.765 μs (75 allocations: 24.67 KiB)
8.207 ms (4099 allocations: 39.36 MiB)
2.257 ms (2 allocations: 19.64 KiB)
2.245 ms (2 allocations: 19.64 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 100
2.262 ms (74 allocations: 83.20 KiB)
34.287 ms (4099 allocations: 156.43 MiB)
4.756 ms (2 allocations: 78.20 KiB)
4.816 ms (2 allocations: 78.20 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 150
5.141 ms (76 allocations: 180.95 KiB)
152.586 ms (4099 allocations: 351.71 MiB)
5.740 ms (2 allocations: 175.89 KiB)
5.727 ms (2 allocations: 175.89 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 200
7.957 ms (76 allocations: 317.64 KiB)
254.642 ms (4099 allocations: 624.95 MiB)
8.018 ms (2 allocations: 312.58 KiB)
8.038 ms (2 allocations: 312.58 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 250
12.834 ms (76 allocations: 493.45 KiB)
384.841 ms (4099 allocations: 976.41 MiB)
10.891 ms (2 allocations: 488.39 KiB)
10.981 ms (2 allocations: 488.39 KiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 500
47.541 ms (76 allocations: 1.91 MiB)
1.857 s (4099 allocations: 3.81 GiB)
39.667 ms (2 allocations: 1.91 MiB)
39.889 ms (2 allocations: 1.91 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 750
120.173 ms (76 allocations: 4.30 MiB)
3.254 s (4099 allocations: 8.58 GiB)
71.169 ms (2 allocations: 4.29 MiB)
70.832 ms (2 allocations: 4.29 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true
dim = 1000
186.777 ms (77 allocations: 7.63 MiB)
5.054 s (4099 allocations: 15.25 GiB)
121.888 ms (2 allocations: 7.63 MiB)
122.270 ms (2 allocations: 7.63 MiB)
out1 ≈ out2 ≈ out3 ≈ out4 = true