My first example had multithreading disabled, yet it was still much faster. Re-enabling it made things faster still at larger sizes.
I’m on 0.7, and OpenBLAS now seems smarter about staying single-threaded for smaller matrix sizes. E.g., as far as I could tell it ran single-threaded here, regardless of what BLAS.set_num_threads was set to.
julia> BLAS.set_num_threads(1)
julia> gradN = rand(60, 40);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
11.014 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.425 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.289 μs (0 allocations: 0 bytes)
julia> BLAS.set_num_threads(10)
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
10.994 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.420 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.277 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
37.251 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
52.913 μs (0 allocations: 0 bytes)
Times were the same, and I never saw CPU usage exceed 100%.
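One caveat when comparing the syrk! and gemm! timings above: with 'U', syrk! only writes the upper triangle of Ke, while gemm! fills the whole matrix. Here is a minimal sketch to check that the two agree on the part that matters. The value of mult is a placeholder I picked for illustration, and the names Ke_gemm and Ke_syrk are mine, not from the benchmark session.

```julia
using LinearAlgebra

gradN = rand(60, 40)
mult = 2.0                      # placeholder scalar, assumed for this sketch
n = size(gradN, 1)

# Full product: Ke_gemm = mult * gradN * gradN'
Ke_gemm = zeros(n, n)
BLAS.gemm!('N', 'T', mult, gradN, gradN, 0.0, Ke_gemm)

# Symmetric rank-k update: only the upper triangle ('U') is written
Ke_syrk = zeros(n, n)
BLAS.syrk!('U', 'N', mult, gradN, 0.0, Ke_syrk)

# The upper triangles should match
@assert triu(Ke_syrk) ≈ triu(Ke_gemm)

# Symmetric(..., :U) reads only the upper triangle, so it recovers the full matrix
Ke_full = Matrix(Symmetric(Ke_syrk, :U))
@assert Ke_full ≈ Ke_gemm
```

If downstream code only ever reads the upper triangle (or wraps Ke in Symmetric), the syrk! version should be a drop-in replacement; otherwise the lower triangle needs to be mirrored first.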
Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.
BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and are willing to use StaticArrays:
julia> using StaticArrays
julia> gradN = @MMatrix rand(3, 2);
julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
5.714 ns (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
8.696 ns (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
8.697 ns (0 allocations: 0 bytes)
MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (e.g., @MMatrix randn(m, n)) aren’t optimized for big matrices, so lots of things would unexpectedly be painfully slow.