@inbounds: is the compiler now so smart that this is no longer necessary?
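
One way to check for a given kernel (a hypothetical reduction below, not the code from this thread) is to benchmark a bounds-checked and an @inbounds variant side by side; when iterating with eachindex, the compiler can often prove the accesses are in bounds already:

using BenchmarkTools

# Hypothetical kernel for illustration only.
function sum_checked(A)
    s = zero(eltype(A))
    for i in eachindex(A)
        s += A[i]
    end
    return s
end

function sum_inbounds(A)
    s = zero(eltype(A))
    @inbounds for i in eachindex(A)
        s += A[i]
    end
    return s
end

A = rand(60, 40);
@btime sum_checked($A);
@btime sum_inbounds($A);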

My first example was run with multithreading disabled, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I'm on 0.7, and OpenBLAS now seems smarter about staying single-threaded for smaller matrix sizes. E.g., as far as I could tell it ran single-threaded here regardless of what BLAS.set_num_threads was set to.

julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.289 μs (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.277 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  37.251 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  52.913 μs (0 allocations: 0 bytes)

The timings were essentially the same with 1 or 10 BLAS threads, and I never saw CPU usage exceed 100%.
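
For context, here is a minimal sketch of what a hand-written upper-triangle kernel like add_mggt_ut_only! might look like (the actual FinEtools implementation may differ):

# Sketch only; the real implementation may differ. Accumulates
# mult * gradN * gradN' into the upper triangle of Ke, leaving the lower
# triangle untouched (the same quantity the syrk! call above fills, except
# added to Ke rather than overwriting it).
function add_mggt_ut_only_sketch!(Ke, gradN, mult)
    Kedim, nne = size(gradN)
    @inbounds for k in 1:nne            # columns of gradN
        for j in 1:Kedim                # column of Ke
            gj = gradN[j, k]
            for i in 1:j                # upper-triangle rows only
                Ke[i, j] += mult * gradN[i, k] * gj
            end
        end
    end
    return Ke
end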

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.
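
If you want to confirm which kernel it picked, the (unexported) OpenBLAS config string reports the detected core; as far as I know this is available on the OpenBLAS builds shipped with Julia:

using LinearAlgebra

# Unexported helper on OpenBLAS-backed builds; the returned string includes
# the core OpenBLAS detected (e.g., Haswell) and the compiled MAX_THREADS.
println(LinearAlgebra.BLAS.openblas_get_config())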

BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:

julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  5.714 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  8.696 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  8.697 ns (0 allocations: 0 bytes)

MMatrix also falls back to BLAS calls at larger sizes, but many of the other StaticArrays functions (e.g., @MMatrix randn(m, n)) aren't optimized for big matrices, so various operations would unexpectedly be painfully slow.
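
If the sizes are small and fixed, an immutable SMatrix is another option (a sketch only; mult stands in for whatever scaling factor is used earlier in the thread):

using StaticArrays

# Sketch: immutable SMatrix version of the same update. For a 3x2 gradN the
# product gradN * gradN' is fully unrolled at compile time and nothing is
# heap-allocated.
gradN = @SMatrix rand(3, 2)
mult = 2.0                       # placeholder value; mult is defined earlier in the thread
Ke = mult * (gradN * gradN')     # 3x3 SMatrix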
