@inbounds: is the compiler now so smart that this is no longer necessary?

Elrod · July 14, 2018, 3:02pm

My first example ran with multithreading disabled, yet it was still much faster. When I re-enabled it, it got faster still at larger sizes.
I’m on Julia 0.7, and OpenBLAS now seems smarter about staying single-threaded for smaller matrix sizes. E.g., as far as I could tell it ran single-threaded here, regardless of what BLAS.set_num_threads was set to:

julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.289 μs (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.277 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  37.251 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  52.913 μs (0 allocations: 0 bytes)

Times for the BLAS calls were the same with either thread setting, and I never saw CPU usage exceed 100%.
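For anyone who wants to sanity-check that the three BLAS-backed variants above compute the same thing, here is a minimal sketch. The value `mult = 2.0` is an assumption (the session above does not show how `mult` was defined), and the names `Kref`, `Kmul`, `Kgemm`, and `Ksyrk` are only for illustration. Note that `syrk!` with `'U'` only writes the upper triangle, which is why it is the fair comparison for the `_ut_only` functions.

```julia
using LinearAlgebra

gradN = rand(60, 40)
mult = 2.0  # assumed value; not shown in the session above
n = size(gradN, 1)

# Reference: the full scaled product, computed with plain allocating operations.
Kref = mult .* (gradN * gradN')

# Variant 1: mul! followed by an in-place broadcasted scaling.
Kmul = fill(0.0, n, n)
Kmul .= mult .* mul!(Kmul, gradN, gradN')

# Variant 2: gemm! computes Kgemm = mult * gradN * gradN' + 0.0 * Kgemm (full matrix).
Kgemm = fill(0.0, n, n)
BLAS.gemm!('N', 'T', mult, gradN, gradN, 0.0, Kgemm)

# Variant 3: syrk! with 'U' updates only the upper triangle of Ksyrk.
Ksyrk = fill(0.0, n, n)
BLAS.syrk!('U', 'N', mult, gradN, 0.0, Ksyrk)

@assert Kmul ≈ Kref
@assert Kgemm ≈ Kref
@assert triu(Ksyrk) ≈ triu(Kref)      # upper triangles agree
@assert Symmetric(Ksyrk, :U) ≈ Kref   # mirror the upper triangle to get the full matrix
```

If the full symmetric matrix is needed after `syrk!`, wrapping the result in `Symmetric(Ksyrk, :U)` avoids copying the upper triangle into the lower one by hand.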

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.

BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:

julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  5.714 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  8.696 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  8.697 ns (0 allocations: 0 bytes)

MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (e.g., @MMatrix randn(m, n)) aren’t optimized for big matrices, so lots of things would unexpectedly be painfully slow.
