@inbounds: is the compiler now so smart that this is no longer necessary?


#1

In tests of FinEtools, the code with ALL @inbounds removed is faster than the code with them.


#2

Apparently, it depends. For large arrays, it appears to make no difference, but for smaller ones it does. Try the following:

using LinearAlgebra
using BenchmarkTools

function norm2(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm2bis(x)
    s = 0.0
    n = length(x)
    for i in 1:n
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm3(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm4(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

a = randn(10000)

@btime norm($a)
@btime norm2($a)
@btime norm2bis($a)
@btime norm3($a)
@btime norm4($a)

I get

  3.004 μs (0 allocations: 0 bytes)
  11.807 μs (0 allocations: 0 bytes)
  11.802 μs (0 allocations: 0 bytes)
  11.150 μs (0 allocations: 0 bytes)
  710.554 ns (0 allocations: 0 bytes)

so no performance penalty. OTOH for size 100, there’s a ~30% penalty for the non-inbounds version. Also the SIMD version is scandalously fast (possibly some fastmath magic there). That’s on the official Linux binaries of the beta with -O3 (non-O3 results are similar). Haven’t played with the system image, though.
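The small-size effect is easy to check directly. A minimal sketch using only Base (sq_checked/sq_unchecked are made-up names here; @btime from BenchmarkTools, as above, gives steadier numbers than Base’s @time):

```julia
# Same reduction with and without bounds checks
function sq_checked(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function sq_unchecked(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

b = randn(100)
@time sq_checked(b)    # bounds checks in the loop
@time sq_unchecked(b)  # bounds checks elided
```

Swapping in different lengths shows the relative penalty shrinking as the array grows.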


#3

What a delightful phrase!


#4

For @antoine-levitt’s test, I get the following on v0.7 for n==10000 (no -O3):

3.5 μs
11.2 μs
11.2 μs
8.4 μs
2.9 μs

On v0.6, I get

3.5 μs
8.4 μs
8.4 μs
8.7 μs
2.9 μs

So, I’m seeing a regression in v0.7 on norm2 and norm2bis. I see the same ratios with n == 100.


#5
julia> @btime norm($a)
  2.700 μs (0 allocations: 0 bytes)
100.08555723165735

julia> @btime norm2($a)
  8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm2bis($a)
  8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm3($a)
  8.901 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm4($a)
  407.545 ns (0 allocations: 0 bytes)
100.08555723165735

Re “scandalously fast”, for comparison

julia> z = similar(a);

julia> @btime copyto!($z, $a);
  1.376 μs (0 allocations: 0 bytes)

possibly some fastmath magic there

Here is part of @code_native norm4(a) on a test computer:

; Location: array.jl:705
L80: # move data from vector onto CPU registers 4, 5, 6, and 7
	vmovupd	-192(%rsi), %zmm4
	vmovupd	-128(%rsi), %zmm5
	vmovupd	-64(%rsi), %zmm6
	vmovupd	(%rsi), %zmm7
;}}
; Function macro expansion; {
; Location: float.jl:395 # square and add 4,5,6, and 7 to 0,1,2, and 3.
	vfmadd231pd	%zmm4, %zmm4, %zmm0
	vfmadd231pd	%zmm5, %zmm5, %zmm1
	vfmadd231pd	%zmm6, %zmm6, %zmm2
	vfmadd231pd	%zmm7, %zmm7, %zmm3
;}

Basically, it unrolls the loop by a factor of 4 × the SIMD length. On each iteration of the loop, it loads 4 SIMD vectors, multiplies each by itself, and adds the results to four separate accumulation vectors.
Once the loop ends, it sums those vectors.
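In plain Julia, the transformation can be sketched by hand like this (an illustration only; norm4_unrolled is a made-up name, the real compiled code works on 8-wide registers rather than scalars, and the remainder handling is simplified):

```julia
# Four independent accumulators, updated 4-per-iteration,
# mirroring what the compiler does with zmm0–zmm3 above
function norm4_unrolled(x)
    s1 = s2 = s3 = s4 = 0.0
    n = length(x)
    i = 1
    @inbounds while i + 3 <= n
        s1 += x[i]   * x[i]
        s2 += x[i+1] * x[i+1]
        s3 += x[i+2] * x[i+2]
        s4 += x[i+3] * x[i+3]
        i += 4
    end
    @inbounds while i <= n   # pick up the leftovers
        s1 += x[i] * x[i]
        i += 1
    end
    sqrt((s1 + s2) + (s3 + s4))   # final horizontal reduction
end
```

Breaking the dependency chain across four accumulators is what lets the FMA instructions overlap instead of each waiting on the previous one.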

Check out this post by Keno:
https://juliacomputing.com/blog/2017/09/27/auto-vectorization-in-julia.html

EDIT:
Using a different computer:

julia> @btime norm($a)
  4.833 μs (0 allocations: 0 bytes)
100.37241645291456

julia> @btime norm2($a)
  7.246 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm2bis($a)
  7.204 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm3($a)
  7.201 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm4($a)
  766.965 ns (0 allocations: 0 bytes)
100.37241645291456

#6

FWIW, running julia --math-mode=fast yields @simd-like results when @inbounds is present, and does not improve things if not.

It does not seem easy to devise a policy for writing generic code that gets fast results while preserving reproducibility. Julia Base and the stdlib seem inconsistent in the annotations they use (sometimes @inbounds, sometimes @simd, sometimes both). A sensible default is to write @inbounds in critical loops, and to use --math-mode=fast for fast performance. It’s a shame because those annotations are pretty ugly (and in particular destroy the nice “for at the beginning of a block” visual cue we’ve all been getting used to). Or just don’t use any annotations, and run with --check-bounds=no.

However, it’s not clear to me whether --check-bounds and --math-mode apply to all code, or only to non-precompiled user code.

Some official guidelines (eg in “Performance tips”) would be very helpful here. Barring that, the ecosystem just uses ad-hoc conventions, meaning that in the default mode (and even with --check-bounds and --math-mode) you can be sure neither that your results will be the same from machine to machine, nor that you are getting optimal performance.


#7

Why isn’t copyto! using SIMD?


#8

Even if it is, it still has to write to memory, which is slower than reading from the cache.
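One way to see the cost of the writes is to compare a read-only reduction against a copy loop over the same data (a sketch; readsum and copyloop2! are made-up names, and Base’s @time is used to keep it dependency-free, though @btime gives steadier numbers):

```julia
# Read-only traversal vs. a loop that also writes
function readsum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end
function copyloop2!(y, x)
    @inbounds @simd for i in eachindex(x)
        y[i] = x[i]
    end
    y
end

a = randn(10_000)
z = similar(a)
@time readsum(a)       # reads only
@time copyloop2!(z, a) # reads a, writes z
```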


#9

Here is a little experiment:

module moinbounds13
using FinEtools
using FinEtools.MatrixUtilityModule: add_mggt_ut_only!
using LinearAlgebra: Transpose, mul!
using BenchmarkTools
function add_mggt_ut_only_wo!(Ke::FFltMat, gradN::FFltMat, mult::FFlt)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx::FInt = 0
    nx::FInt = 0
    px::FInt = 0
    for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    return true
end
function test()
    gradN = rand(3, 2)
    Ke = fill(0.0, size(gradN, 1), size(gradN, 1))
    @btime add_mggt_ut_only!($Ke, $gradN, 1.0)
    @btime 1.0*($gradN*Transpose($gradN))
    @btime 1.0*mul!($Ke, $gradN, Transpose($gradN))
    @btime add_mggt_ut_only_wo!($Ke, $gradN, 1.0)
    true
end
end
using .moinbounds13
moinbounds13.test()

On my Surface Pro 4 it gives:

julia> include("test/test_inbounds.jl")
WARNING: replacing module moinbounds13.
  28.141 ns (0 allocations: 0 bytes)
  307.359 ns (2 allocations: 320 bytes)
  283.688 ns (1 allocation: 160 bytes)
  23.092 ns (0 allocations: 0 bytes)
true

So, turning off @inbounds (add_mggt_ut_only_wo!) results in code that is 18% faster than the version with @inbounds on all three loops (add_mggt_ut_only!).

The performance of mul! is surprising. (Perhaps I’m not using it right?)


#10

Sorry, the comparison in the above example is not quite fair: `add_mggt_ut_only*!` only computes the upper triangle. Completing the lower triangle takes an additional 9 ns.
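That completion step is just a mirror of the upper triangle; a sketch (complete_lt! is a made-up helper name):

```julia
# Copy the upper triangle of Ke into its lower triangle, in place
function complete_lt!(Ke)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    @inbounds for nx = 1:Kedim
        for mx = nx+1:Kedim
            Ke[mx, nx] = Ke[nx, mx]
        end
    end
    return Ke
end
```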


#11

The performance of mul! is surprising. (Perhaps I’m not using it right?)

By doing 1*mul!, you’re allocating a whole new matrix.
You could do the following (aside from just dropping the multiplication by 1; I’m assuming there are times you really do want to scale it):

Ke .= 1.0 .* mul!(Ke, gradN, gradN') # or Transpose(gradN), but they're equivalent for real matrices

Or, if you don’t mind using the lower level BLAS interface, you could try either of the following

BLAS.syrk!('U', 'N', 1.0, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', 1.0, gradN, gradN, 0.0, Ke)

However, there is massive overhead on calling BLAS. With large matrices, that overhead gets amortized.
Using StaticArrays (MMatrices if you want them mutable) will be much faster for small sizes.

Anyway, the reason your function is much faster without @inbounds comes down to the fact that you need @inbounds for SIMD instructions: for these tiny matrices, the vectorized code that @inbounds enables never actually runs (more on that below).
On a test computer, here are the machine instructions when you aren’t using @inbounds:

; Function *; {
; Location: float.jl:399
	vmulsd	(%rbx,%rsi,8), %xmm0, %xmm1
	vmulsd	(%r12,%rdi,8), %xmm1, %xmm1
;}}
; Function +; {
; Location: float.jl:395
	vaddsd	(%rdx,%rsi,8), %xmm1, %xmm1
;}
; Function setindex!; {
; Location: array.jl:745
	vmovsd	%xmm1, (%rdx,%rsi,8)
;}

“xmm” registers are 16 bytes (128 bits), but the “sd” (scalar double) suffix means these instructions only operate on 8 bytes, a single double, at a time.
When you do add @inbounds, even though you didn’t add @simd, LLVM will automatically vectorize it for you. It does include a chunk of code exactly like the above to pick up “leftovers”, but it has a bunch of other code too, including:

; Function *; {
; Location: float.jl:399
	vmulpd	-192(%r14,%rbx,8), %zmm1, %zmm3
	vmulpd	-128(%r14,%rbx,8), %zmm1, %zmm4
	vmulpd	-64(%r14,%rbx,8), %zmm1, %zmm5
	vmulpd	(%r14,%rbx,8), %zmm1, %zmm6
	vmulpd	%zmm2, %zmm3, %zmm3
	vmulpd	%zmm2, %zmm4, %zmm4
	vmulpd	%zmm2, %zmm5, %zmm5
	vmulpd	%zmm2, %zmm6, %zmm2
;}}
; Function +; {
; Location: float.jl:395
	vaddpd	-192(%rcx,%rbx,8), %zmm3, %zmm3
	vaddpd	-128(%rcx,%rbx,8), %zmm4, %zmm4
	vaddpd	-64(%rcx,%rbx,8), %zmm5, %zmm5
	vaddpd	(%rcx,%rbx,8), %zmm2, %zmm2
;}
; Function setindex!; {
; Location: array.jl:745
	vmovupd	%zmm3, -192(%rcx,%rbx,8)
	vmovupd	%zmm4, -128(%rcx,%rbx,8)
	vmovupd	%zmm5, -64(%rcx,%rbx,8)
	vmovupd	%zmm2, (%rcx,%rbx,8)
	addq	$32, %rbx
	cmpq	%rbx, %r9
	jne	L608
;}}

The zmm registers are 512 bits – they hold 8 doubles each. This does the same thing as the scalar version, but each zmm register holds 8 doubles, and it processes 4 of them per iteration – meaning you need at least 32 doubles. On computers with avx2, the registers would be 256 bits == 4 doubles, for a total of at least 16 doubles to load, multiply, and accumulate per iteration. (Because instructions can overlap, it’s faster to work on four registers at a time than to wait until one is done before starting the next. Hence 4 per go.)

Obviously, these matrices are way too small for the above – meaning the entire calculation is just the leftovers, which execute the same way in both versions, with and without @inbounds. While I see a few lines with

     movabsq	$jl_bounds_error_ints, %rax

only in the version without @inbounds, I guess those bounds checks don’t take as long to deal with as all that SIMD code that doesn’t actually get used.

One advantage of using statically sized arrays is that the compiler knows exactly how long the calculations will be – meaning it’ll only insert that SIMD code if it will actually get run.

Why isn’t copyto! using SIMD?
Even if it is, it still has to write in memory, which is slower than reading the cache.

Yeah, but adding @simd actually did help a bit.

julia> function copyloop!(y, x)
           @inbounds @simd for i in eachindex(x)
               y[i] = x[i]
           end
       end

julia> @benchmark copyloop!($z, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.368 μs (0.00% GC)
  median time:      1.379 μs (0.00% GC)
  mean time:        1.398 μs (0.00% GC)
  maximum time:     3.948 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark copyto!($z, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.545 μs (0.00% GC)
  median time:      1.842 μs (0.00% GC)
  mean time:        1.838 μs (0.00% GC)
  maximum time:     3.834 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

#12

The top four curves (slower) represent matrix multiplication and the use of mul!, both with .* and without. The bottom two curves (faster) are explicit loops, and indeed, as @Elrod pointed out, with larger matrices the loops with @inbounds have a chance of being faster than those without.

If one were to expect the BLAS overhead to get amortized at larger matrix sizes, it is taking quite a while to get there: even at 225 rows and columns of the resulting matrix, the explicit loops are still twice as fast as BLAS.


#13

Did you try syrk! and gemm!?
The following are all equivalent (so long as typeof(a) == eltype(gradN) == eltype(Ke)).

Ke .= a .* mul!(Ke, gradN, gradN') 
BLAS.syrk!('U', 'N', a, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', a, gradN, gradN, 0.0, Ke)

The advantage here is that neither syrk! nor gemm! loops over the data a second time to multiply by a.

Test functions:

using LinearAlgebra, BenchmarkTools
gradN = rand(3, 2);
Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
mult = 2.3;
function add_mggt_ut_only_wo!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx = 0
    nx = 0
    px = 0
    for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    Ke
end
function add_mggt_ut_only!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx = 0
    nx = 0
    px = 0
    @inbounds for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    Ke
end

Running the benchmarks with small matrices:

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN')
  334.450 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059   1.51745  0.629522
 1.51745   3.63988  1.33395 
 0.629522  1.33395  0.56814 

julia> fill!(Ke, 0); # Just to confirm the result is the same each time

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke)
  299.895 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059  1.51745  0.629522
 0.0      3.63988  1.33395 
 0.0      0.0      0.56814 

julia> fill!(Ke, 0);

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke)
  147.368 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059   1.51745  0.629522
 1.51745   3.63988  1.33395 
 0.629522  1.33395  0.56814 

julia> fill!(Ke, 0);

julia> add_mggt_ut_only!(Ke, gradN, mult); Ke
3×3 Array{Float64,2}:
 0.70059  1.51745  0.629522
 0.0      3.63988  1.33395 
 0.0      0.0      0.56814 

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  21.483 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  17.165 ns (0 allocations: 0 bytes)

With 250x200, and OpenBLAS

julia> BLAS.set_num_threads(1) # BLAS easily wins with multithreading

julia> gradN = rand(250, 200);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  402.399 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  344.340 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  528.487 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  3.066 ms (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  4.401 ms (0 allocations: 0 bytes)

and MKL:

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  204.780 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  151.862 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  256.089 μs (0 allocations: 0 bytes)

With threading, gemm! is a lot faster than mul!. I thought mul! was just a wrapper for gemm!, so I’m not sure what is going on. MKL:

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  123.680 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  29.612 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  43.061 μs (0 allocations: 0 bytes)

I would have guessed that mul! is now slower because the broadcast part isn’t multithreaded, but…

julia> @btime mul!($Ke, $gradN, $gradN');
  80.795 μs (0 allocations: 0 bytes)

Also, even at size 50 I’m seeing dramatically better performance from BLAS, even with OpenBLAS (which is slower than MKL at small/modest matrix sizes):

julia> gradN = rand(50, 35);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  7.763 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  5.970 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  6.448 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  23.383 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  32.695 μs (0 allocations: 0 bytes)

Could something be wrong with your BLAS install? Maybe it’s using generic fallback kernels instead of ones optimized for your processor?
If you’re on Julia 0.6.2 or older with a fairly new processor, that’s fairly likely. Julia 0.6.2’s OpenBLAS didn’t support Ryzen, for example.
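A rough first check of which BLAS actually got loaded (a sketch using only the stdlib; exact output format varies between Julia versions):

```julia
using Libdl
using InteractiveUtils

# Which BLAS shared library did this session load?
for lib in filter(p -> occursin("blas", lowercase(basename(p))), Libdl.dllist())
    println(lib)
end

# Verbose versioninfo prints the detected CPU; on some versions
# it also shows the BLAS build information
versioninfo(verbose=true)
```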


#14

This is my configuration:
Julia Version 0.7.0-beta.279
Commit b0f531e5f0* (2018-07-12 15:33 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel® Core™ i7-6650U CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4

I have to admit that I haven’t bothered building Julia for my particular setup. It is the generic Windows installation.


#15

Did you disable multithreading inside the BLAS routines? I noticed that, at least on 0.6, there is a significant overhead if BLAS tries to multithread.


#16

No, I haven’t tried disabling multithreading.


#17

My first example was with multithreading disabled, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I’m on 0.7, and OpenBLAS now seems smarter about staying single-threaded for smaller matrix sizes. Eg, as far as I could tell it ran single-threaded here, regardless of what BLAS.set_num_threads was set to.

julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.289 μs (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.277 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  37.251 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  52.913 μs (0 allocations: 0 bytes)

Times were the same, and I never saw CPU usage exceed 100%.

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.

BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:

julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  5.714 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  8.696 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  8.697 ns (0 allocations: 0 bytes)

MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (eg, @MMatrix randn(m, n) ) aren’t optimized for big matrices, so lots of things would randomly be painfully slow.


#18

This includes BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke), which eventually becomes the fastest method. (In my application gradN is not likely to be bigger than 64x3.)


#19

Ah, okay. I think the skinniness of your matrix is part of the problem.
I would’ve called a 14x14 matrix “small”, and that has more elements than the 64x3. (Although X * X’ would be much larger for the latter).

3 may also be an awkward width. Eg, on my computer, BLAS takes about as long for 64x3 as it does for 64x4, while the loop versions are of course about 33% faster for the former:

julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  2.933 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.800 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  3.731 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  4.930 μs (0 allocations: 0 bytes)

julia> gradN = rand(64, 4);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  3.100 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.981 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.935 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  6.628 μs (0 allocations: 0 bytes)

#20

Very interesting data. Thanks! And thanks for your suggestions. A lot of food for thought…