@inbounds: is the compiler now so smart that this is no longer necessary?


#1

In tests of FinEtools, the code with ALL @inbounds removed is faster than the code with them.


#2

Apparently, it depends. For large arrays, it appears to make no difference, but for smaller ones it does. Try the following:

using LinearAlgebra
using BenchmarkTools

function norm2(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm2bis(x)
    s = 0.0
    n = length(x)
    for i in 1:n
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm3(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm4(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

a = randn(10000)

@btime norm($a)
@btime norm2($a)
@btime norm2bis($a)
@btime norm3($a)
@btime norm4($a)

I get

  3.004 μs (0 allocations: 0 bytes)
  11.807 μs (0 allocations: 0 bytes)
  11.802 μs (0 allocations: 0 bytes)
  11.150 μs (0 allocations: 0 bytes)
  710.554 ns (0 allocations: 0 bytes)

so no performance penalty. OTOH for size 100, there’s a ~30% penalty for the non-inbounds version. Also the SIMD version is scandalously fast (possibly some fastmath magic there). That’s on the official Linux binaries of the beta with -O3 (non-O3 results are similar). Haven’t played with the system image, though.
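The small-size effect is easy to check directly. A minimal sketch using only Base (sq_checked/sq_unchecked are made-up names here; @btime from BenchmarkTools, as above, gives steadier numbers than Base’s @time):

```julia
# Same reduction with and without bounds checks
function sq_checked(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function sq_unchecked(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

b = randn(100)
@time sq_checked(b)    # bounds checks in the loop
@time sq_unchecked(b)  # bounds checks elided
```

Swapping in different lengths shows the relative penalty shrinking as the array grows.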


#3

What a delightful phrase!


#4

For @antoine-levitt’s test, I get the following on v0.7 for n==10000 (no -O3):

3.5 μs
11.2 μs
11.2 μs
8.4 μs
2.9 μs

On v0.6, I get

3.5 μs
8.4 μs
8.4 μs
8.7 μs
2.9 μs

So, I’m seeing a regression in v0.7 on norm2 and norm2bis. I see the same ratios with n == 100.


#5
julia> @btime norm($a)
  2.700 μs (0 allocations: 0 bytes)
100.08555723165735

julia> @btime norm2($a)
  8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm2bis($a)
  8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm3($a)
  8.901 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm4($a)
  407.545 ns (0 allocations: 0 bytes)
100.08555723165735

Re “scandalously fast”, for comparison

julia> z = similar(a);

julia> @btime copyto!($z, $a);
  1.376 μs (0 allocations: 0 bytes)

possibly some fastmath magic there

Here is part of @code_native norm4(a) on a test computer:

; Location: array.jl:705
L80: # move data from vector onto CPU registers 4, 5, 6, and 7
	vmovupd	-192(%rsi), %zmm4
	vmovupd	-128(%rsi), %zmm5
	vmovupd	-64(%rsi), %zmm6
	vmovupd	(%rsi), %zmm7
;}}
; Function macro expansion; {
; Location: float.jl:395 # square and add 4,5,6, and 7 to 0,1,2, and 3.
	vfmadd231pd	%zmm4, %zmm4, %zmm0
	vfmadd231pd	%zmm5, %zmm5, %zmm1
	vfmadd231pd	%zmm6, %zmm6, %zmm2
	vfmadd231pd	%zmm7, %zmm7, %zmm3
;}

Basically, it unrolls the loop by a factor of 4 × the SIMD length. On each iteration of the loop, it loads 4 SIMD vectors, multiplies each by itself, and adds the results to four separate accumulation vectors.
Once the loop ends, it sums those vectors.
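In plain Julia, the transformation can be sketched by hand like this (an illustration only; norm4_unrolled is a made-up name, the real compiled code works on 8-wide registers rather than scalars, and the remainder handling is simplified):

```julia
# Four independent accumulators, updated 4-per-iteration,
# mirroring what the compiler does with zmm0–zmm3 above
function norm4_unrolled(x)
    s1 = s2 = s3 = s4 = 0.0
    n = length(x)
    i = 1
    @inbounds while i + 3 <= n
        s1 += x[i]   * x[i]
        s2 += x[i+1] * x[i+1]
        s3 += x[i+2] * x[i+2]
        s4 += x[i+3] * x[i+3]
        i += 4
    end
    @inbounds while i <= n   # pick up the leftovers
        s1 += x[i] * x[i]
        i += 1
    end
    sqrt((s1 + s2) + (s3 + s4))   # final horizontal reduction
end
```

Breaking the dependency chain across four accumulators is what lets the FMA instructions overlap instead of each waiting on the previous one.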

Check out this post by Keno:
https://juliacomputing.com/blog/2017/09/27/auto-vectorization-in-julia.html

EDIT:
Using a different computer:

julia> @btime norm($a)
  4.833 μs (0 allocations: 0 bytes)
100.37241645291456

julia> @btime norm2($a)
  7.246 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm2bis($a)
  7.204 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm3($a)
  7.201 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm4($a)
  766.965 ns (0 allocations: 0 bytes)
100.37241645291456

#6

FWIW, running julia --math-mode=fast yields @simd-like results when @inbounds is present, and does not improve things if not.

It does not seem easy to devise a policy for writing generic code that gets fast results while preserving reproducibility. Julia Base and the stdlib seem inconsistent in the annotations they use (sometimes @inbounds, sometimes @simd, sometimes both). A sensible default is to write @inbounds in critical loops, and to use --math-mode=fast for fast performance. It’s a shame because those annotations are pretty ugly (and in particular destroy the nice “for at the beginning of a block” visual cue we’ve all been getting used to). Or just don’t use any annotations, and run with --check-bounds=no.

However, it’s not clear to me whether --check-bounds and --math-mode apply to all code, or only to non-precompiled user code.

Some official guidelines (eg in “Performance tips”) would be very helpful here. Barring that, the ecosystem just uses ad-hoc conventions, meaning that in the default mode (and even with --check-bounds and --math-mode) you can be sure neither that your results will be the same from machine to machine, nor that you are getting optimal performance.


#7

Why isn’t copyto! using SIMD?


#8

Even if it is, it still has to write to memory, which is slower than reading from the cache.
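One way to see the cost of the writes is to compare a read-only reduction against a copy loop over the same data (a sketch; readsum and copyloop2! are made-up names, and Base’s @time is used to keep it dependency-free, though @btime gives steadier numbers):

```julia
# Read-only traversal vs. a loop that also writes
function readsum(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end
function copyloop2!(y, x)
    @inbounds @simd for i in eachindex(x)
        y[i] = x[i]
    end
    y
end

a = randn(10_000)
z = similar(a)
@time readsum(a)       # reads only
@time copyloop2!(z, a) # reads a, writes z
```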


#9

Here is a little experiment:

module moinbounds13
using FinEtools
using FinEtools.MatrixUtilityModule: add_mggt_ut_only!
using LinearAlgebra: Transpose, mul!
using BenchmarkTools
function add_mggt_ut_only_wo!(Ke::FFltMat, gradN::FFltMat, mult::FFlt)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx::FInt = 0
    nx::FInt = 0
    px::FInt = 0
    for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    return true
end
function test()
    gradN = rand(3, 2)
    Ke = fill(0.0, size(gradN, 1), size(gradN, 1))
    @btime add_mggt_ut_only!($Ke, $gradN, 1.0)
    @btime 1.0*($gradN*Transpose($gradN))
    @btime 1.0*mul!($Ke, $gradN, Transpose($gradN))
    @btime add_mggt_ut_only_wo!($Ke, $gradN, 1.0)
    true
end
end
using .moinbounds13
moinbounds13.test()

On my Surface Pro 4 it gives:

julia> include("test/test_inbounds.jl")
WARNING: replacing module moinbounds13.
  28.141 ns (0 allocations: 0 bytes)
  307.359 ns (2 allocations: 320 bytes)
  283.688 ns (1 allocation: 160 bytes)
  23.092 ns (0 allocations: 0 bytes)
true

So, turning off @inbounds (add_mggt_ut_only_wo!) results in code that is 18% faster than the version with @inbounds on all three loops (add_mggt_ut_only!).

The performance of mul! is surprising. (Perhaps I’m not using it right?)


#10

Sorry, the comparison in the above example is not quite fair: `add_mggt_ut_only*!` only computes the upper triangle. Completing the lower triangle takes an additional 9 ns.
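That completion step is just a mirror of the upper triangle; a sketch (complete_lt! is a made-up helper name):

```julia
# Copy the upper triangle of Ke into its lower triangle, in place
function complete_lt!(Ke)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    @inbounds for nx = 1:Kedim
        for mx = nx+1:Kedim
            Ke[mx, nx] = Ke[nx, mx]
        end
    end
    return Ke
end
```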


#11

The performance of mul! is surprising. (Perhaps I’m not using it right?)

By doing 1*mul!, you’re allocating a whole new matrix.
You could do the following (aside from just dropping the multiplication by 1; I’m assuming there are times you really do want to scale it):

Ke .= 1.0 .* mul!(Ke, gradN, gradN') # or Transpose(gradN), but they're equivalent for real matrices

Or, if you don’t mind using the lower level BLAS interface, you could try either of the following

BLAS.syrk!('U', 'N', 1.0, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', 1.0, gradN, gradN, 0.0, Ke)

However, there is massive overhead on calling BLAS. With large matrices, that overhead gets amortized.
Using StaticArrays (MMatrices if you want them mutable) will be much faster for small sizes.

Anyway, the reason your function is much faster without @inbounds comes down to the fact that you need @inbounds for SIMD instructions: for these tiny matrices, the vectorized code that @inbounds enables never actually runs (more on that below).
On a test computer, here are the machine instructions when you aren’t using @inbounds:

; Function *; {
; Location: float.jl:399
	vmulsd	(%rbx,%rsi,8), %xmm0, %xmm1
	vmulsd	(%r12,%rdi,8), %xmm1, %xmm1
;}}
; Function +; {
; Location: float.jl:395
	vaddsd	(%rdx,%rsi,8), %xmm1, %xmm1
;}
; Function setindex!; {
; Location: array.jl:745
	vmovsd	%xmm1, (%rdx,%rsi,8)
;}

“xmm” registers are 16 bytes (128 bits), but the “sd” (scalar double) suffix means these instructions only operate on 8 bytes, a single double, at a time.
When you do add @inbounds, even though you didn’t add @simd, LLVM will automatically vectorize it for you. It does include a chunk of code exactly like the above to pick up “leftovers”, but it has a bunch of other code too, including:

; Function *; {
; Location: float.jl:399
	vmulpd	-192(%r14,%rbx,8), %zmm1, %zmm3
	vmulpd	-128(%r14,%rbx,8), %zmm1, %zmm4
	vmulpd	-64(%r14,%rbx,8), %zmm1, %zmm5
	vmulpd	(%r14,%rbx,8), %zmm1, %zmm6
	vmulpd	%zmm2, %zmm3, %zmm3
	vmulpd	%zmm2, %zmm4, %zmm4
	vmulpd	%zmm2, %zmm5, %zmm5
	vmulpd	%zmm2, %zmm6, %zmm2
;}}
; Function +; {
; Location: float.jl:395
	vaddpd	-192(%rcx,%rbx,8), %zmm3, %zmm3
	vaddpd	-128(%rcx,%rbx,8), %zmm4, %zmm4
	vaddpd	-64(%rcx,%rbx,8), %zmm5, %zmm5
	vaddpd	(%rcx,%rbx,8), %zmm2, %zmm2
;}
; Function setindex!; {
; Location: array.jl:745
	vmovupd	%zmm3, -192(%rcx,%rbx,8)
	vmovupd	%zmm4, -128(%rcx,%rbx,8)
	vmovupd	%zmm5, -64(%rcx,%rbx,8)
	vmovupd	%zmm2, (%rcx,%rbx,8)
	addq	$32, %rbx
	cmpq	%rbx, %r9
	jne	L608
;}}

The zmm registers are 512 bits – they hold 8 doubles each. This does the same thing as the scalar version, but each zmm register holds 8 doubles, and it processes 4 of them per iteration – meaning you need at least 32 doubles. On computers with avx2, the registers would be 256 bits == 4 doubles, for a total of at least 16 doubles to load, multiply, and accumulate per iteration. (Because instructions can overlap, it’s faster to work on four registers at a time than to wait until one is done before starting the next. Hence 4 per go.)

Obviously, these matrices are way too small for the above – meaning the entire calculation is just the leftovers, which execute the same way in both versions, with and without @inbounds. While I see a few lines with

     movabsq	$jl_bounds_error_ints, %rax

only in the version without @inbounds, I guess those bounds checks don’t take as long to deal with as all that SIMD code that doesn’t actually get used.

One advantage of using statically sized arrays is that the compiler knows exactly how long the calculations will be – meaning it’ll only insert that SIMD code if it will actually get run.

Why isn’t copyto! using SIMD?
Even if it is, it still has to write in memory, which is slower than reading the cache.

Yeah, but adding @simd actually did help a bit.

julia> function copyloop!(y, x)
           @inbounds @simd for i in eachindex(x)
               y[i] = x[i]
           end
       end

julia> @benchmark copyloop!($z, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.368 μs (0.00% GC)
  median time:      1.379 μs (0.00% GC)
  mean time:        1.398 μs (0.00% GC)
  maximum time:     3.948 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark copyto!($z, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.545 μs (0.00% GC)
  median time:      1.842 μs (0.00% GC)
  mean time:        1.838 μs (0.00% GC)
  maximum time:     3.834 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

#12

The top four curves (slower) represent matrix multiplication and the use of mul!, both with .* and without. The bottom two curves (faster) are explicit loops, and indeed, as @Elrod pointed out, with larger matrices the loops with @inbounds have a chance of being faster than those without.

If one were to expect the BLAS overhead to get amortized at larger matrix sizes, it is taking quite a while to get there: even at 225 rows and columns of the resulting matrix, the explicit loops are still twice as fast as BLAS.


#13

Did you try syrk! and gemm!?
The following are all equivalent (so long as typeof(a) == eltype(gradN) == eltype(Ke)).

Ke .= a .* mul!(Ke, gradN, gradN') 
BLAS.syrk!('U', 'N', a, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', a, gradN, gradN, 0.0, Ke)

The advantage here is that neither syrk! nor gemm! loops over the data a second time to multiply by a.

Test functions:

using LinearAlgebra, BenchmarkTools
gradN = rand(3, 2);
Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
mult = 2.3;
function add_mggt_ut_only_wo!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx = 0
    nx = 0
    px = 0
    for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    Ke
end
function add_mggt_ut_only!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx = 0
    nx = 0
    px = 0
    @inbounds for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] +=  gradN[mx, px]*(mult)*gradN[nx, px]
            end
        end
    end
    Ke
end

Running the benchmarks with small matrices:

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN')
  334.450 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059   1.51745  0.629522
 1.51745   3.63988  1.33395 
 0.629522  1.33395  0.56814 

julia> fill!(Ke, 0); # Just to confirm the result is the same each time

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke)
  299.895 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059  1.51745  0.629522
 0.0      3.63988  1.33395 
 0.0      0.0      0.56814 

julia> fill!(Ke, 0);

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke)
  147.368 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
 0.70059   1.51745  0.629522
 1.51745   3.63988  1.33395 
 0.629522  1.33395  0.56814 

julia> fill!(Ke, 0);

julia> add_mggt_ut_only!(Ke, gradN, mult); Ke
3×3 Array{Float64,2}:
 0.70059  1.51745  0.629522
 0.0      3.63988  1.33395 
 0.0      0.0      0.56814 

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  21.483 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  17.165 ns (0 allocations: 0 bytes)

With 250x200, and OpenBLAS

julia> BLAS.set_num_threads(1) # BLAS easily wins with multithreading

julia> gradN = rand(250, 200);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  402.399 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  344.340 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  528.487 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  3.066 ms (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  4.401 ms (0 allocations: 0 bytes)

and MKL:

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  204.780 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  151.862 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  256.089 μs (0 allocations: 0 bytes)

With threading, gemm! is a lot faster than mul!. I thought mul! was just a wrapper for gemm!, so I’m not sure what is going on. MKL:

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  123.680 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  29.612 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  43.061 μs (0 allocations: 0 bytes)

I would have guessed that mul! is now slower because the broadcast part isn’t multithreaded, but…

julia> @btime mul!($Ke, $gradN, $gradN');
  80.795 μs (0 allocations: 0 bytes)

Also, even at size 50 I’m seeing dramatically better performance from BLAS, even with OpenBLAS (which is slower than MKL at small/modest matrix sizes):

julia> gradN = rand(50, 35);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  7.763 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  5.970 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  6.448 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  23.383 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  32.695 μs (0 allocations: 0 bytes)

Could something be wrong with your BLAS install? Maybe it’s using generic fallback kernels instead of ones optimized for your processor?
If you’re on Julia 0.6.2 or older with a fairly new processor, that’s fairly likely. Julia 0.6.2’s OpenBLAS didn’t support Ryzen, for example.
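A rough first check of which BLAS actually got loaded (a sketch using only the stdlib; exact output format varies between Julia versions):

```julia
using Libdl
using InteractiveUtils

# Which BLAS shared library did this session load?
for lib in filter(p -> occursin("blas", lowercase(basename(p))), Libdl.dllist())
    println(lib)
end

# Verbose versioninfo prints the detected CPU; on some versions
# it also shows the BLAS build information
versioninfo(verbose=true)
```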


#14

This is my configuration:
Julia Version 0.7.0-beta.279
Commit b0f531e5f0* (2018-07-12 15:33 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel® Core™ i7-6650U CPU @ 2.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4

I have to admit that I haven’t bothered building Julia for my particular setup. It is the generic Windows installation.


#15

Did you disable multithreading inside the BLAS routines? I noticed that, at least on 0.6, there is a significant overhead if BLAS tries to multithread.


#16

No, I haven’t tried disabling multithreading.


#17

My first example was with multithreading disabled, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I’m on 0.7, and OpenBLAS now seems smarter about staying single-threaded for smaller matrix sizes. Eg, as far as I could tell it ran single-threaded here, regardless of what BLAS.set_num_threads was set to.

julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.289 μs (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.277 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  37.251 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  52.913 μs (0 allocations: 0 bytes)

Times were the same, and I never saw CPU usage exceed 100%.

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.

BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:

julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  5.714 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  8.696 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  8.697 ns (0 allocations: 0 bytes)

MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (eg, @MMatrix randn(m, n) ) aren’t optimized for big matrices, so lots of things would randomly be painfully slow.


#18

This includes BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke), which eventually becomes the fastest method. (In my application gradN is not likely to be bigger than 64x3.)


#19

Ah, okay. I think the skinniness of your matrix is part of the problem.
I would’ve called a 14x14 matrix “small”, and that has more elements than the 64x3. (Although X * X’ would be much larger for the latter).

3 may also be an awkward width. Eg, on my computer, BLAS takes about as long for 64x3 as it does for 64x4, while the loop versions are of course about 33% faster for the former:

julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  2.933 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.800 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  3.731 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  4.930 μs (0 allocations: 0 bytes)

julia> gradN = rand(64, 4);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  3.100 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.981 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.935 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  6.628 μs (0 allocations: 0 bytes)

#20

Very interesting data. Thanks! And thanks for your suggestions. A lot of food for thought…