Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly?

Yes, I have seen these bug fixes. Many thanks to all contributors! :clap::clap::clap:

My main problem is that I use Gridap in one of my projects, and this package is not yet ready for ARM Julia. I am confident that this will be fixed soon; in the meantime, Rosetta is OK.

Why? I’m surprised a package depends so much on the architecture.

It is not clear to me, but I didn't investigate, since I thought I should wait for Julia to be ready before reporting problems to package developers.
Failed to precompile GridapGmsh LoadError: UndefVarError: libgmsh not defined


     Testing Running tests...
Warning : Unknown entity of dimension 0 and tag 1 in physical group 1
Warning : Unknown entity of dimension 0 and tag 2 in physical group 1
Warning : Unknown entity of dimension 1 and tag 1 in physical group 2
Warning : Unknown entity of dimension 1 and tag 2 in physical group 2
Warning : Unknown entity of dimension 2 and tag 1 in physical group 6
Info    : Meshing 1D...
Info    : [  0%] Meshing curve 1 (Line)
Info    : [ 30%] Meshing curve 2 (Line)
Info    : [ 50%] Meshing curve 3 (Line)
Info    : [ 80%] Meshing curve 4 (Line)
Info    : Done meshing 1D (Wall 0.000194916s, CPU 0.000145s)
Info    : Meshing 2D...
Info    : Meshing surface 1 (Plane, Frontal-Delaunay)
Info    : Done meshing 2D (Wall 0.00640492s, CPU 0.006084s)
Info    : 404 nodes 810 elements
Info    : Writing '/var/folders/v2/hmy3kzgj4tb3xsy8qkltxd0r0000gn/T/jl_HWMsST/t1.msh'...
Info    : Done writing '/var/folders/v2/hmy3kzgj4tb3xsy8qkltxd0r0000gn/T/jl_HWMsST/t1.msh'
Test Summary: | Pass  Total  Time
gmsh          |    1      1  0.2s
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/t1.msh'...
Info    : 9 entities
Info    : 428 nodes
Info    : 816 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/t1.msh'
┌ Warning: `@_inline_meta` is deprecated, use `@inline` instead.
│   caller = get_staged(mi::Core.MethodInstance) at utilities.jl:110
└ @ Core.Compiler ./compiler/utilities.jl:110
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/twoTetraeder.msh'...
Info    : 4 entities
Info    : 5 nodes
Info    : 2 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/twoTetraeder.msh'
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/../demo/demo.msh'...
Info    : 188 entities
Info    : 10257 nodes
Info    : 41006 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/../demo/demo.msh'
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/square.msh'...
Info    : 9 entities
Info    : 9 nodes
Info    : 16 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/square.msh'
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/cube.msh'...
Info    : 27 entities
Info    : 27 nodes
Info    : 64 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/cube.msh'
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/plane.msh'...
Info    : 9 entities
Info    : 8 nodes
Info    : 18 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/plane.msh'
Info    : Reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/periodic.msh'...
Info    : 9 entities
Info    : 365 nodes
Info    : 732 elements
Info    : Done reading '/Users/mose/.julia/packages/GridapGmsh/Q8ZwW/test/periodic.msh'
Test Summary:     | Pass  Total     Time
GmshDiscreteModel |  544    544  1m06.1s
     Testing GridapGmsh tests passed

Thank you very much !
I still have a stupid question: I can't find the ARM Julia version of the nightly build on the Julia website (the macOS button downloads the x86 version).

There is no CI for that platform at the moment, so there are no nightly builds. I always compile Julia locally; it takes about 4 minutes, which is decently fast.

It worked like a charm!
Thank you very much for your help.

Yes, MKL is not tuned for client CPUs that have only 1 port of AVX-512. All such processors will run better with the AVX2 code path. You should use MKL_ENABLE_INSTRUCTIONS=AVX2 (or whatever is the correct way to disable AVX-512).
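A hedged sketch of what that looks like in practice: set the environment variable before MKL initializes. Whether setting it from inside a running session is early enough may depend on how and when the MKL library is loaded; exporting it in the shell before launching Julia is the safe route.

```julia
# Assumption: MKL_ENABLE_INSTRUCTIONS=AVX2 is the right way to cap MKL at
# AVX2, as suggested above; double-check against your MKL version's docs.
# This must happen before the MKL library initializes.
ENV["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"
# using MKL   # load MKL.jl only after the variable is set
```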

MKL is not particularly sensitive to L2 size.

Source: I used to work for Intel and spent over a year of my life on AVX-512 issues.


Interesting. The ability to store 4x the data in registers (twice as many registers, each twice as wide) should still reduce the rate at which these CPUs traverse memory, cutting memory bandwidth requirements.
A larger L2 should help for this reason as well.

Interesting that MKL's approach doesn't benefit from larger register tiles, since register tiling, unlike cache blocking, is not something one can benefit from automatically (e.g. via a recursive/cache-oblivious algorithm that would be insensitive to particular cache sizes).

I didn't say anything about registers. The sizes of the architectural x86 register files are constant across all implementations, and MKL surely takes advantage of all the architectural registers in every implementation.

Performance of BLAS versus L2 block size saturates at some fraction of capacity, let’s say it’s around half, which means the performance isn’t going to be horrible if other things are happening that cause cache pollution. It also means the same implementation can work across a range of L2 implementations.

You talked about AVX2 vs. AVX-512 code paths. The latter does make it much easier for naive implementations to achieve peak performance: e.g., I can hit 99% of the theoretical peak on Tiger Lake, while getting anything other than garbage performance is extremely difficult on Haswell (MKL, on the other hand, gets great performance there). Haswell has just AVX2 (but also a 256 KiB L2 cache, and, being older, probably weaker prefetchers, etc.).

In calculating column-major C = A*B with Float64 elements…
The AVX2 code path allows using only up to 16 named vector registers of 32 bytes each.
Typically, this means your innermost microkernel computes an 8x6 block of C (requiring 2*6 = 12 registers), updating it on each iteration with a column of A (2 registers) and, one at a time, a broadcasted load from B (1 register, for 15 named registers in total). In practice, many more physical registers will probably be used (e.g. for multiple parallel loads from B), but the kernel's assembly is necessarily limited.

The AVX-512 code path allows using a much larger microkernel. You have 32 named registers that are 64 bytes each, making, for example, 16x14 (2*14 + 2 + 1 == 31) or 24x9 kernels (3*9 + 3 + 1 == 31) feasible.
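The register accounting above can be captured in a few lines; `kernel_registers` is a hypothetical helper for checking the arithmetic, not anything from MKL:

```julia
# Named-register budget for an mr x nr Float64 microkernel.
# vecbytes = 32 for AVX2, 64 for AVX-512.
function kernel_registers(mr, nr, vecbytes)
    lanes = vecbytes ÷ sizeof(Float64)  # Float64 lanes per vector register
    cols  = mr ÷ lanes                  # registers holding one column of C (or of A)
    cols * nr + cols + 1                # C tile + one column of A + one broadcast from B
end
```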

While CPUs like Ice Lake client, Rocket Lake, and Tiger Lake have equivalent FMA throughput with 32-byte vectors (2 per cycle) as with 64-byte vectors (1 per cycle), the larger kernels should greatly reduce memory bandwidth requirements.

For example, let's say you're multiplying a 128xK matrix A by a Kx252 matrix B, so that C is 128x252.
Typically, you'll do this by dividing B into tiles of width equal to your microkernel's width (nr) in an outer loop, and A into tiles of height equal to the microkernel's height (mr) in an inner loop.

for n in 1:nr:252
    for m in 1:mr:128
        @views microkernel!(C[m:m+mr-1, n:n+nr-1], A[m:m+mr-1, :], B[:, n:n+nr-1])
    end
end
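The loop above can be fleshed out into a runnable sketch. `microkernel!` here is a naive stand-in (a plain library `mul!`), not a real register-tiled SIMD kernel; it just demonstrates the tiling structure:

```julia
using LinearAlgebra

# Naive stand-in for a real SIMD microkernel: Ctile .+= Atile * Btile.
# A real implementation would keep the C tile in vector registers.
microkernel!(Ctile, Atile, Btile) = mul!(Ctile, Atile, Btile, 1.0, 1.0)

# Tiled multiply following the loop structure above.
# Assumes mr divides size(C, 1) and nr divides size(C, 2).
function tiled_mul!(C, A, B; mr = 8, nr = 6)
    M, N = size(C)
    fill!(C, 0.0)
    for n in 1:nr:N            # outer loop: tiles of B, nr columns wide
        for m in 1:mr:M        # inner loop: tiles of A, mr rows tall
            @views microkernel!(C[m:m+mr-1, n:n+nr-1],
                                A[m:m+mr-1, :], B[:, n:n+nr-1])
        end
    end
    return C
end
```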

The key observation here is that for every iteration of the outer n loop we iterate over the entire matrix A, making matrix multiplication O(N^3) in memory bandwidth requirements, even though there is only O(N^2) memory overall.
Taking the 8x6 kernel, we need to load every element of A a total of 252/6 = 42 times. With a 24x9 kernel we’d have only needed to pass over the data and reload them 252/9 = 28 times, while the 16x14 microkernel cuts this all the way down to 18.
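Those reload counts are just the number of kernel-width passes over A, which a one-liner can check (ceiling division covers the general case; in these examples nr divides 252 exactly):

```julia
# Full passes over A when B is N columns wide and the microkernel
# covers nr columns per pass.
passes_over_A(N, nr) = cld(N, nr)
```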

Reducing the amount of memory bandwidth required to feed the FMA units makes keeping them fed much easier, and these differences are rather significant: 42/18 > 2.3.
Of course, if one’s implementation is already so good that with AVX2 memory bandwidth isn’t a problem and FMA units achieve peak utilization, then cutting the requirements in half isn’t going to utilize the FMA units any further…
So maybe this is only a big deal for naive implementations. But I've found that my relatively naive single-threaded implementations can get 99% of theoretical peak performance on Tiger Lake (AVX-512, 1 FMA unit), 93% on Cascade Lake (AVX-512, 2 FMA units), and a bit over 92% on an M1.
Of course, the theoretical peak on Cascade Lake is 2x that of Tiger Lake or the M1.
There was some discussion here: Matrix Multiplication benchmark analysis · Issue #356 · JuliaSIMD/LoopVectorization.jl · GitHub

Note, I’m just talking about how I’d do it and how I’ve seen it discussed in the literature, not how MKL does anything – I’m not privy to any info on that.
MKL does seem to be handling cache much more effectively, so that it seems all of this may be mostly irrelevant.