[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

Ah, I just noticed a regression in my latest benchmark results and fixed it:

There was a fairly expensive check that used to be evaluated at compile time but was evaluated at runtime in PaddedMatrices v0.1.6. I just released v0.1.7, which moves the check back to compile time. The check's overhead is O(1), but it was expensive enough to matter at small sizes.
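To illustrate the kind of difference (a hypothetical sketch, not PaddedMatrices' actual code): when a check's answer can be derived from the argument types alone, Julia resolves the branch during specialization rather than paying for it on every call:

```julia
# Hypothetical sketch: when the answer depends only on the *type*, the
# branch is constant-folded per method instance instead of run per call.
needs_check(::Type{<:Matrix}) = false            # plain dense arrays: nothing to check
needs_check(::Type{<:AbstractMatrix}) = true     # exotic array types: validate at runtime

function matmul!(C, A, B)
    if needs_check(typeof(A))   # folds away when A isa Matrix
        # ... runtime validation for exotic array types ...
    end
    # ... kernel ...
    return C
end
```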

I pulled BLASFEO master and reran the benchmarks:

i9-10980XE (Cascadelake-X): [benchmark plot]

i5-8350U CPU (Skylake): [benchmark plot]

i3-4010U CPU (Haswell): [benchmark plot]

Now PaddedMatrices is much faster over the 2:24 size range on the three computers I tested.

> due to denormals being present in the memory around the matrices somehow present in your benchmark routines.

This is normal in Julia. If you read out-of-bounds memory, you will probably get junk. If you interpret that memory as floating-point numbers, denormals are likely.
I use the @llvm.masked.load and @llvm.masked.store intrinsics to avoid touching out-of-bounds memory while still using SIMD instructions. With AVX, these get lowered to vmaskmovp* instructions, where the mask consumes a floating-point register. With AVX512, they lower to ordinary vmovup* instructions with a bitmask applied, held in one of the opmask registers instead of a floating-point register.
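For illustration, here is a minimal sketch of a masked load via Base.llvmcall; it assumes a Julia/LLVM version with typed pointers, and `masked_load4` is a made-up name, not PaddedMatrices' API:

```julia
# A minimal sketch (not PaddedMatrices' actual code) of loading up to 4
# doubles with @llvm.masked.load so that masked-off lanes never touch memory.
const VE = Base.VecElement

@inline function masked_load4(ptr::Ptr{Float64}, mask::NTuple{4,VE{Bool}})
    Base.llvmcall((
        "declare <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>*, i32, <4 x i1>, <4 x double>)",
        """
        %p = inttoptr i64 %0 to <4 x double>*
        %m = trunc <4 x i8> %1 to <4 x i1>
        %v = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* %p, i32 8, <4 x i1> %m, <4 x double> zeroinitializer)
        ret <4 x double> %v
        """),
        NTuple{4,VE{Float64}}, Tuple{Ptr{Float64},NTuple{4,VE{Bool}}},
        ptr, mask)
end

# Load the 3 valid elements of a length-3 vector into a 4-lane register;
# the masked-off fourth lane never reads out-of-bounds memory.
A = rand(3)
v = GC.@preserve A masked_load4(pointer(A), (VE(true), VE(true), VE(true), VE(false)))
```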

Also, at such small sizes (e.g., 2:24), I'd recommend taking advantage of Julia's JIT if possible.
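As a generic illustration (not PaddedMatrices' API), passing the dimensions in the type domain, e.g. via Val, lets the JIT compile a dedicated, fully-unrollable kernel per size:

```julia
# Generic illustration: encoding (M, K, N) as type parameters means each
# size triple gets its own specialized, fully-unrollable method instance.
function matmul!(C, A, B, ::Val{M}, ::Val{K}, ::Val{N}) where {M,K,N}
    @inbounds for n in 1:N, m in 1:M
        acc = zero(eltype(C))
        for k in 1:K
            acc += A[m, k] * B[k, n]
        end
        C[m, n] = acc
    end
    return C
end

C = zeros(8, 8); A = rand(8, 8); B = rand(8, 8)
matmul!(C, A, B, Val(8), Val(8), Val(8))  # specialized for 8×8×8
```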
Benchmarks that allowed specializing on the size of the arrays:

NN:

Cascadelake-X: [benchmark plot]
Skylake: [benchmark plot]
Haswell: [benchmark plot]

TN:

Cascadelake-X: [benchmark plot]
Skylake: [benchmark plot]
Haswell: [benchmark plot]

TT:

Cascadelake-X: [benchmark plot]
Skylake: [benchmark plot]
Haswell: [benchmark plot]

Of course, I don’t think Julia (or any language with a massive runtime) is a good choice for embedded devices. But for someone already using Julia, it can provide a nice benefit.

I did not include NT because BLASFEO produced an incorrect result at 13x13.

The relevant code from PaddedMatrices does the following (see the sketch after this list):

  1. If the first stride of A is not contiguous, pack A.
  2. Heuristically, it will pack if 73 > M with AVX512F, or 53 > M without it, unless the base of the array is aligned and stride(A,2) is also a multiple of the SIMD vector width.
  3. If mc * kc > M * K, where A is M × K and mc and kc are the blocking parameters, pack A.

If it is packing A, it will also pack B if kc * nc ≤ K * N.
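Here are those decisions as code; the names and structure are illustrative, not PaddedMatrices' actual internals:

```julia
# Hypothetical sketch of the packing heuristic described above.
# W = SIMD vector width in elements; mc, kc, nc = blocking parameters.
function should_pack_A(A, M, K, mc, kc; W = 8, avx512 = true)
    stride(A, 1) == 1 || return true                # (1) non-contiguous: always pack
    aligned = UInt(pointer(A)) % (W * sizeof(eltype(A))) == 0 &&
              stride(A, 2) % W == 0
    threshold = avx512 ? 73 : 53
    (M < threshold && !aligned) && return true      # (2) heuristic for misaligned A
    return mc * kc > M * K                          # (3) one block covers all of A
end

# B is packed only when A is packed and one block doesn't cover all of B:
should_pack_B(K, N, kc, nc, packing_A) = packing_A && kc * nc ≤ K * N
```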

Therefore, it will generally pack A in T*. The exception is when A’s size is known at compile time (e.g., a FixedSizeArray), in which case it won’t pack A if the number of rows is less than or equal to twice the SIMD vector width. I should probably make that check more sophisticated, e.g., with respect to the number of times the same elements of A will actually have to be reloaded.

When I get there, I’ll decide how data should be laid out. On the code-gen side of things, I still need to add support for triangular loops and for handling dependencies between iterations.
But this focus on code gen means that once LoopVectorization understands those things, the macrokernels should be easy to specify via simple loops, making details such as memory layout relatively easy to change.
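For example, a macrokernel written as plain loops, along the lines of LoopVectorization's README gemm example:

```julia
using LoopVectorization

# Macrokernel as simple loops: @avx handles unrolling and vectorization,
# so changing memory layout would only mean changing the indexing.
function gemm_kernel!(C, A, B)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```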

I am a fan of the idea of alternative layouts. I’ve slowly been working on a DSL for probabilistic programming. One optimization I’d like to support some day is having it choose the memory layout of all underlying data structures to optimize performance.
Perhaps I’d be better off making internal arrays default to some sort of tile-major layout instead of column-major?
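As a sketch of what I mean by tile-major (a hypothetical type, not an existing package API): elements are stored in small column-major tiles, and the tiles themselves are laid out column-major. Assumes M and N are multiples of the tile dimensions:

```julia
# Hypothetical tile-major matrix: TM×TN column-major tiles, with the
# tiles themselves ordered column-major across the matrix.
struct TileMajor{T,TM,TN} <: AbstractMatrix{T}
    data::Vector{T}
    M::Int
    N::Int
end

Base.size(A::TileMajor) = (A.M, A.N)

function Base.getindex(A::TileMajor{T,TM,TN}, i::Int, j::Int) where {T,TM,TN}
    ti, tj = (i - 1) ÷ TM, (j - 1) ÷ TN   # which tile
    li, lj = (i - 1) % TM, (j - 1) % TN   # position within the tile
    tile = tj * (A.M ÷ TM) + ti
    @inbounds A.data[tile * TM * TN + lj * TM + li + 1]
end

# e.g., an 8×8 matrix stored as 4×4 tiles:
A = TileMajor{Float64,4,4}(collect(1.0:64.0), 8, 8)
A[5, 1]  # reads from the second tile down the first tile-column
```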

That is unfortunate, but I like the idea of people having the ability to choose something aligned with their use case. E.g., reviews are currently overwhelmingly better for AMD, but I am happy to still have the choice to buy AVX512, because I can write software that actually takes advantage of it.

But most people don’t and won’t, so why should they pay for silicon they’re not using?
In terms of segmentation, though, I’d love to see broad adoption of ARM’s Scalable Vector Extension (SVE), at least among ARM CPUs, with different segments simply supporting different vector widths. Those who need HPC can buy wide vectors, and those who don’t can buy 128-bit, all within the same instruction set.
Unfortunately, as far as I know, the A64FX is the only CPU currently supporting SVE.

Thanks. The CPU is a 10980XE, which is marketed as “high end desktop”. It hits >2.1 teraflops with MKL’s dgemm, builds software from source quickly, and looks stellar in all the heavily-SIMD benchmarks I run, so IMO it was worth the expense for a hobbyist. AMD CPUs do well at compiling software, but none currently come close to the >120 double-precision GFLOPS/core it achieves.
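(As a sanity check on those numbers: with two AVX512 FMA units, a core retires 2 × 8 × 2 = 32 double-precision flops per cycle, so at an assumed all-core AVX512 clock near 3.8 GHz that's ≈122 GFLOPS/core, and 18 cores give ≈2.2 TFLOPS.)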
It’s worth more than my car, if you’re curious where my priorities are :wink:.
