Help defining (masked) vload and vstore operations for SArrays (or other isbits structs) using llvmcall

Hi everyone,
First of all, the motivation for this question is the StaticArrays issue “3x3 matrix multiply could potentially be faster”.

I showed that matrix multiplication can become substantially faster across a range of sizes:

(size(m3), size(m1), size(m2)) = ((2, 2), (2, 2), (2, 2))
MMatrix Multiplication:
  3.135 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  1.584 ns (0 allocations: 0 bytes)
fastmul!:
  1.896 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0; 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((3, 3), (3, 3), (3, 3))
MMatrix Multiplication:
  7.383 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  8.274 ns (0 allocations: 0 bytes)
fastmul!:
  3.237 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0; 0.0 0.0 0.0; 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((4, 4), (4, 4), (4, 4))
MMatrix Multiplication:
  10.524 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  3.528 ns (0 allocations: 0 bytes)
fastmul!:
  4.154 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((5, 5), (5, 5), (5, 5))
MMatrix Multiplication:
  19.038 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  22.561 ns (0 allocations: 0 bytes)
fastmul!:
  5.894 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((6, 6), (6, 6), (6, 6))
MMatrix Multiplication:
  30.316 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  39.103 ns (0 allocations: 0 bytes)
fastmul!:
  7.837 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((7, 7), (7, 7), (7, 7))
MMatrix Multiplication:
  51.105 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  62.420 ns (0 allocations: 0 bytes)
fastmul!:
  11.871 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((8, 8), (8, 8), (8, 8))
MMatrix Multiplication:
  36.552 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  12.940 ns (0 allocations: 0 bytes)
fastmul!:
  12.794 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((9, 9), (9, 9), (9, 9))
MMatrix Multiplication:
  68.433 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  73.896 ns (0 allocations: 0 bytes)
fastmul!:
  24.042 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((10, 10), (10, 10), (10, 10))
MMatrix Multiplication:
  106.568 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  109.546 ns (0 allocations: 0 bytes)
fastmul!:
  31.296 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((11, 11), (11, 11), (11, 11))
MMatrix Multiplication:
  161.298 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  151.703 ns (0 allocations: 0 bytes)
fastmul!:
  38.405 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((12, 12), (12, 12), (12, 12))
MMatrix Multiplication:
  210.829 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  216.941 ns (0 allocations: 0 bytes)
fastmul!:
  47.986 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((13, 13), (13, 13), (13, 13))
MMatrix Multiplication:
  315.835 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  302.625 ns (0 allocations: 0 bytes)
fastmul!:
  65.856 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((14, 14), (14, 14), (14, 14))
MMatrix Multiplication:
  466.087 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  396.795 ns (0 allocations: 0 bytes)
fastmul!:
  66.755 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((15, 14), (15, 15), (15, 14))
MMatrix Multiplication:
  548.775 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  450.668 ns (0 allocations: 0 bytes)
fastmul!:
  67.804 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

(size(m3), size(m1), size(m2)) = ((16, 14), (16, 16), (16, 14))
MMatrix Multiplication:
  545.207 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  83.752 ns (0 allocations: 0 bytes)
fastmul!:
  65.457 ns (0 allocations: 0 bytes)
m3 - s3 = [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

We can already use the fastmul! function as soon as we’re willing to add dependencies on CpuId and SIMD/SIMDPirates – plus a little extra code to loop over the kernel whenever the matrices are larger than the kernel size (i.e., roughly 8xN * Nx6 for AVX2 or 16xN * Nx14 for AVX-512).
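
The loop around the kernel could look something like the following schematic; kernel_mul! is a hypothetical stand-in for the actual register-blocked kernel (which would handle the row remainder with the masked loads/stores discussed below):

# Schematic only: tile C into MR x NR blocks and call a fixed-size kernel on each block.
# kernel_mul! is hypothetical; MR and NR would be chosen from the SIMD width and register count.
function blocked_mul!(C, A, B; MR = 8, NR = 6)
    M, N = size(C)
    for j in 1:NR:N, i in 1:MR:M
        kernel_mul!(C, A, B, i, j, min(MR, M - i + 1), min(NR, N - j + 1))
    end
    return C
end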

However, it only works for the heap-allocated MMatrix, meaning that using it forces us to write code like mul!(C, A, B), after having taken care to preallocate C, instead of the simpler C = A * B. Worse, if we want to use automatic differentiation libraries, many of these currently do not support mutating arguments, forcing us to constantly trigger the garbage collector (or switch to the potentially slower StaticArrays methods).

The simple reason we need MMatrices instead of SMatrices is that I don’t know how to load a vector from an SMatrix using LLVM intrinsics. This is necessary so that I can use masked load and store operations. In short, for multiplication to be fast, we need to vectorize it efficiently – and masked load/store operations let us do precisely that, even when the number of rows of A in C = A * B isn’t a multiple of the computer’s vector width.
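
To make the masked-load idea concrete, here is a purely illustrative plain-Julia sketch of the semantics (not the fast path, just what the hardware instruction computes):

# Illustration only: a 4-wide masked load of a 3-element column. Lanes where the mask is
# false produce zero instead of reading past the end of the column.
maskedload_demo(col, mask::NTuple{4,Bool}) =
    ntuple(i -> mask[i] ? Float64(col[i]) : 0.0, Val(4))

maskedload_demo([1.0, 2.0, 3.0], (true, true, true, false))  # (1.0, 2.0, 3.0, 0.0)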

But surely, it is possible somehow? For example,

julia> using StaticArrays

julia> a = @SVector randn(24);

julia> b = @SVector randn(24);

julia> @code_llvm hcat(a, b)

; Function hcat
; Location: /home/chriselrod/.julia/packages/StaticArrays/WmJnA/src/linalg.jl:120
define void @julia_hcat_-777564679({ [48 x double] }* noalias nocapture sret, { [24 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(192), { [24 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(192)) {
top:
; Function _hcat; {
; Location: /home/chriselrod/.julia/packages/StaticArrays/WmJnA/src/linalg.jl:124
; Function macro expansion; {
; Location: /home/chriselrod/.julia/packages/StaticArrays/WmJnA/src/linalg.jl:135
; Function getindex; {
; Location: /home/chriselrod/.julia/packages/StaticArrays/WmJnA/src/SVector.jl:37
; Function getindex; {
; Location: tuple.jl:24
  %3 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %1, i64 0, i32 0, i64 8
  %4 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %1, i64 0, i32 0, i64 16
  %5 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %2, i64 0, i32 0, i64 8
  %6 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %2, i64 0, i32 0, i64 16
;}}
  %7 = bitcast { [24 x double] } addrspace(11)* %1 to <8 x i64> addrspace(11)*
  %8 = load <8 x i64>, <8 x i64> addrspace(11)* %7, align 8
  %9 = bitcast double addrspace(11)* %3 to <8 x i64> addrspace(11)*
  %10 = load <8 x i64>, <8 x i64> addrspace(11)* %9, align 8
  %11 = bitcast double addrspace(11)* %4 to <8 x i64> addrspace(11)*
  %12 = load <8 x i64>, <8 x i64> addrspace(11)* %11, align 8
  %13 = bitcast { [24 x double] } addrspace(11)* %2 to <8 x i64> addrspace(11)*
  %14 = load <8 x i64>, <8 x i64> addrspace(11)* %13, align 8
  %15 = bitcast double addrspace(11)* %5 to <8 x i64> addrspace(11)*
  %16 = load <8 x i64>, <8 x i64> addrspace(11)* %15, align 8
  %17 = bitcast double addrspace(11)* %6 to <8 x i64> addrspace(11)*
  %18 = load <8 x i64>, <8 x i64> addrspace(11)* %17, align 8
;}}
  %19 = bitcast { [48 x double] }* %0 to <8 x i64>*
  store <8 x i64> %8, <8 x i64>* %19, align 8
  %.sroa.0.sroa.9.0..sroa.0.0..sroa_cast1.sroa_idx58 = getelementptr inbounds { [48 x double] }, { [48 x double] }* %0, i64 0, i32 0, i64 8
  %20 = bitcast double* %.sroa.0.sroa.9.0..sroa.0.0..sroa_cast1.sroa_idx58 to <8 x i64>*
  store <8 x i64> %10, <8 x i64>* %20, align 8
  %.sroa.0.sroa.17.0..sroa.0.0..sroa_cast1.sroa_idx66 = getelementptr inbounds { [48 x double] }, { [48 x double] }* %0, i64 0, i32 0, i64 16
  %21 = bitcast double* %.sroa.0.sroa.17.0..sroa.0.0..sroa_cast1.sroa_idx66 to <8 x i64>*
  store <8 x i64> %12, <8 x i64>* %21, align 8
  %.sroa.0.sroa.25.0..sroa.0.0..sroa_cast1.sroa_idx74 = getelementptr inbounds { [48 x double] }, { [48 x double] }* %0, i64 0, i32 0, i64 24
  %22 = bitcast double* %.sroa.0.sroa.25.0..sroa.0.0..sroa_cast1.sroa_idx74 to <8 x i64>*
  store <8 x i64> %14, <8 x i64>* %22, align 8
  %.sroa.0.sroa.33.0..sroa.0.0..sroa_cast1.sroa_idx82 = getelementptr inbounds { [48 x double] }, { [48 x double] }* %0, i64 0, i32 0, i64 32
  %23 = bitcast double* %.sroa.0.sroa.33.0..sroa.0.0..sroa_cast1.sroa_idx82 to <8 x i64>*
  store <8 x i64> %16, <8 x i64>* %23, align 8
  %.sroa.0.sroa.41.0..sroa.0.0..sroa_cast1.sroa_idx90 = getelementptr inbounds { [48 x double] }, { [48 x double] }* %0, i64 0, i32 0, i64 40
  %24 = bitcast double* %.sroa.0.sroa.41.0..sroa.0.0..sroa_cast1.sroa_idx90 to <8 x i64>*
  store <8 x i64> %18, <8 x i64>* %24, align 8
  ret void
}

hcat seems to do precisely this. For example, it loads elements 9-16 from “a” with:

  %3 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %1, i64 0, i32 0, i64 8
  %9 = bitcast double addrspace(11)* %3 to <8 x i64> addrspace(11)*
  %10 = load <8 x i64>, <8 x i64> addrspace(11)* %9, align 8

and elements 17-24 from a with:

  %4 = getelementptr { [24 x double] }, { [24 x double] } addrspace(11)* %1, i64 0, i32 0, i64 16
  %11 = bitcast double addrspace(11)* %4 to <8 x i64> addrspace(11)*
  %12 = load <8 x i64>, <8 x i64> addrspace(11)* %11, align 8

etc.

This doesn’t look so hard. To keep things simple, here is plain code (rather than a generated function):

julia> using StaticArrays

julia> const Vec{N,T} = NTuple{N,Core.VecElement{T}}
Tuple{Vararg{VecElement{T},N}} where T where N

julia> @inline function svload(A::SArray{S,T,SN,L}, i::Int) where {S,T,SN,L}
           Base.llvmcall(("",
               """%elptr = getelementptr { [24 x double] }, { [24 x double] }* %0, i64 0, i32 0, i64 %1
               %ptr = bitcast double addrspace(11)* %elptr to <8 x double>*
               %res = load <8 x double>, <8 x double>* %ptr, align 8
               ret <8 x double> %res"""),
               Vec{8, Float64}, Tuple{SArray{S, T, SN, L}, Int}, A, i)
       end
svload (generic function with 1 method)

julia> a = @SVector randn(24);

julia> svload(a, 7)
ERROR: error compiling svload: Failed to parse LLVM Assembly: 
julia: llvmcall:3:62: error: '%0' defined with type '{ [24 x double] }'
%elptr = getelementptr { [24 x double] }, { [24 x double] }* %0, i64 0, i32 0, i64 %1
                                                             ^

Stacktrace:
 [1] top-level scope at none:0

Somehow, when calling hcat, it interprets the StaticArray as a pointer to elements, but when I try to copy the code, it instead interprets the SArray{S, T, SN, L} type declaration as '{ [L x T] }'. Can I instead correctly declare it as '{ [L x T] }*', or extract the address to get '{ [L x T] }*' from '{ [L x T] }'?

I understand of course that getting a pointer to a stack-allocated struct doesn’t make sense within Julia, but (based on the function hcat), it seems like it ought to work within a brief function?

Once that’s done, I also have no idea what is going on with %.sroa.0.sroa.9.0..sroa.0.0..sroa_cast1.sroa_idx58, nor how the function actually returns the resulting SMatrix – ret void looks like it doesn’t return anything at all.
So I’d also have no idea how to proceed to actually put the resulting StaticArray together.
Unfortunately, I would also need the store operations, because we require the masked stores (for all cases where size(A,1) % VECTOR_WIDTH != 0).

I think it would be great if we could get such a huge performance boost on top of StaticArrays, even if we have to load another library with a few more dependencies on top of it (one that replaces the * and LinearAlgebra.mul! methods defined in StaticArrays).

EDIT:
It looks like inttoptr is one of the things I am looking for.
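
For example, here is a minimal (non-generated) sketch of the inttoptr pattern on Julia 0.7/1.0, where a Ptr argument reaches llvmcall as an i64. It loads 8 contiguous doubles from a raw pointer; obtaining such a pointer from an SArray is still the open problem.

const Vec8F64 = NTuple{8,Core.VecElement{Float64}}

# Unmasked 8-wide load through a raw pointer, e.g. ptrload(pointer(A)) for an
# ordinary Vector{Float64} A with at least 8 elements.
@inline function ptrload(ptr::Ptr{Float64})
    Base.llvmcall(("",
        """%p = inttoptr i64 %0 to <8 x double>*
        %res = load <8 x double>, <8 x double>* %p, align 8
        ret <8 x double> %res"""),
        Vec8F64, Tuple{Ptr{Float64}}, ptr)
end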

Have you considered the following?

julia> to_vec(t::NTuple{N,T}) where {N,T} = ntuple(i->VecElement{T}(t[i]),N)
julia> to_tup(v::NTuple{N,VecElement{T}}) where {N,T} = ntuple(i->v[i].value, N)

Dealing with the Julia/LLVM calling convention for tuples / vectors is often a pain, but the above mostly compiles to nothing. And you can extract the tuple from SVectors easily.
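
For concreteness, a usage sketch of these helpers (assuming StaticArrays is loaded):

using StaticArrays
a = @SVector randn(8)
v = to_vec(a.data)        # NTuple{8,VecElement{Float64}} – an LLVM vector for suitable lengths
t = to_tup(v)             # back to a plain NTuple{8,Float64}
b = SVector{8,Float64}(t) # and back to an SVector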

Sadly, I think that is the best I can do.
I was hoping I could use alloca to get a pointer, store the variable there, and finally load it – and have LLVM eliminate the allocation. However, that does not seem to be the case:

using StaticArrays
a = @SVector randn(24);
@inline function llvmzload(A::SArray{Tuple{24}, Float64, 1, 24}, i::Int)
    Base.llvmcall(("",
    """%sptr = alloca { [24 x double] }
    store { [24 x double] } %0, { [24 x double] }* %sptr
    %elptr = getelementptr { [24 x double] }, { [24 x double] }* %sptr, i32 0, i32 0, i64 %1
    %ptr = bitcast double* %elptr to <8 x double>*
    %res = load <8 x double>, <8 x double>* %ptr, align 8
    ret <8 x double> %res"""), Vec{8, Float64}, Tuple{SArray{Tuple{24}, Float64, 1, 24}, Int}, A, i)
end

yields:

julia> @code_native llvmzload(a, 7)
	.text
; Function llvmzload {
; Location: REPL[162]:3
	subq	$64, %rsp
	vmovups	(%rdi), %zmm0
	vmovups	64(%rdi), %zmm1
	vmovups	128(%rdi), %zmm2
	vmovups	%zmm0, -128(%rsp)
	vmovups	%zmm1, -64(%rsp)
	vmovups	%zmm2, (%rsp)
	vmovups	-128(%rsp,%rsi,8), %zmm0
	addq	$64, %rsp
	retq
	nopl	(%rax)
;}

I was basing this on a 10-year-old email that said LLVM has no address-of operator, unlike C (and co.):

You probably already know this, but I want to state it just in case.
LLVM has no address-of operator, so if you see an instruction like “%sum
= add i32 %x, %y” you can’t get the address of %sum (it’s a register,
after all).

The C frontend doesn’t interpret “int32_t i;” as creating an LLVM
variable of type i32; it actually does “i32* %i_addr = alloca i32”.
LLVM’s optimizers will remove the alloca and lower it to registers if
nobody takes its address.

So I was hoping it would eliminate the moves to %rsp, and just do a single move from %rdi.
If I try to use @asmcall instead, could that function still be inlined into the calling functions?

The problem with those simple tuple functions is that I’ve never seen the compiler use a mask instruction.

julia> m2 = (true, true, true, true, true, true, true, false)
(true, true, true, true, true, true, true, false)

julia> @inline function vload2(::Type{Vec{N,T}}, x, ::Val{mask}) where {N,T,mask}
           ntuple(n -> mask[n] ? VE(T(x[n])) : VE(zero(T)), Val(N))
       end
vload2 (generic function with 2 methods)

julia> vload2(Vec{8,Float64}, a, Val(m2))
(VecElement{Float64}(0.1341517372061226), VecElement{Float64}(-1.4478062620335768), VecElement{Float64}(-0.830543624514204), VecElement{Float64}(0.9402472351085811), VecElement{Float64}(-0.5112288316089226), VecElement{Float64}(0.19750599041128963), VecElement{Float64}(0.9528499075643893), VecElement{Float64}(0.0))

julia> @code_native vload2(Vec{8,Float64}, a, Val(m2))
	.text
; Function vload2 {
; Location: REPL[165]:1
; Function ntuple; {
; Location: sysimg.jl:271
; Function macro expansion; {
; Location: REPL[165]:1
	vmovups	(%rsi), %ymm0
	vmovups	32(%rsi), %xmm1
	vmovsd	48(%rsi), %xmm2         # xmm2 = mem[0],zero
	vinsertf128	$1, %xmm2, %ymm1, %ymm1
	vinsertf64x4	$1, %ymm1, %zmm0, %zmm0
;}}
	retq
	nopl	(%rax)
;}

I want a single masked move, not 3 moves, a vinsertf128, and a vinsertf64x4. That is still much better than even the non-masked move using llvmcall.

MArrays are far easier to optimize, but don’t get along with autodiff libraries like Zygote.

What, exactly, do you want to achieve?

Can you post Julia code that does semantically what you want? Then we can think about how the correct assembly should look, and try to coax LLVM into emitting it.

Or can you post C code using the _mm_something intel intrinsics that does what you want?

I think r = Ref(something_bitstype) and then using pointer_from_objref(r), like an MMatrix, has a certain chance of working just as well as the MMatrix.

What I want to achieve:
Let’s say we have two SMatrices, A and B, of sizes 3x6 and 6x4.
We want to calculate C = A * B.
The most efficient way to do this involves loading columns of A into registers. However, each column has length 3, which is not a multiple of the SIMD register width. If we could use a masked load, masking off the 4th lane, we wouldn’t risk a segmentation fault on the last column.

This is basically what I did with the “fastmul!” function that worked on MMatrices: get the pointer, and then use the following, adapted from SIMD.jl:

using SIMD: llvmtype, suffix, llvmconst # internal SIMD.jl helpers used below
using Core.Intrinsics: llvmcall
const Vec{N,T} = NTuple{N,Core.VecElement{T}}

@generated function vload(::Type{Vec{N,T}}, ptr::Ptr{T},
                          mask::Vec{N,Bool},
                          ::Type{Val{Aligned}} = Val{false}) where {N,T,Aligned}
    @assert isa(Aligned, Bool)
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    vtyp = "<$N x $typ>"
    btyp = llvmtype(Bool)
    vbtyp = "<$N x $btyp>"
    decls = []
    instrs = []
    if Aligned
        align = N * sizeof(T)
    else
        align = sizeof(T)   # This is overly optimistic
    end

    if VERSION < v"0.7.0-DEV"
        push!(instrs, "%ptr = bitcast $typ* %0 to $vtyp*")
    else
        push!(instrs, "%ptr = inttoptr $ptyp %0 to $vtyp*")
    end
    push!(instrs, "%mask = trunc $vbtyp %1 to <$N x i1>")
    push!(decls,
        "declare $vtyp @llvm.masked.load.$(suffix(N,T))($vtyp*, i32, " *
            "<$N x i1>, $vtyp)")
    push!(instrs,
        "%res = call $vtyp @llvm.masked.load.$(suffix(N,T))($vtyp* %ptr, " *
            "i32 $align, <$N x i1> %mask, $vtyp $(llvmconst(N, T, 0)))")
    push!(instrs, "ret $vtyp %res")
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall($((join(decls, "\n"), join(instrs, "\n"))),
            Vec{N,T}, Tuple{Ptr{T}, Vec{N,Bool}}, ptr, mask)
    end
end

These masked operations allow matrix multiplication to be many times faster at odd sizes (in line with the sizes that are a multiple of SIMD width):

(size(m3), size(m1), size(m2)) = ((7, 7), (7, 7), (7, 7))
MMatrix Multiplication:
  51.105 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  62.420 ns (0 allocations: 0 bytes)
fastmul!:
  11.871 ns (0 allocations: 0 bytes)

(size(m3), size(m1), size(m2)) = ((8, 8), (8, 8), (8, 8))
MMatrix Multiplication:
  36.552 ns (0 allocations: 0 bytes)
SMatrix Multiplication:
  12.940 ns (0 allocations: 0 bytes)
fastmul!:
  12.794 ns (0 allocations: 0 bytes)

by letting me apply the same kernel.
I want to write a fastmul for StaticArrays that doesn’t require mutability.

But that requires masked vload and vstore functions. I know there are issues with immutability and vstore, but ideally there would at least be some constructor that could be compiled using masked stores to assemble the SMatrix C from C = A * B.
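
For reference, the pointer-based store counterpart I have in mind would look roughly like the following untested sketch, mirroring the vload above and reusing the same SIMD.jl helpers (llvmtype, suffix):

# Sketch of a masked store through a raw pointer, analogous to the vload above.
@generated function vstore!(ptr::Ptr{T}, v::Vec{N,T},
                            mask::Vec{N,Bool},
                            ::Type{Val{Aligned}} = Val{false}) where {N,T,Aligned}
    @assert isa(Aligned, Bool)
    ptyp = llvmtype(Int)
    typ = llvmtype(T)
    vtyp = "<$N x $typ>"
    btyp = llvmtype(Bool)
    vbtyp = "<$N x $btyp>"
    align = Aligned ? N * sizeof(T) : sizeof(T)   # same overly optimistic guess as above
    decls = ["declare void @llvm.masked.store.$(suffix(N,T))($vtyp, $vtyp*, i32, <$N x i1>)"]
    instrs = ["%ptr = inttoptr $ptyp %0 to $vtyp*",
              "%mask = trunc $vbtyp %2 to <$N x i1>",
              "call void @llvm.masked.store.$(suffix(N,T))($vtyp %1, $vtyp* %ptr, i32 $align, <$N x i1> %mask)",
              "ret void"]
    quote
        $(Expr(:meta, :inline))
        Base.llvmcall($((join(decls, "\n"), join(instrs, "\n"))),
            Cvoid, Tuple{Ptr{T}, Vec{N,T}, Vec{N,Bool}}, ptr, v, mask)
    end
end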

As a simple example of a vload that works on what I believe is a stack-allocated object:

#include<immintrin.h>
#include<iostream>

struct TwentyFourDoubles{
    double a, b, c, d, e, f, g, h, i, j, k, l,
            m, n, o, p, q, r, s, t, u, v, w, x;
};

extern "C" __m256d mask_extract(TwentyFourDoubles, long);

__m256d mask_extract(TwentyFourDoubles v, long i){
    __mmask8 mask = __mmask8( 0x7 );
    return _mm256_maskz_loadu_pd (mask, (double*)&v + i);
}

int main(){
    TwentyFourDoubles v = {
          0.0,  1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,
          8.0,  9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
         16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0
    };
    __m256d vm = mask_extract(v, 21);
    std::cout << vm[0] << "\n";
    std::cout << vm[1] << "\n";
    std::cout << vm[2] << "\n";
    std::cout << vm[3] << "\n";
    return 0;
}

This was C++ because I first tried using std::vector, but I couldn’t get it to work.
Running:

$ g++ -O3 -march=native masked_loadavx2.cpp -o masked_loadavx2
$ ./masked_loadavx2 
21
22
23
0
$ g++ -O3 -march=native -shared -fPIC masked_loadavx2.cpp -o libmasked_loadavx2.so
$ g++ -O3 -march=native -shared -fPIC -S masked_loadavx2.cpp -o masked_loadavx2.s

Godbolt.

Calling the function from Julia:

julia> using StaticArrays

julia> x = @SVector randn(24);

julia> cppload(x, i) = ccall((:mask_extract, "/home/chriselrod/Documents/progwork/Cxx/libmasked_loadavx2.so"), NTuple{4,Core.VecElement{Float64}}, (SVector{24,Float64},Int), x, i);

julia> cppload(x, 0)
(VecElement{Float64}(-0.3958496038785274), VecElement{Float64}(-0.004217241837986538), VecElement{Float64}(-0.8215047867850469), VecElement{Float64}(0.0))

julia> x'
1×24 LinearAlgebra.Adjoint{Float64,SArray{Tuple{24},Float64,1,24}}:
 -0.39585  -0.00421724  -0.821505  1.17331  -0.0995132  0.282014  -1.72385  …  1.60397  -0.412593  -2.11107  0.146823  0.0842252  0.109252  -0.616276  0.0695952

julia> cppload(x, 21)
(VecElement{Float64}(0.10925210293336388), VecElement{Float64}(-0.6162761526512442), VecElement{Float64}(0.06959519642418371), VecElement{Float64}(0.0))

In the assembly, I do see a hodge-podge of instructions before the

vmovupd ymm0{k1}, YMMWORD PTR [rax]

and I am not sure what they are doing, but it definitely does not look like they copy the entire struct to another region of memory before performing the masked move.

Versus…

function loadsvector(::Type{Vec{N,T}}, a, i) where {N,T}
    r = Ref(a)
    ptr = Base.unsafe_convert(Ptr{T}, pointer_from_objref(r))
    vload(Vec{N,T}, ptr + sizeof(T)*i, (VE(true),VE(true),VE(true),VE(false)))
end

This copies the entire array before performing the masked load, just like the LLVM alloca version:

julia> @code_native loadsvector(Vec{4,Float64}, a, 0)
	.text
; Function loadsvector {
; Location: REPL[19]:2
; Function Type; {
; Location: refpointer.jl:82
; Function Type; {
; Location: refvalue.jl:10
; Function Type; {
; Location: REPL[19]:2
	subq	$72, %rsp
	vmovups	(%rsi), %zmm0
	vmovups	64(%rsi), %zmm1
	vmovups	128(%rsi), %zmm2
	vmovups	%zmm2, (%rsp)
	vmovups	%zmm1, -64(%rsp)
	vmovups	%zmm0, -128(%rsp)
;}}}
; Location: REPL[19]:4
; Function vload; {
; Location: memory.jl:80
; Function vload; {
; Location: memory.jl:80
; Function macro expansion; {
; Location: memory.jl:109
	movb	$7, %al
	kmovd	%eax, %k1
	vmovupd	-128(%rsp,%rdx,8), %ymm0 {%k1} {z}
;}}}
	addq	$72, %rsp
	retq
	nopw	%cs:(%rax,%rax)
;}

EDIT:

More on motivation: here is a lazy, non-general implementation that only works when matrix A has 7 rows:

using SIMDPirates, Base.Cartesian
m2 = (true, true, true, true, true, true, true, false)
@inline vload2(::Type{Vec{N,T}}, x, i, ::Val{mask}) where {N,T,mask} = ntuple(n -> mask[n] ? VE(T(x[n+i])) : VE(zero(T)), Val(N))
vload2(Vec{8,Float64}, a, 0, Val(m2))
@code_native vload2(Vec{8,Float64}, a, 0, Val(m2))

@generated function mul(A::SMatrix{7,N,T}, B::SMatrix{N,P,T}) where {N,P,T}
    out_tup = :(C_1[1].value, C_1[2].value)
    for m ∈ 3:7
        push!(out_tup.args, :(C_1[$m].value))
    end
    for p ∈ 2:P, m ∈ 1:7
        push!(out_tup.args, :($(Symbol(:C_,p))[$m].value))
    end
    V = Vec{8,Float64}
    mask = (true, true, true, true, true, true, true, false)
    quote
        Acol = vload2($V, A, 0, Val{$mask}())
        @nexprs $P p -> C_p = SIMDPirates.vmul(Acol, vbroadcast($V, B[1,p]))
        @nexprs $(N-1) n -> begin
            Acol = vload2($V, A, 7n, Val{$mask}())
            @nexprs $P p -> begin
                C_p = vfma(Acol, vbroadcast($V, B[n+1,p]), C_p)
            end
        end
        SMatrix{7,$P,$T}($out_tup)
    end
end

It uses the tuple masked load functions. If you’re testing this on a computer without AVX-512, use 7x6 matrices instead (7x7 will cause register spilling on AVX2, and therefore will be slow).

julia> A = @SMatrix randn(7,7);

julia> B = @SMatrix randn(7,7);

julia> @benchmark $A * $B
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     61.418 ns (0.00% GC)
  median time:      63.340 ns (0.00% GC)
  mean time:        64.400 ns (0.00% GC)
  maximum time:     103.359 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     981

julia> @benchmark mul($A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.446 ns (0.00% GC)
  median time:      23.705 ns (0.00% GC)
  mean time:        23.977 ns (0.00% GC)
  maximum time:     50.328 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

This is a nice improvement over StaticArrays’ current function, but it is still nearly 2x slower than the MMatrix implementation using the masked load/store operations:

SMatrix Multiplication:
  62.420 ns (0 allocations: 0 bytes)
fastmul!:
  11.871 ns (0 allocations: 0 bytes)

The direct llvm construct is shufflevector. I would try passing your llvm-vector (using tup_to_vec(sarray.data)) into your kernel, and then select whatever you need with shufflevector (using source-code literals as the mask).

In my experience, this often compiles down to the correct masked vector instructions, if tup_to_vec(sarray.data) and your llvmcall are inlined into the same context.

Masked loads and stores are also LLVM intrinsics, albeit only defined for pointers. That is what I was using for MMatrices. In my opening post, I noted that somehow hcat receives pointers to SArrays and uses load / store operations on them.
Meaning, it seems to me, it should be possible for us to also get a pointer and use these masked intrinsics.

But, I like the shufflevector idea.
An immediate problem, unfortunately, seems to be that Julia interprets some NTuple{N,VecElement{T}}s as LLVM arrays instead of LLVM vectors (depending on N):

julia> A = @SMatrix randn(7,7);

julia> Avec = to_vec(A.data);

julia> typeof(Avec)
NTuple{49,VecElement{Float64}}

julia> SIMDPirates.shufflevector(Avec, Val{(0,1,2,3,4,5,6,6)}())
ERROR: error compiling shufflevector: Failed to parse LLVM Assembly: 
julia: llvmcall:3:36: error: '%0' defined with type '[49 x double]'
%res = shufflevector <49 x double> %0, <49 x double> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 6>
                                   ^

Stacktrace:
 [1] top-level scope at none:0

I’ll make a shufflevector method that accepts NTuple{N,T}s (always arrays) and then bitcasts to vectors. Hopefully that gets compiled away.

But that leaves the store / constructing the output SMatrix.

Of course, I only actually need one masked load operation / shuffle on a load.
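
For completeness, the example below also assumes an unmasked three-argument vload2 method, along the lines of this reconstruction (not copied from the original session):

# Presumed unmasked companion to the masked vload2 defined earlier.
@inline vload2(::Type{Vec{N,T}}, x, i) where {N,T} = ntuple(n -> VE(T(x[n+i])), Val(N))
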
In this example, @code_native shows it was not replaced with a masked move:

@generated function mul2(A::SMatrix{7,N,T}, B::SMatrix{N,P,T}) where {N,P,T}
    out_tup = :(C_1[2].value, C_1[3].value)
    for m ∈ 4:8
        push!(out_tup.args, :(C_1[$m].value))
    end
    for p ∈ 2:P, m ∈ 2:8
        push!(out_tup.args, :($(Symbol(:C_,p))[$m].value))
    end
    V = Vec{8,Float64}
    mask = (true, true, true, true, true, true, true, false)
    quote
        Acol = SIMDPirates.shufflevector(
            vload2($V, A, 0),
            (VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0),VE(0.0)),
            Val{(8,0,1,2,3,4,5,6)}())
        @nexprs $P p -> C_p = SIMDPirates.vmul(Acol, vbroadcast($V, B[1,p]))
        @nexprs $(N-1) n -> begin
            Acol = vload2($V, A, 7n-1)
            @nexprs $P p -> begin
                C_p = vfma(Acol, vbroadcast($V, B[n+1,p]), C_p)
            end
        end
        SMatrix{7,$P,$T}($out_tup)
    end
end

Instead, we still have vinsertf128 and vinsertf64x4:

julia> A = @SMatrix randn(7,7);

julia> B = @SMatrix randn(7,7);

julia> @code_native mul2(A, B)
	.text
; Function mul2 {
; Location: REPL[113]:2
; Function macro expansion; {
; Location: REPL[113]:12
; Function shufflevector; {
; Location: shufflevector.jl:20
; Function macro expansion; {
; Location: REPL[113]:2
	vmovsd	(%rsi), %xmm0           # xmm0 = mem[0],zero
	vpslldq	$8, %xmm0, %xmm0        # xmm0 = zero,zero,zero,zero,zero,zero,zero,zero,xmm0[0,1,2,3,4,5,6,7]
	vinsertf128	$1, 8(%rsi), %ymm0, %ymm0
	vinsertf64x4	$1, 24(%rsi), %zmm0, %zmm0
;}}
#... (cut rest of function)

julia> @btime mul2($A, $B);
  16.619 ns (0 allocations: 0 bytes)

But (doing this on only a single load) it is a little faster. The storing still adds about 5 ns, roughly 50% more runtime, compared to the MMatrix version.

Unfortunately, both vector arguments in a shuffle have to be the same length, but I can also try using shuffles for assembling the SMatrix (combine columns 1&2, 3&4, 5&6, grow 7, combine (1&2)&(3&4), …).

I’m also inclined to create my own SMatrix type, where the tuple length is a multiple of the SIMD register width (i.e., 4 or 8 depending on AVX2 / AVX-512). If we can’t liberally use the masked operations, such padded matrices will be more efficient for most operations.
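
Something like the following rough sketch (an entirely hypothetical type, not part of StaticArrays), where each column is padded to the SIMD width W so that column loads and stores never need a mask:

# Hypothetical padded matrix: M x N logical size, columns padded to width W, L == W * N.
struct PaddedSMatrix{M,N,T,W,L} <: AbstractMatrix{T}
    data::NTuple{L,T}
end

Base.size(::PaddedSMatrix{M,N}) where {M,N} = (M, N)
Base.getindex(A::PaddedSMatrix{M,N,T,W}, i::Int, j::Int) where {M,N,T,W} =
    A.data[(j - 1) * W + i]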

Ok, WTF.

julia> a = @SVector rand(6*8); v = to_vec(a.data); @code_llvm identity(v)

; Function identity
; Location: operators.jl:475
define <48 x double> @julia_identity_35402(<48 x double>) {
top:
  ret <48 x double> %0
}
julia> a = @SVector rand(7*8); v = to_vec(a.data); @code_llvm identity(v)

; Function identity
; Location: operators.jl:475
define void @julia_identity_35397([56 x double]* noalias nocapture sret, [56 x double] addrspace(11)* nocapture nonnull readonly dereferenceable(448)) {
top:
  %2 = bitcast [56 x double]* %0 to i8*
  %3 = bitcast [56 x double] addrspace(11)* %1 to i8 addrspace(11)*
  call void @llvm.memcpy.p0i8.p11i8.i64(i8* %2, i8 addrspace(11)* %3, i64 448, i32 8, i1 false)
  ret void
}

I fail to find the weird size limitation in either the LLVM LangRef or the bare-bones description of the Julia native ABI.

Note the docs: SIMD Support · The Julia Language

It has a special compilation rule: a homogeneous tuple of VecElement{T} maps to an LLVM vector type when T is a primitive bits type and the tuple length is in the set {2-6,8-10,16}.

Ok, 7 also becomes an aggregate instead of a vector. This does not depend on the eltype, same for Bool, Float32, etc.
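
For what it’s worth, here is a quick way to probe which lengths lower to an LLVM vector (isvec is just a throwaway helper):

using InteractiveUtils

# Does identity on an NTuple{N,VecElement{Float64}} lower to an <N x double> return?
isvec(N) = occursin("<$N x double>",
    sprint(code_llvm, identity, Tuple{NTuple{N,Core.VecElement{Float64}}}))

[N => isvec(N) for N in 1:17]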