[ANN] LoopVectorization

That could be interesting to experiment with. Though I’d probably just switch the data representation based on julia version rather than ceasing support for earlier versions. As long as that’s not disruptive in some other way.

Exactly. @maleadt could you shed some light on this?

Generally I feel like StaticArrays must rely on the optimizer in the long term because it knows target-specific details which we will never know in the library. So it’s a game of getting enough information to the optimization passes that they can do the best thing. (That also needs to be balanced with allowing good information in type inference. Unrolling just so happens to make this much easier, but we may need to stop relying on that crutch).

Ideally LoopVectorization would be an optimization pass which StaticArrays could send metadata to, and it figures out what to do based on the compilation target. (Basically like @simd I guess, though it would be great if it didn’t need to be built in to start with.)

CUDAnative basically takes the LLVM IR as generated by Julia, which would include the output of LoopVectorize without an easy way to circumvent it. So it would be bad if that IR contains CPU/x86_64-specific instructions. I haven't really followed this thread / LoopVectorize developments, but wouldn't the same problem arise with regular Julia execution on other platforms (arm, ppc, webassembly)?

1 Like

Ah, that’d be better.
I’d hope the matrix intrinsics get lowered in an appropriate, target-dependent way.

Is there any way the front end can gain access to this sort of information?

Aside from a recent request for it to emit vpmaddwd intrinsics in some situations (which I haven’t implemented yet), it currently just uses the LLVM intrinsics discussed in the reference manual, which should be cross-platform.
However, that just means LLVM should be able to produce valid code from this IR on every platform, not that the code will be efficient.

Coincidentally, this is exactly what I just did as an example of how to work with static ranges in StaticNumbers.jl.

It works quite well. I’ve benchmarked matrix-matrix multiplication for all sizes up to 8x8, and for some of the sizes I get 2-4 times speedup compared to SMatrix.

5 Likes

I’m not surprised about that speedup. Under the hood, vectors are padded to more workable sizes (on a computer with AVX2):

julia> [(R,8R, sizeof(NTuple{R,Float64}), sizeof(NTuple{R,Core.VecElement{Float64}})) for R ∈ 1:8]
8-element Array{NTuple{4,Int64},1}:
 (1, 8, 8, 8)
 (2, 16, 16, 16)
 (3, 24, 24, 32)
 (4, 32, 32, 32)
 (5, 40, 40, 48)
 (6, 48, 48, 48)
 (7, 56, 56, 64)
 (8, 64, 64, 64)

So for example, your 7x7 matrix is secretly an 8x7 matrix. Life is a lot easier with 8x7 matrices.
Masked loads/stores become unnecessary; to load a column, just use two 4-element vector loads.
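
For instance, here is a minimal sketch of that two-load idea (my own illustration using raw pointers, not StaticArrays/StaticNumbers code); with the column padded to 8 elements, no masked load is needed:

const VE = Core.VecElement
const Vec4 = NTuple{4,VE{Float64}}

# Load an 8-element (padded) column as two plain 4-wide vector loads.
# `p` is assumed to point at the first element of the column.
function load_padded_column(p::Ptr{Float64})
    lo = unsafe_load(convert(Ptr{Vec4}, p))                         # elements 1:4
    hi = unsafe_load(convert(Ptr{Vec4}, p + 4 * sizeof(Float64)))   # elements 5:8 (the 8th is padding)
    return lo, hi
end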

EDIT: You do explain this in the README. Worth checking out!

  • In the README, you mention alignment, which is very important, as can be seen in the LoopVectorization benchmarks, where performance is much higher whenever matrices have a multiple of 8 rows.
    Alignment is probably even more important for the 5x5 case, where loading a column as two vectors of lengths 4 and 2 doesn’t sound any better than loading a vector of length 4 plus a scalar, so I’d guess alignment is a major factor there.
    I haven’t checked the generated code, but I wouldn’t be surprised if it also tends to vectorize better.
1 Like

When multiplying two 5x5 Float64 matrices on my system, I get 9.8 Gflops with SMatrix and 21.0 Gflops with Mat (i.e., a Tuple of LLVM vectors). However, storing the data densely in memory (SMatrix layout) and converting to Mat while loading into registers seems to be the fastest, at 22.6 Gflops (a 2.3x speedup compared to SMatrix).

Interesting. Mind sharing the @code_native? I’m guessing the issue then is that SMatrix fails to vectorize, while Mat does vectorize.

I think LLVM auto-vectorization only really works with vector lengths that are powers of two.

Here’s the result of copy-pasting the code from the above-linked example into the REPL, followed by:

 @code_native debuginfo=:none mul_as_Mat(rand(SMatrix{5,5}), rand(SMatrix{5,5}))

that is, this is the code to multiply two 5-by-5 Float64 matrices on my system:

.section	__TEXT,__text,regular,pure_instructions
	movq	%rdi, %rax
	vmovupd	120(%rsi), %ymm2
	vmovsd	152(%rsi), %xmm1        ## xmm1 = mem[0],zero
	vmovupd	160(%rsi), %ymm4
	vmovsd	192(%rsi), %xmm6        ## xmm6 = mem[0],zero
	vmovsd	32(%rdx), %xmm0         ## xmm0 = mem[0],zero
	vbroadcastsd	%xmm0, %ymm3
	vmulpd	%ymm0, %ymm6, %ymm0
	vmulpd	%ymm3, %ymm4, %ymm3
	vmovsd	24(%rdx), %xmm7         ## xmm7 = mem[0],zero
	vbroadcastsd	%xmm7, %ymm5
	vfmadd231pd	%ymm7, %ymm1, %ymm0 ## ymm0 = (ymm1 * ymm7) + ymm0
	vfmadd213pd	%ymm3, %ymm2, %ymm5 ## ymm5 = (ymm2 * ymm5) + ymm3
	vmovsd	72(%rdx), %xmm3         ## xmm3 = mem[0],zero
	vbroadcastsd	%xmm3, %ymm7
	vmulpd	%ymm3, %ymm6, %ymm3
	vmulpd	%ymm7, %ymm4, %ymm7
	vmovsd	64(%rdx), %xmm9         ## xmm9 = mem[0],zero
	vbroadcastsd	%xmm9, %ymm8
	vfmadd231pd	%ymm9, %ymm1, %ymm3 ## ymm3 = (ymm1 * ymm9) + ymm3
	vfmadd213pd	%ymm7, %ymm2, %ymm8 ## ymm8 = (ymm2 * ymm8) + ymm7
	vmovsd	112(%rdx), %xmm7        ## xmm7 = mem[0],zero
	vbroadcastsd	%xmm7, %ymm9
	vmulpd	%ymm7, %ymm6, %ymm7
	vmulpd	%ymm9, %ymm4, %ymm9
	vmovsd	104(%rdx), %xmm11       ## xmm11 = mem[0],zero
	vbroadcastsd	%xmm11, %ymm10
	vfmadd231pd	%ymm11, %ymm1, %ymm7 ## ymm7 = (ymm1 * ymm11) + ymm7
	vfmadd213pd	%ymm9, %ymm2, %ymm10 ## ymm10 = (ymm2 * ymm10) + ymm9
	vmovsd	152(%rdx), %xmm9        ## xmm9 = mem[0],zero
	vbroadcastsd	%xmm9, %ymm11
	vmulpd	%ymm9, %ymm6, %ymm9
	vmulpd	%ymm11, %ymm4, %ymm12
	vmovsd	144(%rdx), %xmm11       ## xmm11 = mem[0],zero
	vfmadd231pd	%ymm11, %ymm1, %ymm9 ## ymm9 = (ymm1 * ymm11) + ymm9
	vbroadcastsd	%xmm11, %ymm11
	vfmadd213pd	%ymm12, %ymm2, %ymm11 ## ymm11 = (ymm2 * ymm11) + ymm12
	vmovsd	192(%rdx), %xmm12       ## xmm12 = mem[0],zero
	vmulpd	%ymm12, %ymm6, %ymm6
	vbroadcastsd	%xmm12, %ymm12
	vmulpd	%ymm12, %ymm4, %ymm4
	vmovsd	184(%rdx), %xmm12       ## xmm12 = mem[0],zero
	vfmadd213pd	%ymm6, %ymm12, %ymm1 ## ymm1 = (ymm12 * ymm1) + ymm6
	vmovupd	80(%rsi), %ymm6
	vbroadcastsd	%xmm12, %ymm12
	vfmadd213pd	%ymm4, %ymm2, %ymm12 ## ymm12 = (ymm2 * ymm12) + ymm4
	vmovsd	112(%rsi), %xmm2        ## xmm2 = mem[0],zero
	vmovsd	16(%rdx), %xmm4         ## xmm4 = mem[0],zero
	vfmadd231pd	%ymm4, %ymm2, %ymm0 ## ymm0 = (ymm2 * ymm4) + ymm0
	vbroadcastsd	%xmm4, %ymm4
	vfmadd213pd	%ymm5, %ymm6, %ymm4 ## ymm4 = (ymm6 * ymm4) + ymm5
	vmovsd	56(%rdx), %xmm5         ## xmm5 = mem[0],zero
	vfmadd231pd	%ymm5, %ymm2, %ymm3 ## ymm3 = (ymm2 * ymm5) + ymm3
	vbroadcastsd	%xmm5, %ymm5
	vfmadd213pd	%ymm8, %ymm6, %ymm5 ## ymm5 = (ymm6 * ymm5) + ymm8
	vmovsd	96(%rdx), %xmm8         ## xmm8 = mem[0],zero
	vfmadd231pd	%ymm8, %ymm2, %ymm7 ## ymm7 = (ymm2 * ymm8) + ymm7
	vbroadcastsd	%xmm8, %ymm8
	vfmadd213pd	%ymm10, %ymm6, %ymm8 ## ymm8 = (ymm6 * ymm8) + ymm10
	vmovsd	136(%rdx), %xmm10       ## xmm10 = mem[0],zero
	vfmadd231pd	%ymm10, %ymm2, %ymm9 ## ymm9 = (ymm2 * ymm10) + ymm9
	vbroadcastsd	%xmm10, %ymm10
	vfmadd213pd	%ymm11, %ymm6, %ymm10 ## ymm10 = (ymm6 * ymm10) + ymm11
	vmovsd	176(%rdx), %xmm11       ## xmm11 = mem[0],zero
	vfmadd213pd	%ymm1, %ymm11, %ymm2 ## ymm2 = (ymm11 * ymm2) + ymm1
	vmovupd	40(%rsi), %ymm1
	vbroadcastsd	%xmm11, %ymm11
	vfmadd213pd	%ymm12, %ymm6, %ymm11 ## ymm11 = (ymm6 * ymm11) + ymm12
	vmovsd	72(%rsi), %xmm6         ## xmm6 = mem[0],zero
	vmovsd	8(%rdx), %xmm12         ## xmm12 = mem[0],zero
	vfmadd231pd	%ymm12, %ymm6, %ymm0 ## ymm0 = (ymm6 * ymm12) + ymm0
	vbroadcastsd	%xmm12, %ymm12
	vfmadd213pd	%ymm4, %ymm1, %ymm12 ## ymm12 = (ymm1 * ymm12) + ymm4
	vmovsd	48(%rdx), %xmm4         ## xmm4 = mem[0],zero
	vfmadd231pd	%ymm4, %ymm6, %ymm3 ## ymm3 = (ymm6 * ymm4) + ymm3
	vbroadcastsd	%xmm4, %ymm4
	vfmadd213pd	%ymm5, %ymm1, %ymm4 ## ymm4 = (ymm1 * ymm4) + ymm5
	vmovsd	88(%rdx), %xmm5         ## xmm5 = mem[0],zero
	vfmadd231pd	%ymm5, %ymm6, %ymm7 ## ymm7 = (ymm6 * ymm5) + ymm7
	vbroadcastsd	%xmm5, %ymm5
	vfmadd213pd	%ymm8, %ymm1, %ymm5 ## ymm5 = (ymm1 * ymm5) + ymm8
	vmovsd	128(%rdx), %xmm8        ## xmm8 = mem[0],zero
	vfmadd231pd	%ymm8, %ymm6, %ymm9 ## ymm9 = (ymm6 * ymm8) + ymm9
	vbroadcastsd	%xmm8, %ymm8
	vfmadd213pd	%ymm10, %ymm1, %ymm8 ## ymm8 = (ymm1 * ymm8) + ymm10
	vmovsd	168(%rdx), %xmm10       ## xmm10 = mem[0],zero
	vfmadd213pd	%ymm2, %ymm10, %ymm6 ## ymm6 = (ymm10 * ymm6) + ymm2
	vmovupd	(%rsi), %ymm2
	vbroadcastsd	%xmm10, %ymm10
	vfmadd213pd	%ymm11, %ymm1, %ymm10 ## ymm10 = (ymm1 * ymm10) + ymm11
	vmovsd	32(%rsi), %xmm1         ## xmm1 = mem[0],zero
	vmovsd	(%rdx), %xmm11          ## xmm11 = mem[0],zero
	vfmadd231pd	%ymm11, %ymm1, %ymm0 ## ymm0 = (ymm1 * ymm11) + ymm0
	vbroadcastsd	%xmm11, %ymm11
	vfmadd213pd	%ymm12, %ymm2, %ymm11 ## ymm11 = (ymm2 * ymm11) + ymm12
	vmovsd	40(%rdx), %xmm12        ## xmm12 = mem[0],zero
	vfmadd231pd	%ymm12, %ymm1, %ymm3 ## ymm3 = (ymm1 * ymm12) + ymm3
	vbroadcastsd	%xmm12, %ymm12
	vfmadd213pd	%ymm4, %ymm2, %ymm12 ## ymm12 = (ymm2 * ymm12) + ymm4
	vmovsd	80(%rdx), %xmm4         ## xmm4 = mem[0],zero
	vfmadd231pd	%ymm4, %ymm1, %ymm7 ## ymm7 = (ymm1 * ymm4) + ymm7
	vbroadcastsd	%xmm4, %ymm4
	vfmadd213pd	%ymm5, %ymm2, %ymm4 ## ymm4 = (ymm2 * ymm4) + ymm5
	vmovsd	120(%rdx), %xmm5        ## xmm5 = mem[0],zero
	vfmadd231pd	%ymm5, %ymm1, %ymm9 ## ymm9 = (ymm1 * ymm5) + ymm9
	vbroadcastsd	%xmm5, %ymm5
	vfmadd213pd	%ymm8, %ymm2, %ymm5 ## ymm5 = (ymm2 * ymm5) + ymm8
	vmovsd	160(%rdx), %xmm8        ## xmm8 = mem[0],zero
	vfmadd213pd	%ymm6, %ymm8, %ymm1 ## ymm1 = (ymm8 * ymm1) + ymm6
	vmovupd	%ymm11, (%rdi)
	vpermpd	$144, %ymm12, %ymm6     ## ymm6 = ymm12[0,0,1,2]
	vblendpd	$1, %ymm0, %ymm6, %ymm0 ## ymm0 = ymm0[0],ymm6[1,2,3]
	vmovupd	%ymm0, 32(%rdi)
	vbroadcastsd	%xmm8, %ymm0
	vfmadd213pd	%ymm10, %ymm2, %ymm0 ## ymm0 = (ymm2 * ymm0) + ymm10
	vshufpd	$5, %ymm3, %ymm12, %ymm2 ## ymm2 = ymm12[1],ymm3[0],ymm12[3],ymm3[2]
	vpermpd	$230, %ymm2, %ymm2      ## ymm2 = ymm2[2,1,2,3]
	vinsertf128	$1, %xmm4, %ymm0, %ymm3
	vblendpd	$12, %ymm3, %ymm2, %ymm2 ## ymm2 = ymm2[0,1],ymm3[2,3]
	vmovupd	%ymm2, 64(%rdi)
	vperm2f128	$33, %ymm7, %ymm4, %ymm2 ## ymm2 = ymm4[2,3],ymm7[0,1]
	vbroadcastsd	%xmm5, %ymm3
	vblendpd	$8, %ymm3, %ymm2, %ymm2 ## ymm2 = ymm2[0,1,2],ymm3[3]
	vmovupd	%ymm2, 96(%rdi)
	vperm2f128	$33, %ymm9, %ymm5, %ymm2 ## ymm2 = ymm5[2,3],ymm9[0,1]
	vshufpd	$5, %ymm2, %ymm5, %ymm2 ## ymm2 = ymm5[1],ymm2[0],ymm5[3],ymm2[2]
	vmovupd	%ymm2, 128(%rdi)
	vmovupd	%ymm0, 160(%rdi)
	vmovlpd	%xmm1, 192(%rdi)
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
	nop
2 Likes

This is interesting. I think it would be very practical to do something like mul_as_Mat internally in the StaticArrays implementation of mul!, to temporarily copy the data into a more SIMD-friendly layout, and then rely on the compiler to elide the copying if possible. That puts the layout decisions in the optimizer where they belong (translation of VecElement into an appropriate LLVM type). I guess VecElement is not super expressive compared to things like the tensor types in MLIR, but it seems to be a whole lot better than what we’re doing at the moment.
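
A rough sketch of what I mean, just to make it concrete (hypothetical names and a toy 3x3 case, not actual StaticArrays internals): convert to a padded VecElement layout on the way into the kernel and back out afterwards, hoping the optimizer keeps everything in registers and elides the copies.

using StaticArrays

const VE = Core.VecElement
const Vec4 = NTuple{4,VE{Float64}}

# Pad a 3-element column out to one 4-wide LLVM vector.
@inline pad_col(a, b, c) = (VE(a), VE(b), VE(c), VE(0.0))

# Dense SMatrix -> tuple of padded columns (the SIMD-friendly layout).
@inline to_padded(A::SMatrix{3,3,Float64}) =
    ntuple(j -> pad_col(A[1, j], A[2, j], A[3, j]), Val(3))

# Padded columns -> dense, column-major SMatrix.
@inline from_padded(cols::NTuple{3,Vec4}) =
    SMatrix{3,3}(ntuple(k -> cols[(k - 1) ÷ 3 + 1][(k - 1) % 3 + 1].value, Val(9)))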

1 Like

I think that’s a good idea. To be fair, I should probably mention that there was slowdown in some cases as well, especially when the first matrix is much wider than it is tall. The decision on how to vectorize needs to be carefully calibrated. The code would probably look something like

if some_condition_on_sizes_and_element_types
    mul_as_Mat(A, B)
elseif some_other_condition
    mul_as_Mat(transpose(B), transpose(A)) |> transpose
else
    mul_as_SMatrix(A, B)
end

It looks like the second of the two needed PRs is on the verge of being merged into Julia, so this will all work in nightly builds very soon!

5 Likes

I assume you mean #34473? It looks like that just got merged and will probably be backported to julia 1.4 RC2 :tada:

4 Likes

Yes, that’s the one! I had all sorts of weird problems, including segfaults, and after merging that into Julia they just disappeared.

2 Likes

Okay, so I’ve been working on a pure-julia multi-threaded BLAS library leveraging LoopVectorization.jl and I’m excited to show off a sneak preview here.

For now, it’s called Gaius.jl (I’m open to naming suggestions) and it’s able to beat (multi-threaded) OpenBLAS for matrix sizes up to around 500 x 500.

Because Gaius.jl uses Julia’s PARTR for its threading, you can nest Gaius.jl’s matrix multiplication inside other multi-threaded functions without the destructive interference you’d see with Julia’s built-in matrix multiply (which uses OpenBLAS).
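
For example, something like this should compose cleanly (an illustrative toy sketch; with the OpenBLAS-backed mul! the nested tasks would instead contend with the BLAS thread pool):

using Gaius
using Base.Threads: @spawn

# Each task runs its own matrix multiply; Gaius' internal threading
# cooperates with these tasks through PARTR instead of fighting them.
function multiply_many!(Cs, As, Bs)
    @sync for i in eachindex(Cs)
        @spawn Gaius.mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end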

Here are some benchmarks:

[Benchmark plots: F64_mul, F32_mul, I64_mul, I32_mul]

Compared to the work @Elrod put into LoopVectorization, this isn’t anything too special; it’s basically just a wrapper around one of the GEMM kernels he showed above. My main contribution is the divide-and-conquer scheme for partitioning the arrays into chunks and doing many GEMMs recursively in parallel (roughly as sketched below).
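
Roughly, the idea looks like this (a simplified sketch with a placeholder serial kernel, not the actual Gaius.jl source; the real code also splits the shared dimension and only splits dimensions that are actually large):

using Base.Threads: @spawn

# Split C into quadrants, compute each quadrant in its own task, and fall back
# to a single-threaded kernel (e.g. a LoopVectorization @avx GEMM) below the
# cutoff. `serial_mul!` is a placeholder for that kernel.
function block_mul!(C, A, B; cutoff = 104)
    m, n = size(C)
    if max(m, n) <= cutoff
        return serial_mul!(C, A, B)   # hypothetical single-threaded kernel
    end
    h, w = m ÷ 2, n ÷ 2
    @sync begin
        @spawn block_mul!(view(C, 1:h,   1:w),   view(A, 1:h,   :), view(B, :, 1:w);   cutoff=cutoff)
        @spawn block_mul!(view(C, 1:h,   w+1:n), view(A, 1:h,   :), view(B, :, w+1:n); cutoff=cutoff)
        @spawn block_mul!(view(C, h+1:m, 1:w),   view(A, h+1:m, :), view(B, :, 1:w);   cutoff=cutoff)
        block_mul!(view(C, h+1:m, w+1:n), view(A, h+1:m, :), view(B, :, w+1:n); cutoff=cutoff)
    end
    return C
end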

@Elrod, I’d be happy to make you an owner of the repo; it’s mainly LoopVectorization.jl doing the real work here.

39 Likes

What’s the bump around size 100 (maybe 104) that I see in several of the plots?
I guess you’re restricting yourself to matrices with dimensions a multiple of 8; is that right?

2 Likes

Good eye! The bump is indeed at 104, and it’s when I turn on the multi-threading.

Gaius can work with matrices of any size, but it’ll do the recursive blocking in such a way that as many blocks as possible are of size 104 x 104. I just chose 104 as the cutoff because it seemed to give the best results in my limited testing.

The brief period where OpenBLAS beats Gaius is because OpenBLAS turns on its threads a bit earlier than Gaius.jl, but Julia’s threads have a significant enough startup cost that I found this to be a better choice over a wider parameter range. There’s definitely room for tweaking though!

1 Like

It is somewhat unsatisfying that in the practically important case of 64-bit floats there is really no gain.
Do you know why?

I’d argue that the gain over OpenBLAS for 64-bit floats is actually pretty significant over a wide range of matrix sizes; remember that these are log-log plots. For instance,

julia> using Gaius, BenchmarkTools, LinearAlgebra

julia> A, B, C = rand(104, 104), rand(104, 104), zeros(104, 104);

julia> @btime mul!($C, $A, $B);
  42.494 μs (0 allocations: 0 bytes)

julia> @btime Gaius.mul!($C, $A, $B);
  29.396 μs (32 allocations: 3.03 KiB)

julia> A, B, C = rand(50, 50), rand(50, 50), zeros(50, 50);

julia> @btime mul!($C, $A, $B);
  8.365 μs (0 allocations: 0 bytes)

julia> @btime Gaius.mul!($C, $A, $B);
  5.324 μs (0 allocations: 0 bytes)

though I’ll admit I cherry-picked the 104 x 104 example.

But as to your question “do you know why?”, the answer is just that BLAS libraries tend to be the most optimized, non-trivial pieces of code out there. I’d argue that even trading blows with OpenBLAS is a worthwhile goal, especially if it means that we could one day get rid of OpenBLAS in the julia installation and have a pure-julia BLAS library.

OpenBLAS is responsible for quite a significant amount of Julia’s start-up overhead, and it’s something we have little control over.

15 Likes

I agree with you 100%. But still, for integers your code is much more successful. Hence my question.

1 Like

Oh I see! OpenBLAS doesn’t do integer matrix multiplication, so Julia is forced to use its generic matrix-multiply fallback (which isn’t multi-threaded) when it encounters matrices of integers.

This is a real pain point, because it means that it’s currently actually faster to convert your matrix of Ints to Float64s before multiplying, but you don’t always want to do that, hence a big advantage of using Gaius.
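
Here’s a sketch of the kind of comparison I mean (no timings shown, since they’ll vary by machine):

using Gaius, BenchmarkTools, LinearAlgebra

A = rand(-100:100, 200, 200)
B = rand(-100:100, 200, 200)
C = zeros(Int, 200, 200)

@btime mul!($C, $A, $B);               # generic fallback, single-threaded
@btime Float64.($A) * Float64.($B);    # converting first is often faster, but allocates
@btime Gaius.mul!($C, $A, $B);         # stays in Int and uses threads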

4 Likes

Oh, duh. My bad. Thanks for this brilliant bit of work. I see a great future for pure-Julia LA.