Julia alignas: is there a way to specify the alignment of Julia objects in memory?

In C++ there is alignas (and related stuff like alignof):
https://en.cppreference.com/w/cpp/language/alignas

alignas enables specifying, for example, that an array’s storage should be aligned to 64 bytes in memory, which is useful if it is known that the array will be large and that it will be operated on by SIMD instructions. It can also be useful when one wants to ensure that the data will fit in CPU cache nicely.

As far as I can tell, such features are missing from Julia. Would it be preferable to open a thread in Internals & Design - JuliaLang, or to open an issue on GitHub?

You can allocate an Array using posix_memalign, for example:

function alignedvec(::Type{T}, n::Integer, alignment::Integer=sizeof(Int)) where {T}
    @static if Sys.iswindows()
        return Array{T}(undef, n)
    else
        ispow2(alignment) || throw(ArgumentError("$alignment is not a power of 2"))
        alignment ≥ sizeof(Int) || throw(ArgumentError("$alignment is less than $(sizeof(Int))"))
        isbitstype(T) || throw(ArgumentError("$T is not a bitstype"))
        p = Ref{Ptr{T}}()
        err = ccall(:posix_memalign, Cint, (Ref{Ptr{T}}, Csize_t, Csize_t), p, alignment, n*sizeof(T))
        iszero(err) || throw(OutOfMemoryError())
        return unsafe_wrap(Array, p[], n, own=true)
    end
end

but posix_memalign is unavailable on Windows, so it has to fall back to the default allocator in that case (Julia defaults to 16-byte alignment). If the alignment is just an optimization, however, that may not be so terrible.

It doesn’t seem crazy to add an alignment keyword argument to the Array constructor, but the difficulty of supporting Windows is (as usual) a pain point. (You can have an aligned-memory allocator on Windows, but then you can’t use free to deallocate it, and so you’d need lower-level hacks for garbage-collection to work.)
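A workaround that is portable (including to Windows) is to over-allocate an ordinary Array and hand out a contiguous view starting on the desired boundary; the parent array keeps the memory alive for the GC. A minimal sketch (the name alignedview is hypothetical, not an existing API):

```julia
# Over-allocate a plain Vector and return a SubArray whose first element
# lands on the requested byte boundary. `alignedview` is a hypothetical
# helper, not part of Base or any package.
function alignedview(::Type{T}, n::Integer, alignment::Integer=64) where {T}
    ispow2(alignment) || throw(ArgumentError("$alignment is not a power of 2"))
    isbitstype(T) || throw(ArgumentError("$T is not a bitstype"))
    alignment % sizeof(T) == 0 ||
        throw(ArgumentError("$alignment is not a multiple of sizeof($T)"))
    pad = alignment ÷ sizeof(T)          # extra elements guaranteeing a boundary
    buf = Vector{T}(undef, n + pad)
    skip = (alignment - UInt(pointer(buf)) % alignment) % alignment  # bytes to skip
    @assert skip % sizeof(T) == 0        # holds whenever sizeof(T) divides 16
    i0 = Int(skip) ÷ sizeof(T) + 1
    return view(buf, i0:i0 + n - 1)
end

v = alignedview(Float64, 1000)
UInt(pointer(v)) % 64 == 0   # true: data starts on a 64-byte boundary
```

Since the result is a contiguous SubArray, pointer(v) and @simd loops over it work as with a Vector; the cost versus posix_memalign is a few pad elements and the view indirection.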


Array alignments:

#define JL_SMALL_BYTE_ALIGNMENT 16
#define JL_CACHE_BYTE_ALIGNMENT 64

Array size threshold:

// how much space we're willing to waste if an array outgrows its
// original object
#define ARRAY_INLINE_NBYTES (2048*sizeof(void*))

Array allocation. I didn’t look too closely, but calls to functions like jl_alloc_array_1d eventually forward to _new_array_, which branches on ARRAY_INLINE_NBYTES and calls JL_ARRAY_ALIGN to “align whole object” with either JL_SMALL_BYTE_ALIGNMENT or JL_CACHE_BYTE_ALIGNMENT.
For Float64, note that

julia> 256 * sizeof(Float64)
2048

Experimentally, the threshold seems to be 245 elements: Vector{Float64}(undef, 245) and larger are always aligned to 64 bytes.

julia> any((reinterpret(UInt, pointer(Vector{Float64}(undef, 245))) % 64) ≠ zero(UInt) for i ∈ 1:1000)
false

julia> any((reinterpret(UInt, pointer(Vector{Float64}(undef, 244))) % 64) ≠ zero(UInt) for i ∈ 1:1000)
true
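For convenience, the check above can be wrapped in a tiny helper (the name isaligned is made up here, not an existing API):

```julia
# Hypothetical helper: does the array's data start on an `alignment`-byte boundary?
isaligned(A::AbstractArray, alignment::Integer=64) =
    UInt(pointer(A)) % alignment == 0

# The 245-element threshold is an implementation detail of the allocator,
# so it is worth re-checking on your own Julia version and platform:
count(isaligned(Vector{Float64}(undef, 245)) for _ in 1:1000)
```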

I guess adding something like std::assume_aligned from C++20 is a more viable option, then? I guess it would be called something like unsafe_assumealigned in Julia. The programmer would use it to tell Julia (and then LLVM) what the alignment of a pointer obtained through posix_memalign is.

You’d first have to make a compelling case for this providing a real benefit (i.e. much faster code is generated by LLVM).


I don’t think that’s the case, at least on recent x64 hardware.
Aligned loads/stores are faster than unaligned ones, but the aligned move instructions themselves aren’t faster (move instructions are what loads/stores compile to); they merely crash if the address is unaligned.
So the benefit of promising alignment isn’t performance. The benefit is free runtime checks (free from a performance perspective) that the memory really is aligned: you’ll be notified by a segfault if you’re wrong, rather than silently getting worse performance.

EDIT:
Some compilers, like GCC, will often generate alignment checks plus peel code to reach alignment in front of loops. So std::assume_aligned could let the compiler skip these checks, which would have a performance benefit.

Also:

using VectorizationBase: assume
function mydot_aligned(x,y)
    s = zero(promote_type(eltype(x),eltype(y)))
    assume((reinterpret(UInt, pointer(x)) % (64 % UInt)) == zero(UInt))
    assume((reinterpret(UInt, pointer(y)) % (64 % UInt)) == zero(UInt))
    @inbounds @simd for i in eachindex(x,y)
        s += x[i]*y[i]
    end
    s
end

produces this SIMD loop (@code_native):

L176:
        vmovapd zmm4, zmmword ptr [rax + 8*rsi]
        vmovapd zmm5, zmmword ptr [rax + 8*rsi + 64]
        vmovapd zmm6, zmmword ptr [rax + 8*rsi + 128]
        vmovapd zmm7, zmmword ptr [rax + 8*rsi + 192]
        vfmadd231pd     zmm0, zmm4, zmmword ptr [rcx + 8*rsi] # zmm0 = (zmm4 * mem) + zmm0
        vfmadd231pd     zmm1, zmm5, zmmword ptr [rcx + 8*rsi + 64] # zmm1 = (zmm5 * mem) + zmm1
        vfmadd231pd     zmm2, zmm6, zmmword ptr [rcx + 8*rsi + 128] # zmm2 = (zmm6 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, zmmword ptr [rcx + 8*rsi + 192] # zmm3 = (zmm7 * mem) + zmm3
        add     rsi, 32
        cmp     rdx, rsi
        jne     L176

Notice the vmovapds instead of vmovupds.
So this does work to tell LLVM about alignment.

EDIT:
Maybe I’m wrong:

julia> x = rand(256);

julia> y = rand(256);

julia> @btime mydot($x,$y)
  10.032 ns (0 allocations: 0 bytes)
67.11501240811893

julia> @btime mydot_aligned($x,$y)
  8.534 ns (0 allocations: 0 bytes)
67.11501240811893

julia> @btime mydot($x,$y)
  10.031 ns (0 allocations: 0 bytes)
67.11501240811893

julia> @btime mydot_aligned($x,$y)
  8.530 ns (0 allocations: 0 bytes)
67.11501240811893

mydot is the same, except I commented out the assumes.
EDIT:
restarted Julia:

julia> x = rand(256);

julia> y = rand(256);

julia> @btime mydot($x,$y)
  11.590 ns (0 allocations: 0 bytes)
63.81066585474556

julia> @btime mydot_aligned($x,$y)
  11.885 ns (0 allocations: 0 bytes)
63.81066585474556

julia> @btime mydot($x,$y)
  11.589 ns (0 allocations: 0 bytes)
63.81066585474556

julia> @btime mydot_aligned($x,$y)
  11.960 ns (0 allocations: 0 bytes)
63.81066585474556

Was probably just noise. Sometimes functions are just randomly faster or slower for no discernible (by me) reason in a manner that is consistent within a Julia session, but not between Julia sessions/recompilations.


I’m currently doing some experiments with C++20 and Clang 11 on a laptop with a recent Intel CPU, and it seems that using alignas and assume_aligned can give significant performance improvements. Basically, I’m giving two huge arrays with parameterized alignment to some toy functions to process (the result goes into one of the arrays), and for some of those simple functions, aligning to 256-byte boundaries gives up to 25% higher throughput than the default alignment.

To be more specific, each toy function takes two bytes as input and returns a single byte; and then I apply them to the arrays.

Will post results and describe the experiments in detail after trying out some other stuff like Polly and OpenMP and making visualizations.

Wait, what? I can’t find VectorizationBase on JuliaObserver, and it’s not part of Julia proper. Is it some proof-of-concept implementation you made yourself?

What about alignas without assume_aligned?

Use JuliaHub, not JuliaObserver.
(Note, it says tests are failing because CI’s been temporarily disabled by GitHub staff)


Well, assume_aligned is only useful in my experiments when the compiler “forgets” about the alignment it already promised as part of the alignas. They’re basically complementary tools as far as I understand, but assume_aligned shouldn’t really make much of a difference for the effects of alignas conceptually.

I didn’t/don’t think the compiler knowing about alignment helps, except when it lets the compiler skip alignment checks. That is, for unaligned loads/stores that cross vector-width boundaries, vmovupd has half the throughput it has when aligned, while vmovapd segfaults.
But if the loads are aligned / don’t cross such a boundary, vmovapd and vmovupd are equally fast (i.e., twice as fast as the boundary-crossing unaligned vmovupd).

Which would mean that, because big arrays in Julia are aligned to 64 bytes automatically, you should be able to get full performance without needing to tell the compiler about it.

But they’re not aligned to 256 bytes. Out of curiosity, why do you need 256 bytes?
64 bytes matches x64 cache-line sizes as well as the largest SIMD registers on x64, and 4096 matches pages. I hadn’t heard of 256 bytes before, so I’m curious about the reason.

An exception would be non-temporal stores like vmovntpd, which do require alignment. But a compiler isn’t likely to use those automatically, AFAIK.


Oh, I don’t need 256 bytes, I’m just playing around. You’re probably correct about 256 byte alignment being unnecessary, I just chose it because with only two arrays in each experimental program, giving them too much alignment is not an issue.

I’m just making uneducated guesses, but it could be that the compiler decides not to vectorize, or to use suboptimal vectorization, if it doesn’t know about the alignment. Another guess is that the eliminated branches help with code fitting in L1 cache or something like that.

Hey, I don’t really know anything about that llvmcall stuff, so I have to ask: it seems that assume is not actually used anywhere in VectorizationBase, so are you sure it’s actually functional?

Yes.
From my example:

L176:
        vmovapd zmm4, zmmword ptr [rax + 8*rsi]
        vmovapd zmm5, zmmword ptr [rax + 8*rsi + 64]
        vmovapd zmm6, zmmword ptr [rax + 8*rsi + 128]
        vmovapd zmm7, zmmword ptr [rax + 8*rsi + 192]

Notice the vmovapd? These are aligned loads.

VectorizationBase mostly just defines functions for other libraries to use. ThreadingUtilities.jl and LoopVectorization.jl both use VectorizationBase.assume.

(EDIT: But you should always double check when using assume.)
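For reference, the llvmcall behind such an assume is short. Here is a sketch along the lines of what VectorizationBase does (the actual package source may differ), wiring Julia’s Bool through to LLVM’s llvm.assume intrinsic:

```julia
# Sketch: expose LLVM's `llvm.assume` intrinsic to Julia. A Bool arrives
# in the IR as an i8, so it is truncated to an i1 before the intrinsic call.
@inline function myassume(b::Bool)
    Base.llvmcall(
        ("declare void @llvm.assume(i1)",
         """
         %b = trunc i8 %0 to i1
         call void @llvm.assume(i1 %b)
         ret void
         """),
        Cvoid, Tuple{Bool}, b)
end

# Used like VectorizationBase.assume: promise a property to LLVM.
x = Vector{Float64}(undef, 256)
myassume(UInt(pointer(x)) % 64 == 0)
```

If the promised condition is ever false at runtime, the optimizer is free to miscompile around it, hence the advice above to always double check.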


valloc in SIMD.jl allows you to specify an alignment.

using SIMD
for i ∈ 1:10
    a = valloc(UInt8, 1024, 8)  # valloc(T, N, sz): sz elements aligned to N*sizeof(T) bytes
    b = valloc(UInt8, 2048, 8)
    Int(pointer(a)) % 2048 == 0 || println("a not aligned to 2048")
    Int(pointer(b)) % 2048 == 0 || println("b not aligned to 2048")
end