Hi,
Is it possible to control the data alignment of Julia Arrays (for example to enable direct use of
the vloada
function of SIMD.jl package) ?
What do you mean?
Julia arrays of isbits
types are stored in column major format.
If you want to change this, you could define your own array type and specify a different getindex function.
I was unclear.
I wonder if it is possible to control the alignment of the memory allocated for Julia Arrays to enforce the first array element address to be a multiple of N=(32) bytes for AVX2.
This is enforced by a (Julia) compile-time constant for Array
instantiation right now — the first element is aligned to 16 bytes. So, no, it’s not terribly easy to control that.
I would not worry about this too much. If you only care about vectorized code, most aligned instruction should just work fine on unaligned data and simply become slow.
As far as I understood, large arrays are always JL_CACHE_BYTE_ALIGNMENT = 64
byte aligned, if they are longer than ARRAY_INLINE_NBYTES = (2048*sizeof(void*))
= 16 KB. So if you are adressing into a large matrix, then you always control the alignment.
Short testing (on linux 64 bit, both 0.62 and 0.7) got me better alignment in practice most of the time for even smaller arrays:
julia> for i=1:1000 AA=zeros(UInt8,1952); assert(0x40== pointer(AA)-pointer_from_objref(AA)); assert(0==reinterpret(UInt64,pointer(AA)) & UInt64(63)) end
julia> for i=1:1000 AA=zeros(UInt8,1951); assert(0x40== pointer(AA)-pointer_from_objref(AA)); assert(0==reinterpret(UInt64,pointer(AA)) & UInt64(63)) end
ERROR: AssertionError:
Do you know if that is set to 128 for the POWER architecture? (if not, it should be)
OK, thank for your answers. So it is 16 bytes (minimum) or 64 bytes for large arrays.
I don’t really understand this:
I believe that an aligned load instruction will crash on unaligned data. In addition, becoming slow is quite problematic if you engage a vectorization process…
Anyway thank you for indicating the JL_CACHE_BYTE_ALIGNMENT and ARRAY_INLINE_NBYTES constants. It will help me to find how the allocation is done.
I guess that this choice for allocation has been carefully made. I clearly do not know what would imply to allow more control on this choice. If it could be done I believe that it could be useful for vectorization via package like SIMD.jl.
I think not, I found the original question interesting and looked it up in the source code here.
However, the allocator will often produce better alignment than requested, just as shown in the experiment. I haven’t read the code of the impressively fast pool allocator, but if the better alignments for shorter arrays are in fact reliable (and not just true in practice) one might document this behavior.
IMHO often is problematic because you can’t use aligned access instructions in your code.
It’s done here.
No, I think it’s black magic, and we could get better alignment for free: I think that the pool allocator doesn’t like to mix tiny and medium objects anyway, and medium-sized objects appear to get good alignment today. Maybe it’s just a documentation issue? But I have not yet read the code for the pool allocator.
What instructions do you plan to use? As far as I understood, some aligned vector load instructions check alignment and crash, but other vector loads are just as fast for aligned data (but become slow if alignment fails). For this one, you should really look at the processor manual / google for explanations and cycle counts. I don’t know the relevant instructions by heart, normally not an assembly programmer.
Thank you for the link.
Ok, now I think I understand. You say that I can use the safe vload
instruction of the SIMD.jl package instead of the unsafe vloada
without significant performance penalty. It probably depends on the target architecture. It was not the case in the past but I must experiment on current architecture.
The other possibility is to check the alignment in the beginning and branch to a different implementation if it is misaligned. Or use scalar instructions for the first few array elements until you get to an element with the desired alignment, then switch to SIMD.
For shorter things, that can slow things down enough to be significant.
I’m hoping that at some point, small strings can be better aligned (they are always misaligned currently,
because of the 4/8 byte length, that is at the start of each allocated 16 byte chunk.
It might make sense to allocate small strings out of a pool that always allocates 16-byte chunks at an offset of 8.
Yes, but it is more work and an additional test that can cost a few cycles
The constants defined in src/julia_internal.h should allow me to study the merit of larger alignment constraints.
If am able to show some cases where larger alignments provide significative performance improvements, I guess that it may be a sound base to discuss the opportunity for an evolution of the Julia’s allocation alignment policy.
Don’t trust stuff I half-remember, experiment, google and consult the processor manual. I really am no uarch expert.
But yes, that’s what I was saying.
One important thing to note: the older instructions (SSE, maybe SSE2?) require 16-byte alignment in many cases.
OK. I only consider to increase the alignment constraint to 32 or 64 bytes for arrays. The problem of doing it globally via the internal definition of Julia is that it will waste some memory space for small arrays and may reduce global performance of applications. But it is not a problem for local experiments.
I have implemented a C++ ND-array library called Legolas++ which allow to control the underlying data layout with type parameters (slides). The idea was to allow for automatic vectorization of arbitrary algorithms applied to a collection of equally sized problems. I think that it can be efficiently written in Julia but the control of the memory alignment would greatly simplify this work.
No, I was thinking of having the pools used for small allocations set up so that things would be better aligned for their use, to avoid crossing cache lines also, for example.
It would not waste any more space than the current scheme.
OK I see, for a memory pool it should work. But if you have to deal with a large quantify of small objects, with sizes not being multiple of 16, then you will need some extra space… A kind of 1D Tetris problem . This concern differs largely from mine where I want to play on very large chunks of memory and avoid to deal with nonalignment prologues.
Because of the variety of situations, I guess that in the end, a kind of control on the allocator may be interesting to have.
I think that enforcing 64-byte alignment on >= 2KB objects, plus guaranteed padding to a multiple of 64 byte (so that you can’t corrupt memory by overwriting the trailing rest of the last line) would not be too expensive and be very nice for a lot of code (and I have the feeling that this is already the case, only undocumented and hence open to changes without warning). But at this point someone who actually understands the pool allocator should chime in.
Depending on how your target arch reacts to alignment failures for the instructions you use, you can then get a very simple prologue: Either nothing (tiny objects are still handled correctly, only slower, but being fast on them is not your goal anyway), or do an alignment check and punt to generic fall-backs in case of alignment failure.