Aligned Julia Arrays?

I’ve had some experience with this in the past, and it can be handled so that not much space is “wasted” at all.
I’m hoping that similar optimizations will be possible in the future for Julia.

In C++ (my mother tongue :yum:) the allocator design is a field that has been thoroughly explored by many devs and specialized for a lot of different use cases (small object allocator, concurrent allocator taking into account false sharing by Intel, numerous object of the same sizes,…). Most of the design of such allocator flies far above my head. I found that the strategy of proposing a default allocation policy that could be overridden for specific purposes was convenient.

Because of the Julia functional style with immutable struct, I suspect that the allocator design may fulfill specific constraint like being optimized for reusing recently released object…

I agree with foobar_lv2 that it would be nice if Julia memory_pool authors/specialists chime in and say something about your evolution proposal. If not I guess that you can propose a PR.

I can carry on my own experiments now with a local hack to enforce alignment and see if it actually improve performance.

So, I skimmed the allocator, and every object above roughly 2 KB (GC_MAX_SZCLASS) appears to be 64 byte aligned (via jl_gc_big_alloc and ultimately posix_memalign, or for very large arrays via a different mechanism. I think that the separate allocation of jl_array and the bulk data for very large arrays is in order to improve cache / paging during garbage collection runs?

If this is good enough for you, it might be worth opening an issue over, in order to get this documented and get warnings in the News.md if this changes in future versions; or, alternatively get a definite statement that you should not rely on this (which you still could do…).

OK. As I said, I would prefer to be able to control the array alignment than to check for a given size and coding prologues or fallbacks (even if this can be done of course). Don’t you think that we could wait a bit to see if a specialist of the language internals join the discussion and comment about that ?
Would a larger control over the array alignment be difficult to add to the language ?
In addition, it is not clear to me what kind of topic should be better discussed on discourse or on github. I have submitted a bug on github but this evolution query is not a bug… @mbauman answered clearly to the initial question but he may have something to say about this evolution query.

The way I see it, github is better for getting real experts (not me) chime in, and discourse is better for getting non-expert answers (like mine) quickly. But of course, the core devs are active here as well, and sometimes comment. I think opening an issue on github is not really a big escalation, once we get stymied here: worst cast, it will be closed quickly without wasting too much core-dev time.

Depends what you want. Allocating an aligned array? Probably can be done with a ccall into the runtime already, or would need only very mild updates. Checking whether an array is aligned? Easy-peasy. Making dispatch / the type-system aware of alignment issues? Very unlikely, imo.

The key is how you would want the alignment control to take effect. Do you want it just on a few Arrays that you explicitly construct yourself? Or do you want it to affect all Arrays? If it’s the latter, then what’s the interface for changing it? That’s where I might see the greatest “difficulty” in adding this to the language — not that it’s hard, but that adding command line switches or environment variables are not taken lightly. If it’s a function you call, then you’d still need to be able to handle any arrays created before you call the function.

Yes ! The idea would be to be able to specify an alignment constraint for a given array.
As a complete Julia newbie, my a priori ideas about the array interface modifications are certainly irrelevant. What I imagine is one or several specific constructor functions (or an optional alignment argument on existing ones) specifying that a given array (data) should respect a given alignment constraint. Something like

a=rand(Float64,1000,align64)

No. I only deals with my own arrays. But note that @ScottPJones proposed more global modifications for the allocation of small objects.

If your array is larger 2KB, it will be automatically 64-byte aligned, currently (and I think for the forseeable future).

Otherwise, you can just ccall an aligned malloc for your system and unsafe_wrap with own=true.

The only thing you don’t get this way, for arrays of sizeof(A) < 2KB is the nice inline layout where the data starts one cache-line after the header (which is good for the prefetcher when iterating from the start). But for arrays >16KB you don’t get this nice arrangement anyway, and I’d guess that is your primary usecase.

Yes, for small strings, in particular, the current structure makes it impossible to use the older SSE instructions (which require alignment).
If pools for short strings where created, to try to 1) keep the start of each string 16-byte aligned (necessary for the SSE instructions [not for AVX, that is only a limitation with the older instruction set]), and 2) to try to keep the strings in a single cache block, I believe it could substantially increase performance for strings, esp. if then code is written using the SSE / AVX instructions to take advantage of that layout).

(Note: I do have experience with doing just this sort of optimizing the layout of allocation pools).

OK, I think I was unclear (again). I want to use Julia Arrays instances that I build myself (not my own type/struct).
About the ccall, I don’t know how I could use it without a major hack in the code. I don’t believe that it is possible to construct a Julia Array from a user defined ptr.

Are Julia strings arrays of char ?

They used to be made from Vector{UInt8}, however general vectors have a large amount of overhead (because the structure is essentially the same as for an Array, and supports changing the size of the array, pointing to externally allocated memory, information about the number of dimensions, which ends up requiring extra space overhead (around 40 bytes for an Array on a 64-bit system).

As of v0.6, they are special, in that there is a UInt for the length, followed immediately by the bytes of the string, with one byte reserved for a \0.
However, since the memory for these is normally allocated 16-byte aligned (on a 64-bit system), it means that the start of the string itself is always misaligned for the older SSE instructions.
You can check this by looking at the values returned by pointer(str).

OK, so you could have a specific allocator for strings that will not interfere with Arrays… but because of their specific format the actual string data is always misaligned… BTW I am curious about the specific SIMD instructions you want to use for strings (I usually only deals with numbers) ? vectorized comparison ?

The SIMD operations I generally have used in the past are ones where I can widen (zero-extend) or narrow
chunks of characters whose codeunits are UInt8, UInt16, or UInt32, and also to conditionally flip bits depending on the results of another vectorized operation (i.e. upper or lowercase chunks of 16, 32, 64 ASCII characters all at once, to check chunks for characters in a certain range of UTF-16 (like surrogates) all at once, or for invalid or overlong characters for UTF-8, lots of different operations that are much much faster using the SIMD instructions.

Your way of constructing a guaranteed aligned Vector{T} on linux is by

function alloc_aligned_vec(::Type{T},dims...) where T
    assert(isbits(T))
    sz = sizeof(T)*prod(dims)
    align = 64
    pt = ccall(:aligned_alloc, Ptr{Void}, (Base.Csize_t, Base.Csize_t), align, sz)
    assert(pt != C_NULL)
    return unsafe_wrap(Array, convert(Ptr{T},pt), dims, true)
end

usable by e.g.

julia> A=build_aligned_vec(Int,2,2), pointer(A)
([94266412571552 6872887083333350261; 7666359528109531950 127979076609647], Ptr{Int64} @0x000055bc1c219840)

Depending on your needs, include error handling (allocation failure), overflow checking, and possibly support for windows.

Possibly also try to avoid leaking memory in case of allocation failure? But these are all cosmetics that need to be done super-duper correctly in Base and can be ignored for most numerics packages.

4 Likes

Thank you very much ! I did not knew about unsafe_wrap function.

OK, thank you. This is Impressive ! I did not knew that string treatments could be so highly optimized.

I spent a good deal of time over almost 30 years optimizing the performance of a language/database system that was originally based on the concept that everything is a string :grinning: (two other semantic types were added later, IEEE binary doubles, and objects, but semantically there is no difference between a number and a “canonic numeric” string [i.e. a string that follows certain rules such that you can go from string <=> number, i.e.
"1" is canonic, but "1.000" is not, and “1.235” is canonic, but “+1235e-3” is not., even though the numeric interpretations are equal, i.e. +"1" is equal to +"1.000")
Since the language is so heavily string oriented, you can understand why no stone was left unturned in making them fast as possible.