Should memory allocation alignment be increased to e.g. 64 (or 256) bytes?

I see Julia allocations are 16-byte aligned (also when going straight to Libc.malloc).

I’m thinking of a 32-bit pointer (Ptr32) idea that wouldn’t have byte-addressability, i.e. it would only be used for base pointers; if you then add e.g. 1 or any other offset, you’d get promoted to a regular Ptr.
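
Roughly what I have in mind, as a minimal sketch (the Ptr32 name, the 64-byte granularity, and these method definitions are all my assumptions; nothing here exists in Base):

struct Ptr32{T}
    idx::UInt32  # address ÷ 64, so 2^32 slots × 64 bytes spans 256 GB
end

# Reconstruct the full 64-byte-aligned address from the compressed index:
Base.convert(::Type{Ptr{T}}, p::Ptr32{T}) where {T} = Ptr{T}(UInt64(p.idx) << 6)

# Adding any byte offset leaves the 64-byte grid, so promote to a regular Ptr:
Base.:+(p::Ptr32{T}, offset::Integer) where {T} = convert(Ptr{T}, p) + offset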

Even without my idea, I’m not sure the status quo is better when you have 64-byte cache lines.

It works for correctness, but is there a drawback when different unrelated allocations share a cache line?

I think it might actually be ok for single-threaded code, or if you have separate threads that hit the same L1 cache.

But if not, the cache line will still be fine in L2+ (or at least in a shared L3), yet only one core’s L1 can own it for writing at a time, so does it ping-pong between cores, slowing things down (i.e. false sharing)?
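
That worry is easy to demonstrate; here’s an illustrative sketch (run with julia -t 2 or more; the struct layouts and iteration counts are just my choices for the demo):

using Base.Threads

mutable struct Packed
    @atomic a::Int
    @atomic b::Int       # lands on the same 64-byte cache line as a
end

mutable struct Padded
    @atomic a::Int
    pad::NTuple{7,Int}   # 56 bytes of padding pushes b a full cache line away
    @atomic b::Int
end

function bump!(c, n)
    t1 = @spawn for _ in 1:n; @atomic c.a += 1; end
    t2 = @spawn for _ in 1:n; @atomic c.b += 1; end
    wait(t1); wait(t2)
end

@time bump!(Packed(0, 0), 10_000_000)                     # shared line ping-pongs between cores
@time bump!(Padded(0, ntuple(_ -> 0, 7), 0), 10_000_000)  # separate lines, typically much faster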

This may be a rare issue in practice, so is it more valuable not to waste memory?

It seems to me a 64-byte minimum allocation isn’t too bad; most allocations are that large or larger anyway. My Ptr32 idea is actually meant to optimize Strings, so don’t worry too much about smaller allocations.

Trees (and linked lists) need a minimum of two pointers, so 16 bytes, yes, but the nodes usually carry something more; also, aren’t B-trees becoming popular for in-RAM use too?
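
For scale, a hypothetical binary-tree node (the layout is mine, purely for the size arithmetic):

mutable struct BinNode
    key::Int
    left::Union{BinNode,Nothing}
    right::Union{BinNode,Nothing}
end

julia> sizeof(BinNode)  # 8 + two 8-byte pointers, before Julia's object header
24

And a B-tree node with, say, 7 keys and 8 child pointers is already 7×8 + 8×8 = 120 bytes, i.e. about two full cache lines, where a 64-byte minimum stops mattering.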

What are the main arguments for or against larger (or smaller) minimum allocation sizes?

Even large allocations are only 16-byte aligned, at least with Libc, but couldn’t they at least be 64-byte (or more) aligned?
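
You can already get that by hand from Libc; a sketch via posix_memalign (POSIX-only, so Windows would need _aligned_malloc instead; the helper name is mine):

function aligned_malloc(sz::Integer, align::Integer = 64)
    p = Ref{Ptr{Cvoid}}(C_NULL)
    ret = ccall(:posix_memalign, Cint, (Ref{Ptr{Cvoid}}, Csize_t, Csize_t), p, align, sz)
    ret == 0 || throw(OutOfMemoryError())
    return p[]
end

p = aligned_malloc(128)   # 128 bytes, 64-byte aligned
@assert UInt(p) % 64 == 0
Libc.free(p)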

I might be wrong, and this information might be outdated, but isn’t 32-bit alignment optimal for x86_64 CPU performance? As far as I am aware (again, this may be outdated) there is a performance penalty for loading/storing data which is not 32-bit aligned, but no performance penalty (or benefit) for data which is 64-bit aligned.

My knowledge of ASM is limited, however. Someone with more knowledge can either confirm or correct me.

You need 64-bit alignment for some types (like pointers) on some CPUs, though 32-bit alignment suffices even for some 64-bit types (like Float64), at least on some CPUs.

But this isn’t too relevant here; I’m talking about the alignment of whole allocations in bytes. They would still be 32-bit (4-byte) and 64-bit (8-byte) aligned too, just spaced much further apart than that, as is already done.
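
For reference, Julia already tracks per-type alignment requirements, queryable with an internal helper (internal API, so it may change between versions):

julia> Base.datatype_alignment(Float64)
8

julia> Base.datatype_alignment(Int32)
4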

The memory allocations are basically of structs (or one struct), so the question is sort of: what’s the minimum useful struct, and how much memory are you willing to waste? Say, for a 32-byte struct, are you ok with allocating 64 bytes?

Libc also has some per-allocation overhead (does anyone know how much, or how little?).
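
One way to probe that overhead is glibc’s malloc_usable_size (glibc-specific and non-portable, just illustrative):

julia> p = Libc.malloc(1);

julia> ccall(:malloc_usable_size, Csize_t, (Ptr{Cvoid},), p)  # 24 usable bytes for a 1-byte request on x86-64 glibc
0x0000000000000018

julia> Libc.free(p)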

I suppose a 16-byte allocation is already way larger in practice, and most allocations aren’t single small structs; the only worry is trees, so how large are the nodes there?

https://www.nic.uoregon.edu/~khuck/ts/acumem-report/manual_html/ch03s02.html

Common cache line sizes are 32, 64 and 128 bytes.

I believe 64 bytes is most common by now (though e.g. Apple’s M-series chips use 128-byte cache lines).

The original Pentium 4 had 64 byte L1 cache lines and 128 byte L2 cache lines, apparently. … I would imagine designs that require the L1 cache be a subset of the L2 cache would keep the line sizes the same.

It might be bad for performance to allocate 64 bytes and often use only the first half if the cache line is 32 bytes (not really if it’s 64 bytes), but I think 32-byte cache lines are dying, if not dead already.

We only need to optimize for cache lines of 64 bytes and up; I think line sizes are increasing over time and will only trend upward, never get smaller again.

I didn’t really know we had 128-byte lines at some level. I think we need to assume the smallest line size across all levels.

Even the Pi 5 has 64-byte lines at all levels:
https://forums.raspberrypi.com/viewtopic.php?t=363247
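
If you want to check on your own machine, Linux exposes the line size in sysfs (standard path on Linux; macOS would need sysctl hw.cachelinesize instead):

julia> parse(Int, readchomp("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"))
64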

With 64-byte alignment of allocations, my 32-bit pointers could address 256 GB of memory; ideally I would like 1 TB of RAM addressability, which would mean 256-byte alignment. I could go even higher, with tricks.
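
The arithmetic (plain unit math, given the alignments above):

julia> 2^32 * 64 / 2^30   # 64-byte slots: GiB addressable
256.0

julia> 2^32 * 256 / 2^40  # 256-byte slots: TiB addressable
1.0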

I discovered the minimum allocation is actually already 64 bytes (the alignment might still be 16 bytes; my unreliable tests below sometimes make it seem larger). At least for Vectors:

julia> @time zeros(UInt8, 1)
  0.000003 seconds (2 allocations: 64 bytes)

then:

julia> @time zeros(UInt8, 9)
  0.000004 seconds (2 allocations: 80 bytes)

while on 1.10.5 you can go all the way up to 15 before jumping to that next larger size:

julia> @time zeros(UInt8, 15)
  0.000005 seconds (1 allocation: 64 bytes)

I’m also investigating a speed regression in 1.11, which might be related to this.

The next jump past 80 bytes happens already at 24 in 1.11, whereas 1.10.5 allows up to 31 within 80 bytes:

julia> @time zeros(UInt8, 31)
  0.000006 seconds (1 allocation: 80 bytes)

Since Arrays were implemented in C before 1.11, might pre-1.11 have some hidden allocations that aren’t shown? Or does 1.11 actually have double the allocation counts and more size overhead?

It seems I have 32-byte alignment, though not always:

julia> bitstring(pointer(zeros(UInt8, 1)))  # 64-byte allocation...
"0000000000000000011111110000011101000010110110101111001001100000"

It *seems* like I have 128-byte alignment on 1.10.5 for allocations up to 128 bytes:
julia> bitstring(pointer(zeros(UInt32, 128÷4)))  # 128-byte allocation
"0000000000000000011111110000111010101010100110000000111101000000"

But then only 64-byte alignment for larger sizes (which might be an illusion; it seems like 32-byte on 1.11):
julia> bitstring(pointer(zeros(UInt32, 132÷4)))  # 132-byte allocation...
"0000000000000000011111110000111010101010100011100010111011000000"


julia> @time zeros(UInt16, 0)  # weird exception for 0-byte allocation, and larger for UInt8
  0.000003 seconds (1 allocation: 48 bytes)
UInt16[]