Improved allocation design, with 4-byte pointers, and sometimes 5-byte in effect

I want to make default allocations faster than default allocations in C, C++, Rust, etc. (and actually needed less often), and also make them more space-efficient, along with the structs containing the Refs/pointers to those allocations (important for cache effects). [It is similar to what Java does with compressed pointers, but I want to learn from their mistakes(?) and from what they do right, see at the bottom.]

C++ (and I believe Rust) has many pointer types, but as far as I know they are all 8 bytes on 64-bit platforms, and that comes with a downside Julia can get around.

I’ve thought about the smallest allocations (trees and linked lists; the minimum allocation in Julia is currently 16 bytes), about very large allocations, and about typical sizes in between.

Part of the idea is to make 4-byte pointers the default, i.e. giving up the ability to point to arbitrary bytes in memory with them, while keeping the current 8-byte Julia Ptr available, as-is.

A second part of the proposal is to get away with fewer pointers, not just smaller ones, and in many cases they would be null pointers, lessening the load on the GC even further.

So to show an example:

julia> struct personal_info
         last_name::String  # Currently an 8-byte pointer under the hood; String is the hardest Julia type to change, though @edit does work for it
         first_name::String
         post_code::String
         country::String
       end

julia> sizeof(personal_info)  # NOT the true size
32

julia> @time (p = personal_info("Haraldsson", "Páll", "220", "Iceland"); println(p.last_name))  # It's actually 4 allocations for p; printing adds the rest, and is needed so nothing gets optimized away
Haraldsson
  0.000077 seconds (6 allocations: 128 bytes)

So, 128 bytes allocated, really 4 allocations for the strings, plus 32 bytes for the struct itself, i.e. likely 1 more allocation for it (or rather 1 for an array of such structs, so the allocation for the struct itself can usually be ignored).

Phase 1:

If you use Ptr32, the struct itself goes down to 16 bytes, but you have a dilemma: with 16-byte alignment the pointers can only address 64 GB of memory, while not wasting any space on the heap (ok as a configuration option), but we could go up to 512 GB (my choice, with 128-byte alignment) or even 16 TB of addressability with 4KB-aligned pointers.
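In effect (a rough sketch only; Ptr32, HEAP_BASE and ALIGN_SHIFT are made-up names, not existing Julia internals):

primitive type Ptr32 32 end

const ALIGN_SHIFT = 7   # 128-byte blocks: 2^32 * 128 = 512 GB addressable
                        # (shift 4, i.e. 16-byte blocks, gives 64 GB; shift 12 gives 16 TB)
const HEAP_BASE   = UInt64(0x7000_0000_0000)  # wherever the object heap was reserved

decompress(p::Ptr32) = Ptr{Cvoid}(HEAP_BASE + (UInt64(reinterpret(UInt32, p)) << ALIGN_SHIFT))
compress(p::Ptr{Cvoid}) = reinterpret(Ptr32, UInt32((UInt(p) - HEAP_BASE) >> ALIGN_SHIFT))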

But with a 128-byte minimum allocation (I think Common Lisp may be adopting such) and alignment, each node of e.g. a binary tree wastes space: it takes 8x for a 16-byte node, though likely only 5.3x or less for a 24-byte one, i.e. with 81.25% of the allocated space unused (that’s the worst real-world case; usually the overhead is a lot less, even without phase 2). I think it’s actually ok for the first phase, especially if the GC can track two types of pointers, this new one and the old one, with no overhead (we could also just GC-track the smaller kind).
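To make that worst case concrete, assuming a 24-byte node (two pointers plus a header), which is where the ~5.3x figure comes from:

julia> 128 / 16          # 16-byte node rounded up to a 128-byte block
8.0

julia> 128 / 24          # 24-byte node rounded up to a 128-byte block
5.333333333333333

julia> (128 - 24) / 128  # fraction of the block left unused
0.8125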

Phase 2:

To lessen the overhead, what can be done is, in effect:

julia> struct personal_info
         last_name::Ptr32  # Implemented with UInt32; would still look like a String to users
         first_name::UInt8  # These extra fields could actually be optimized away, moved to the heap
         post_code::UInt8
         country::UInt8
       end

julia> sizeof(personal_info)  # 8 (or even just 4), plus the real text data, 24 bytes, which would likely be padded to e.g. 128
8

I.e. the String for last_name, i.e. its pointer, is shared, and then you only have offsets into that string for first_name etc. The offsets stack on top of each other; they could also be implemented with e.g. UInt16.

It’s a bit complex for the compiler to convert the first form to that form, but not unthinkable. As a first step I would implement a SIMDString4 shaped just like that personal_info, which could be used (e.g. if it ends up as the default String type) for anything from 1 up to 4 strings, though most users wouldn’t want to use it directly, and it would need numeric indexing…
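As a sketch of the layout I have in mind for SIMDString4 (names and field widths are just my current choice):

struct SIMDString4
    data::UInt32   # compressed pointer to one shared byte buffer
    off2::UInt8    # where string 2 starts (strings stored back-to-back)
    off3::UInt8    # where string 3 starts
    off4::UInt8    # where string 4 starts
end

String i then spans bytes off_i to off_{i+1}-1 of the shared buffer; with UInt8 offsets the four strings together are limited to 255 bytes, and UInt16 offsets would lift that to 64 KB.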

This brings me back to the binary tree case. Such a node is non-optimal if you want to address more than 64 GB, but larger addressability, without wasting memory, is possible with:

struct FiveBytePointer  # a 5-byte pointer; struct names can't start with a digit in Julia
  pointer::Ptr32
  offset::UInt8
end

How is that better? If the original struct is implemented with this, the UInt8 will actually be expanded to 4 bytes for alignment reasons, i.e. no savings, because of the interleaving of pointers and offsets; but that only holds if struct fields are not allowed to be reordered, as in C++ (and Julia currently), unlike in Rust. With reordering, the struct could be 4*4 + 4*1 = 20 bytes, in effect implemented as:

julia> struct personal_info
         last_name::UInt32  # Done explicitly via FiveBytePointer; the compiler implements it with Ptr32 and appends the hidden offset fields
         first_name::UInt32
         post_code::UInt32
         country::UInt32
         _o_last_name::UInt8
         _o_first_name::UInt8
         _o_post_code::UInt8
         _o_country::UInt8
       end

So for a binary tree, the first node would allocate a 128-byte block, or whatever size, e.g. 4096 bytes, and its struct would be:

julia> struct b_tree
         left::UInt32  # Done explicitly via FiveBytePointer; the compiler implements it with Ptr32 and appends the hidden offset fields
         right::UInt32
         _o_left::UInt8   # would be 0 for the first node
         _o_right::UInt8  # also 0
       end

It would still be a 25% saving for each node, except the first node would allocate, say, 128 bytes. The implicit malloc would allocate those 128 bytes, but when you then allocate another node, it would basically bump-allocate from the unused space within the first node’s allocation until it’s full, with the pointers staying the same and only the offsets changing. The GC would not have to track those suballocations and would never look at the offsets. The allocator would need to care though; it would call a fast thread-local bump suballocator. You could get as fast at allocating as Java, by a trivial addition, though without the compacting part. I don’t think fragmentation within the small blocks will be an issue.
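A minimal sketch of such a thread-local bump suballocator (all names invented; the real thing would live in the runtime and hand back the FiveBytePointer pairs above):

mutable struct BumpBlock
    base::Ptr{Cvoid}   # start of the 128-byte (or 4096-byte) block
    used::UInt8        # bytes handed out so far
end

const BLOCK_SIZE = 128

# Returns (block base, offset), or `nothing` when the block is full and a
# fresh block must be allocated.
function suballoc!(b::BumpBlock, nbytes::Integer)
    b.used + nbytes > BLOCK_SIZE && return nothing
    off = b.used
    b.used += nbytes
    return (b.base, off)   # same pointer as the first node, only the offset differs
end

The GC only ever traces base; the offsets are invisible to it.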

I’m thinking of fixing Julia’s outlier benchmark (much slower than Java; I’m open to other ways to fix that, anyone want to test with a non-default GC in Julia?), in addition to mainly string optimizations (I also have other ideas to fix strings):

I see now Java uses compressed 4-byte (not 5- or 8-byte) pointers on 64-bit (I just propose going further than they do, why do they only 8-byte align?! I would not be limited to 4 billion objects with phase 2):
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#compressedOop

Managed pointers in the Java heap point to objects which are aligned on 8-byte address boundaries. Compressed oops represent managed pointers (in many but not all places in the JVM software) as 32-bit object offsets from the 64-bit Java heap base address. Because they’re object offsets rather than byte offsets, they can be used to address up to four billion objects (not bytes), or a heap size of up to about 32 gigabytes. […]

Compressed oops is supported and enabled by default in Java SE 6u23 and later. In Java SE 7, use of compressed oops is the default for 64-bit JVM processes when -Xmx isn’t specified and for values of -Xmx less than 32 gigabytes.

I have seen real-world application use 40% less memory by leveraging the compressed object representation. When you’re talking about an application that uses 20 GB of memory in “full” 64-bit mode running on a system with 16 GB of physical ram, the performance gains are very, very real.

JVM uses compressed object pointers automatically below 32 GiB of memory. If you understand how it works (dropping the youngest three bits from each address, as they are always 0 due to memory alignment) you’ll understand that you can’t go further. There is an interesting fact: once you exceed this 32 GiB border the JVM will stop using compressed object pointers, effectively reducing the available memory. That means when increasing your JVM heap above 32 GiB you must go way above. According to the great Everything I Ever Learned about JVM Performance Tuning @twitter presentation (around 13:00), increasing the heap from 32 GiB to anything below 48 GiB will actually decrease the amount of available memory (!) because compressed object pointers are no longer there.

Seems like an interesting talk:

At 12:18 Java has a “performance uncanny memory performance valley between” 32 GB and 40 something gigs, “next stop is 48 GB”.

It seems like over 32 GB compressed pointers are turned off (not something I would want to do; the limit must just be high enough, and if it’s exceeded you simply run out of memory). In Java they seem to turn into 64-bit pointers, which is intriguing; it wouldn’t be possible without a JIT, and re-JITting everything! Java can compact heaps, so it could also be that Java changes the object alignment from 8 bytes (shift by 3) to 16 (shift by 4) dynamically?! That actually seems easier for them, and might be what is happening. That wouldn’t explain having less available memory though, only a slight hiccup in performance…


88% of allocations in Go (or is that only at ByteDance?) are 128 bytes or smaller:

33% are 16-byte and 17% are 48-byte; 8-byte (not possible in Julia) and every other size are each a smaller fraction. Calculations below are based on that time-point in the video.

Meaning if we have a minimum 64-byte allocation and alignment (a cache line in recent CPUs), we can address 256 GB of memory, i.e. of such objects, still limited to 4 billion addressable objects (actually fewer when some are large, and I always planned to special-case large allocations anyway, i.e. not bump-allocate them; I was just unsure at which cut-off).

And note, I calculate a (64/16*0.33 + 64/48*0.17 + 64/8*0.04 + 64/24*0.08 + 64/32*0.08) = 2.24x expansion factor for the allocations smaller than 64 bytes, though those are only 70% of the allocations, and with no expansion for the rest that means only a 56.8% expansion in RAM use (and RAM is cheap, but cache is not), with no mitigations such as in phase 2.
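Spelled out (the size-class shares are read off the talk, so treat them as approximate):

# per-class expansion factor weighted by share of allocations
expansion = (64/16)*0.33 + (64/48)*0.17 + (64/8)*0.04 + (64/24)*0.08 + (64/32)*0.08  # ≈ 2.24
# addressability with 64-byte blocks: 2^32 blocks * 64 bytes
2^32 * 64 / 2^30  # = 256.0 GB, matching the figure above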

By going to a 128-byte minimum it of course gets worse, adding 64 bytes of additional overhead to all allocations up to 112 bytes.

Isn’t that just an arena allocator?

It’s similar, but no, not just arenas. I don’t claim full originality; I’m basing my ideas on arena allocators and on how databases allocate in page-sized chunks. My SIMDString idea is similar to rows in a database sharing a page, and arenas aren’t about fewer pointers, if I understand correctly.

Note, you DO have arenas with Bumper.jl already, and I’m pretty sure with it (or similar techniques) Julia can be as fast as any language: C/C++, Rust, and Java. That said, it’s slower here, and I’m motivated by that worst-case outlier for Julia:

As I understand arenas from the video, and from all I know about such allocators, yes, they can be fast, but you have to opt into them, in e.g. C and Julia. And as explained in the video you can have many of them, and release each one all at once (but you do need to do that, or be ok with the memory leak). That is based on an arena ID, and then you have an offset into your arena. So you likely have at a minimum 4 + 4 bytes to point to memory within such an arena (possibly only 4 bytes?), though more likely 16 bytes. In my idea the pointer would be 4 bytes, with no need for an arena ID; and if you choose to have an offset into your block as I explained, it could be as little as 1 byte and still be useful.
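To make that size comparison concrete (the field widths are the point here; the names are made up):

struct ArenaRef          # arena-style: which arena, plus where inside it
    arena::UInt32
    offset::UInt32       # 8 bytes total, often more in practice
end

struct CompressedRef     # this proposal: 4 bytes, no arena ID needed
    p::UInt32            # block-aligned compressed pointer
end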

The rules at the Benchmarks Game rule out using arenas for GC languages (at least for some code, i.e. there).

It seems unfair if you want to show which is the “fastest language”, but I believe the point of the benchmark is to show what speed you get out of the box for GC languages (without tuning the GC or the allocator). That’s of course very unfair, since what is shown for C, C++ or Rust isn’t what you get by default either, only for highly tuned code that opts into arenas. It might be implicit that arenas are much used in such languages (but not in GC languages?), and true, that might be the culture, for e.g. game code. I’m just unsure about most code.

The video is interesting, claiming manual memory allocation isn’t that hard, or doesn’t need to be, though it IS strictly speaking a bit harder if you must think about arenas and opt into them, at least vs. GC languages.

I’m thinking about what can be had automatically, sort of arenas behind the scenes, without the programmer needing to know; some or many scientists might not know of or care about arenas. You also need to think about how composable they are: if you destroy an arena, how does that work? For games they are ok, with e.g. arrays of structs of bits types, but in general pointers from an arena could point outside it(?), to other arenas or just to strings.

If your arena points to another arena (would you do that?) then this gets complicated. If you have even just a huge array of strings (in your arena), they all point to the heap. In C and C++, if you release your arena you must release all your strings, and that’s an O(n) operation, i.e. slow. A garbage collector CAN actually be better for hard real-time: it could lazily release off-arena dependencies. To be fair, in robotics or games you would just not release, or do it outside of your hot loop.

I think compressed object references in julia could make a lot of sense, and the fears of max heap size are somewhat overblown – we’re facing quite different tradeoffs from JVM.

Julia could use the following:

  1. Definition: A julia heap object is something that starts with a julia object header.
  2. Pointers (Ptr) stay 8 byte, without any constraints. This especially means that the pointer in Memory objects must be 8 bytes. This is completely non-negotiable, because it must be possible to mmap / wrap externally allocated arrays.
  3. We could go to a design where object references are compressed, i.e. where julia heap objects can be addressed by 4 bytes. This would mean that we decide on some alignment for julia objects, e.g. 16 byte, and then julia reserves e.g. 64g of virtual address space on startup (see the sketch after this list). Out of that virtual address space, julia can then allocate / mmap its memory pools, and large objects. This introduces the constraint that all julia objects must live in this memory range!
  4. Notably, the backing payload of julia Memory / Array is not necessarily a julia object! Already today, the Memory object contains a pointer and large buffers are using the system (libc) allocator which often means separate mmap. So large arrays don’t come out of the limited 64g of “object heap space”.
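For illustration, a Linux-only sketch of that startup reservation (the flag values are Linux x86-64 specific, and all names here are made up):

# reserve 64 GB of virtual address space without committing physical memory
const PROT_NONE     = Cint(0)
const MAP_PRIVATE   = Cint(0x02)
const MAP_ANONYMOUS = Cint(0x20)
const MAP_NORESERVE = Cint(0x4000)

heap_base = ccall(:mmap, Ptr{Cvoid},
                  (Ptr{Cvoid}, Csize_t, Cint, Cint, Cint, Cssize_t),
                  C_NULL, 64 * 2^30, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)
# pools and large objects are later carved out of this range (mprotect /
# mmap with MAP_FIXED), and an object reference becomes
# (address - heap_base) >> alignment_shift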

This distinction is not available to JVM, because its Array type is always inline allocated, i.e. the object header is directly adjacent to the large payload. Huge array allocs in java are not bump allocated either, they go straight to syscall, just as in julia.

Compared to JVM arrays, we are paying one additional level of indirection in julia, for the benefit of being able to wrap standard Memory/Array around mmaped / separately allocated memory, and ultimately more convenient FFI. This is a huge performance price we’re already paying!
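Schematically (illustrative layouts only, not the real definitions):

struct JVMStyleArray      # JVM: header inline, payload immediately follows
    header::UInt64
    # ... payload bytes adjacent in the same allocation
end

struct JuliaMemoryStyle   # julia: small header object pointing elsewhere
    length::Int
    data::Ptr{Cvoid}      # payload may come from malloc / mmap separately
end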

This is much better than JVM off-heap data: First, the process is automatic for large allocs, and second, our backing memory is permitted to contain object references (e.g. Vector{Any}(undef, 1<<24) only consumes a couple hundred bytes of precious object-heap-space).

With large arrays separately allocated, I believe that 64g of object-heap space is plenty for the foreseeable future, especially for typical julia code, which consists of a moderate amount of julia objects carrying metadata, and a handful of large arrays / matrices containing actual data (often of the floating point variety).


I think you’re basically agreeing with me; I also wanted Ptr to stay 8-byte, and I suppose also for “the pointer in Memory objects”. I did think of mmap, more so than Memory. I suppose it’s common to mmap files, and you need to read strings even from there, i.e. they need to be byte-addressable. But I’m not sure strings in general need 8-byte pointers; to get away with 4-byte pointers they need to be under the (say) 64 GB ceiling (and it’s unclear to me that all memory mapping needs to be above it), so they would need to be copied to your heap, which you likely want anyway. Or do you think the (ordinary) Strings need to be able to point to any mmapped region? I think that might be actively bad, since the region might or will go away; the only good thing I see about having them there is the possible no-copy use (avoiding a one-time cost per string). Then you would need two types of strings… which is also possible.

That’s what I had in mind for Ptr32. It could be Base.Ptr32, not (yet) exported. You can, and would, convert all such pointers to 8-byte Ptr, but the reverse is just not possible in general (though you actually could with indirection, if such is wanted; it would still only allow a limited number of such pointers).
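As a sketch, building on the hypothetical Ptr32, HEAP_BASE and ALIGN_SHIFT from earlier:

# only the widening direction is defined, mirroring the point above
Base.convert(::Type{Ptr{T}}, p::Ptr32) where {T} =
    Ptr{T}(HEAP_BASE + (UInt64(reinterpret(UInt32, p)) << ALIGN_SHIFT))
# deliberately no convert(::Type{Ptr32}, ::Ptr): an arbitrary Ptr can lie
# outside the compressed range, so narrowing can't work in general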

Types like e.g. Int need to be aligned in general (even though some platforms allow misaligned access). When you mmap a file, you would put it on, say, a page boundary, so all Ints and Floats would be aligned, assuming they were aligned in the file too. But can you assume that in general, and read such types directly? Or should you only parse from it with byte-wise access (or use strings and the few types like UInt8 and Bool directly)? I suppose in many cases you can assume suitable alignment in a binary file, like JPEG?, and in others only bytes, like JSON. Even when aligned, the data might be big-endian, so maybe not too helpful to access directly?
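E.g. with the Mmap stdlib you can always fall back to byte-wise, endian-explicit reads (the file name and helper are mine, for illustration):

using Mmap

buf = Mmap.mmap(open("data.bin"), Vector{UInt8})  # the mapping itself is page-aligned

# copy 8 bytes, then reinterpret: works regardless of alignment in the file,
# and ltoh converts an assumed little-endian value correctly on any host
read_le_int64(b::Vector{UInt8}, off::Int) =
    ltoh(only(reinterpret(Int64, b[off+1:off+8])))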

Sorry, I didn’t mean “map a file in the filesystem”, I referred to “issue a syscall to map memory”, which is typically mmap with an anonymous mapping, or madvise.

In the julia world, we just ask malloc for the backing memory of large Memory instances. malloc in turn asks the OS kernel.

This btw is why many users have such performance issues with dTLB! Due to this architecture, we don’t get to control the fine details of the request and are at the mercy of e.g. glibc defaults that are not tuned to julia. It is extremely advisable to inform the kernel that userspace (aka julia) is too stupid to correctly convey madvise details about desired huge-pages, and to system-wide override userspace choices that are bad for julia performance (just don’t run large mySQL instances on the same machine). Cf eg here.

In the JVM world, they have to talk to the OS kernel relatively directly, because compressed object references together with the JVM array memory layout require very careful management of precious virtual address space. But they do set all the madvise options correctly, no system-wide kernel tuning required!

For large arrays, you can simply try out the alignment:
julia> @noinline function foo(n)
           a = Vector{Int}(undef, n)
           trailing_zeros(reinterpret(UInt64, pointer(a)))
       end
foo (generic function with 1 method)

julia> function foo2(n)
           mi = 64
           ma = 0
           for i = 1:1000
               t = foo(n)
               mi = min(mi, t)
               ma = max(ma, t)
           end
           mi, ma
       end
foo2 (generic function with 1 method)

julia> begin
       @show foo2(1)
       @show foo2(1<<10)
       @show foo2(1<<16)
       @show foo2(1<<24)
       end
foo2(1) = (5, 13)
foo2(1 << 10) = (6, 15)
foo2(1 << 16) = (6, 12)
foo2(1 << 24) = (6, 6)

You see that large array allocations are always aligned to exactly 64 bytes: The request goes to the OS kernel, gets page / hugepage aligned, and the initial 64 bytes are filled with metadata to allow free to clean up.

Yes.

I think that is guaranteed by hardware, pretty much, i.e. it would not be possible to write an OS that does it differently. Mapping between file and physical memory is pure software, but the mapping between virtual address and physical address is done by the MMU, directly in silicon. And the layout of page-tables has certain alignment guarantees.

So if you wrote a kernel that allows you to map a file with an offset of e.g. 1 – sure, no problem for your first process, you just need to record and handle that right on page-fault. Then a different process maps the file with an offset of zero and the two use that for IPC – and now you’re fucked, because the structure of the page table, as required by the in-silicon pagewalkers, simply doesn’t have space to encode an unaligned virtual-to-physical mapping. Even though userspace never sees physical addresses, this implies a constraint on alignment/offset between two different mappings that alias, i.e. target the same physical memory.


And if one is less definitive?

“We are profoundly uninterested in claims that these measurements, of a few tiny programs, somehow define the relative performance of programming languages aka Which programming language is fastest.”

If only someone would suggest what they consider to be very fair to all the programming language implementations :slight_smile:


I was hoping for even more speed from the new upcoming GC PR, but it is only 4.8% faster, and only compared to 1.11 (comparing min times):

$ juliaup add pr56288

$ hyperfine 'julia +1.11 -t 4 binarytrees.julia-4.julia 21'
Benchmark 1: julia +1.11 -t 4 binarytrees.julia-4.julia 21
  Time (mean ± σ):     12.591 s ±  0.304 s    [User: 14.098 s, System: 0.530 s]
  Range (min … max):   12.248 s … 13.252 s    10 runs
 
$ hyperfine 'julia +pr56288 -t 4 binarytrees.julia-4.julia 21'
Benchmark 1: julia +pr56288 -t 4 binarytrees.julia-4.julia 21
  Time (mean ± σ):     12.059 s ±  0.246 s    [User: 16.355 s, System: 0.559 s]
  Range (min … max):   11.654 s … 12.539 s    10 runs

both are actually regressions from 1.10, even when 1.10 runs single-threaded:

$ hyperfine 'julia +1.10 binarytrees.julia-4.julia 21'
Benchmark 1: julia +1.10 binarytrees.julia-4.julia 21
  Time (mean ± σ):     11.352 s ±  0.244 s    [User: 11.430 s, System: 0.765 s]
  Range (min … max):   11.068 s … 11.925 s    10 runs
 
$ hyperfine 'julia +1.10 -t 4 binarytrees.julia-4.julia 21'
Benchmark 1: julia +1.10 -t 4 binarytrees.julia-4.julia 21
 ⠏ Current estimate: 10.957 s    […]    ETA 00:01:17
^C

I’m looking into whether I’m actually using the new MMTk, and whether I need to tune it somehow:

I haven’t seen whether I need to tune it (with some env var), or even compile from source; is it a build option? It seems to be an off-by-default build option, and then there is still some improvement of 1.12-DEV vs 1.11, for some other reason…

I’m now trying with the instructions from there:

first:

$ sudo apt install cargo

and it’s seemingly downloading and compiling a lot of unrelated code too, ending with:

error: failed to run custom build command for `mmtk-julia v0.1.0 (/home/pharaldsson/MMTk/mmtk-julia/mmtk)`

Caused by:
  process didn't exit successfully: `/home/pharaldsson/MMTk/mmtk-julia/mmtk/target/release/build/mmtk-julia-3d100ed15296045f/build-script-build` (exit status: 101)
  --- stderr
  /home/pharaldsson/MMTk/julia/src/julia_fasttls.h:7:10: fatal error: 'atomic' file not found
  thread 'main' panicked at build.rs:69:10:
  Unable to generate bindings: ClangDiagnostic("/home/pharaldsson/MMTk/julia/src/julia_fasttls.h:7:10: fatal error: 'atomic' file not found\n")
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

So I suppose it’s not helpful to run this last command:

$ MMTK_PLAN=Immix MMTK_BUILD=release MMTK_JULIA_DIR=`pwd`/mmtk-julia make -C julia

For example, MMTk provides BumpPointer, which simply includes a cursor and a limit. In the following example, we embed one BumpPointer struct in the TLS.

Does Julia itself need to opt into something like: post_alloc in mmtk::memory_manager - Rust

I see in the code:

INLINE_FASTPATH_ALLOCATION

Something else to look into and might be related to what I’m proposing: