I want to make Julia's default allocations faster (and needed less often) than default allocations in C, C++, and Rust, and more space-efficient; with my proposal, the structs containing the Refs/pointers to the allocations would also become more compact (important for cache effects). [It is similar to what Java does with compressed pointers, but I want to learn from their mistakes(?) and from what they do right; see at the bottom.]
C++ (and I believe Rust) has many pointer types, but I think they are all 8 bytes each on 64-bit platforms, and that comes with a downside Julia can get around.
I’ve thought about the smallest allocations (trees or linked lists; the minimum allocation currently in Julia is 16 bytes), about very large allocations, and about typical sizes in between.
Part of the idea is to have 4-byte pointers as the default, i.e. NOT being able to point to arbitrary bytes in memory with them, while still keeping the current 8-byte Julia Ptr available as-is.
A second part of the proposal is to get away with fewer pointers, not just smaller ones, and in many cases they would be null pointers, lessening the load on the GC even further.
So to show an example:
julia> struct personal_info
last_name::String # Currently implemented with a 64-bit pointer; String is the hardest Julia type to change, though @edit does work for it.
first_name::String
post_code::String
country::String
end
julia> sizeof(personal_info) # NOT the true size
32
julia> @time (p = personal_info("Haraldsson", "Páll", "220", "Iceland"); println(p.last_name)) # It's actually 4 allocations for p, plus overhead for printing, which I have to do so the allocations aren't optimized away
Haraldsson
0.000077 seconds (6 allocations: 128 bytes)
So, 128 bytes allocated in 4 allocations for the strings (the other 2 allocations are printing overhead), plus 32 bytes in the struct, i.e. likely 1 more allocation for it (or rather 1 for an array of such structs, so the allocation for the struct itself can usually be ignored).
Phase 1:
If you use Ptr32, then the struct itself goes down to 16 bytes, but you have a dilemma: now the pointers can only address 64 GB of memory (2^32 objects times the 16-byte minimum allocation) while not wasting any space on the heap (ok as a configuration option). But we could go up to 512 GB (my choice, with 128-byte alignment) or even 16 TB of addressability with 4KB-aligned pointers.
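To make the scheme concrete, here is a minimal sketch of how such a Ptr32 could work (all names here — Ptr32, HEAP_BASE, ALIGN_SHIFT, address, compress — are hypothetical; nothing like this exists in Julia today): a 32-bit object index is scaled by the alignment and added to a heap base address:

const HEAP_BASE = Ref{UInt64}(0)   # set once, when the heap region is reserved
const ALIGN_SHIFT = 4              # 16-byte alignment: 2^32 * 16 B = 64 GB
                                   # 7 would mean 128-byte blocks -> 512 GB
                                   # 12 would mean 4 KB blocks -> 16 TB

struct Ptr32
    index::UInt32                  # an object index, not a byte address
end

# Decompress: full 64-bit address = heap base + (index << alignment shift)
address(p::Ptr32) = HEAP_BASE[] + (UInt64(p.index) << ALIGN_SHIFT)

# Compress: only valid for in-range, properly aligned addresses
function compress(addr::UInt64)
    off = addr - HEAP_BASE[]
    @assert off % (UInt64(1) << ALIGN_SHIFT) == 0 "address not aligned"
    Ptr32(UInt32(off >> ALIGN_SHIFT))
end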
But with a 128-byte minimum allocation (I think Common Lisp may be adopting such) and alignment, e.g. a binary tree wastes space at each node: the allocation grows 8x (128 bytes vs the current 16-byte minimum), though likely only 5.3x or less in practice, i.e. with 81.25% of the allocated space unused (that's the worst real-world case; actual overhead is usually a lot less, even without phase 2). I think it's actually ok for the first phase, and especially if the GC can track two types of pointers, this new one and the old one, then there's no overhead (we could also just GC-track the smaller kind).
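To make that arithmetic concrete (under my assumption of a node with two Ptr32 fields plus a 16-byte payload, so 24 bytes used out of a 128-byte block):

julia> 128 / 24              # block size over bytes actually used
5.333333333333333

julia> (128 - 24) / 128      # fraction of the block left unused
0.8125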
Phase 2:
To lessen the overhead, what can be done is, in effect:
julia> struct personal_info
last_name::Ptr32 # Implemented with UInt32; to users it would look like a String here
first_name::UInt8 # These extra fields could actually be optimized away, moved to the heap
post_code::UInt8
country::UInt8
end
julia> sizeof(personal_info) # 8 (or even just 4), plus the real text data, 24 bytes, which would likely be padded to e.g. 128
8
I.e. the String for last_name, i.e. its pointer, is shared, and then you only have offsets into that string for first_name etc. The offsets could stack on top of each other (be cumulative), and they could also be implemented with e.g. UInt16.
It’s a bit complex for the compiler to convert the first form to that form, but not unthinkable. As a first step I would implement a SIMDString4 laid out just like that personal_info; it could be used (e.g. if it ends up as the default String type) for anywhere from 1 string up to 4 strings, though most users wouldn’t want to use it directly, since it needs numeric indexing… (see the sketch below).
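A hypothetical sketch of that SIMDString4, reusing the Ptr32/address sketch from above (the field names and the accessor are my own; nothing here is decided):

struct SIMDString4
    data::Ptr32     # one shared pointer to a heap block holding all the bytes
    end1::UInt8     # cumulative byte offset where the 1st string ends
    end2::UInt8     # string i spans bytes (end_{i-1}, end_i] of the block
    end3::UInt8
    end4::UInt8
end

# Rough accessor for the i-th string (unsafe, for illustration only)
function getstring(s::SIMDString4, i::Int)
    ends = (s.end1, s.end2, s.end3, s.end4)
    lo = i == 1 ? UInt8(0) : ends[i-1]
    unsafe_string(Ptr{UInt8}(address(s.data) + lo), Int(ends[i] - lo))
end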
This brings me back to the binary tree case. That is non-optimal if you want to address more than 64 GB, but larger addressability without wasting memory is possible with:
struct FiveBytePointer # a “5-byte pointer”: a 4-byte Ptr32 plus a 1-byte offset
    pointer::Ptr32 # selects an aligned block
    offset::UInt8  # byte offset within that block
end
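Decompressing such a pointer, again in terms of the hypothetical Ptr32/address sketch above, would be roughly:

# Full address = block address from the Ptr32, plus the byte offset within it
address(p::FiveBytePointer) = address(p.pointer) + p.offset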
How is that better? If the original struct is implemented with this type directly, each UInt8 will actually be expanded to 4 bytes for alignment reasons, i.e. no savings, because the pointers and offsets are interleaved; but that only holds if struct fields are not allowed to be reordered, as they are in Rust (unlike in C++, and in Julia currently). With the offsets grouped together, the struct could be 4×4 (pointers) + 4×1 (offsets) = 20 bytes, in effect implemented as:
julia> struct personal_info
last_name::UInt32 # Done explicitly with FiveBytePointer, the compiler implementing it with Ptr32 and appending the extra hidden offset fields
first_name::UInt32
post_code::UInt32
country::UInt32
_o_last_name::UInt8
_o_first_name::UInt8
_o_post_code::UInt8
_o_country::UInt8
end
So for a binary tree, the first node would allocate a 128-byte block (or whatever size, e.g. 4096 bytes), and its struct would be:
julia> struct b_tree
left::UInt32 # Done explicitly with FiveBytePointer, the compiler implementing it with Ptr32 and appending the extra hidden offset fields
right::UInt32
_o_left::UInt8 # would be 0 for the first node
_o_right::UInt8 # also 0
end
It would still be a 25% saving for each node, except the first node would allocate say 128 bytes. The implicit malloc you do would allocate those 128 bytes, but then you allocate another node, and it would basically bump-allocate from within the unused space of the first node's allocation until it's full, with the pointers staying the same and only the offsets changing. The GC would not have to track those suballocations; it would never look at the offsets. The allocator would need to care, though, and it would call a fast thread-local bump suballocator (see the sketch below). You could get as fast at allocating as Java by this trivial addition, though you would not get the compacting part of its GC. I think fragmentation within the small blocks will not be an issue.
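A minimal sketch of that thread-local bump suballocator (hypothetical; the block size, field names, and API are all my assumptions, building on the types above):

const BLOCK_SIZE = 128

mutable struct BumpBlock
    base::Ptr32    # the one pointer the GC actually tracks
    used::UInt8    # bytes handed out so far within the block
end

# Hand out nbytes from the current block as (block pointer, offset), i.e. the
# two halves of a FiveBytePointer; return nothing when the block is full, so
# the caller allocates a fresh 128-byte block instead.
function suballoc!(blk::BumpBlock, nbytes::Int)
    Int(blk.used) + nbytes > BLOCK_SIZE && return nothing
    off = blk.used
    blk.used += nbytes
    return FiveBytePointer(blk.base, off)
end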
I’m thinking of fixing Julia’s outlier benchmark (much slower than Java; I’m open to other ways to fix that, anyone want to test with a non-default GC in Julia?), in addition to mainly string optimizations (I also have other ideas to fix strings):
I see now that Java uses compressed 4-byte (not 5- or 8-byte) pointers on 64-bit (I just propose going further than they do — why do they only 8-byte align?! And I would not be limited to 4 billion objects with phase 2):
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#compressedOop
Managed pointers in the Java heap point to objects which are aligned on 8-byte address boundaries. Compressed oops represent managed pointers (in many but not all places in the JVM software) as 32-bit object offsets from the 64-bit Java heap base address. Because they’re object offsets rather than byte offsets, they can be used to address up to four billion objects (not bytes), or a heap size of up to about 32 gigabytes. […]
Compressed oops is supported and enabled by default in Java SE 6u23 and later. In Java SE 7, use of compressed oops is the default for 64-bit JVM processes when -Xmx isn’t specified and for values of -Xmx less than 32 gigabytes.
I have seen real-world application use 40% less memory by leveraging the compressed object representation. When you’re talking about an application that uses 20 GB of memory in “full” 64-bit mode running on a system with 16 GB of physical ram, the performance gains are very, very real.
JVM uses compressed object pointers automatically below 32 GiB of memory. If you understand how it works (dropping youngest three bits from each address as they are always 0 due to memory alignment) you’ll understand that you can’t go further. There is an interesting fact: once you exceed this 32 GiB border JVM will stop using compressed object pointers, effectively reducing the available memory. That means increasing your JVM heap above 32 GiB you must go way above. According to the great Everything I Ever Learned about JVM Performance Tuning @twitter (around 13:00) presentation, increasing heap from 32 GiB to anything below 48 GiB will actually decrease the amount of available memory (!) because compressed object pointers are no longer there.
Seems like an interesting talk:
At 12:18: Java has an “uncanny memory performance valley” between 32 GB and 40-something gigs; “next stop is 48 GB”.
It seems like over 32 GB compressed pointers are turned off (not something I would want to do; the limit just needs to be set high enough, and if it's exceeded you simply run out of memory). In Java they seem to turn into 64-bit pointers, which is intriguing; that wouldn’t be possible without a JIT, and re-JITting everything! Java can compact heaps, so it could also be that Java changes the memory alignment from 8 bytes (shift by 3) to 16 bytes (shift by 4) dynamically?! That actually seems likely to be easier for them, and might be what is happening. It wouldn’t explain the reduced available memory though, only a slight hiccup in performance…
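To sanity-check the addressability numbers discussed above (a quick calculation of my own: 2^32 addressable objects times the alignment in bytes):

julia> for shift in (3, 4, 7, 12)   # 8 B (Java), 16 B, 128 B, 4 KB alignment
           println(2^shift, "-byte alignment: ", 2.0^32 * 2^shift / 2^30, " GiB")
       end
8-byte alignment: 32.0 GiB
16-byte alignment: 64.0 GiB
128-byte alignment: 512.0 GiB
4096-byte alignment: 16384.0 GiB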