Question: how do I get an efficient, fixed memory layout for large 1D structures?

Hi,

I’m trying to figure out the “right way” to declare a few 1D tables for my chess engine in Julia, such that their memory layout is (a) as space-efficient as possible and (b) fixed, never changing (i.e. the GC should never move any portion of a table, or a whole table, for any reason).

Specifically the requirements & specs for any one of those tables are:

  1. They can be huge relative to the total available RAM on a computer (up to ~50% of RAM in total, as a rough estimate).

  2. All tables are allocated and initialized at engine startup and will never change in size - only the data within the tables might change (some will even hold constant values, purely for efficient lookups).

  3. Access will only ever be random (no looping over the tables in any regular manner) - just indexed access to single entries at any given time, with no correlation to previous indices.

…so given the above, I basically just need some kind of fixed 1D array structure with a tight memory layout (no need to grow / shrink or do other funky stuff), so that I can access any entry with a given key like so: entry = table[key].

I think this code summarizes my attempts so far well enough…

using Formatting
using Printf
const fD = generate_formatter("%'d")

mutable struct Entry
    data0::UInt64
    data1::UInt16
end
function test()
    nEntries = 4
    table = Array{Entry,1}(undef, nEntries)
    #table = Vector{Entry}(undef, nEntries) # alternative, but really same thing
    for i = 1 : nEntries
        table[i] = Entry(i, i) # just fill with unique dummy-data, for testing
    end
    entrySize = sizeof(Entry)
    tableSize = sizeof(table)
    @assert tableSize == nEntries * entrySize "Unexpected: sizeof(table) != nEntries*entrySize ($tableSize!=$(nEntries*entrySize))."
    printfmt("size of table: {1} entries [{2} Bytes @ {3} B/entry]\n", fD(nEntries), fD(tableSize), fD(entrySize) )
    return table
end

This immediately results in a failed assertion…

ERROR: AssertionError: Unexpected: sizeof(table) != nEntries*entrySize (32 != 64).

…I was expecting the struct not to be tightly packed, but rather padded to some multiple of the struct’s largest component (UInt64 = 8 bytes in this case). But why is it showing me 32 bytes for the Vector - is that “just” the size of the memory pointers (64-bit on my platform / OS)?
When I remove the assertion, everything is initialized as I had anticipated - just sizeof(table) doesn’t correspond to the actual size…

julia> test()
size of table: 4 entries [32 Bytes @ 16 B/entry]
4-element Vector{Entry}:
 Entry(0x0000000000000001, 0x0001)
 Entry(0x0000000000000002, 0x0002)
 Entry(0x0000000000000003, 0x0003)
 Entry(0x0000000000000004, 0x0004)

Just looking at the table entries, the table clearly has to be at least 40 bytes in size. Is there another way to determine the actual size of a(ny) 1D structure in Julia? If there isn’t an easy function for querying the size of the table, can I just assume that it is sizeof(Entry) * nEntries bytes large (64 bytes in this case)? Or do I need to use some other construct for this kind of requirement (fixed, static 1D tables)?

1 Like

I think a vector (what you used) is what you want.

Do you need the entries to be mutable?

Yes, there are some tables that start empty and are only filled during run-time, so the individual entries need to be mutable - just the structure (total number of entries) won’t change. And yes, a vector seems like the most suitable simple structure. :+1:

The entries don’t need to be mutable for the array to be mutable: you can fill the array and replace entries with new immutable entries. I’m not completely sure, but in general that is preferable in terms of performance.

2 Likes

Yes, the array will not be mutable, just the entries. And any structure which ensures that the entries (mutable or not) stay in their designated memory spots should be performant. If I could, I would staple each entry to an address. :sweat_smile:

You should make the elements immutable and just replace them. Julia won’t store mutable elements of an array inline, because different references to a mutable object must have the same memory address. This works out because, with immutable element types, the compiler won’t allocate memory when you create new ones - it gets optimized down to an in-place mutation of the entry in the array.
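Roughly like this (a minimal sketch, with the field names just mirroring the example from the first post):

struct Entry           # immutable -> isbits, so a Vector{Entry} is one flat block of memory
    data0::UInt64
    data1::UInt16
end

table = Vector{Entry}(undef, 4)            # contiguous storage, no per-entry pointers
table[3] = Entry(0xbeef, 0x0001)           # writing a slot just overwrites its bits in place
table[3] = Entry(table[3].data0, 0x0002)   # to change one field, build a new Entry from the old one and store it back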

3 Likes

If you can live with the caveats in Some counterintuitive behaviors · StructArrays, StructArrays.jl could be another option.

Wow, I didn’t expect it to be so complicated to instantiate data structures that should really be extremely simple. But it turns out I’m running into error after error… :grimacing:

So I got some rudimentary hashtable code up and running for my chess engine…

using Formatting
using Printf
import Base: length
import Base.@propagate_inbounds

const fF = generate_formatter("%5.3'f")     # format for typical float-output (3-digit-prec after .)
const fD = generate_formatter("%'d")        # format for typical decimal-output (with 1000s-sep)
const fH = generate_formatter("0x%016x")    # format for typical hex-output
const SI(x) = formatted(x, :SI, ndigits=4)  # scientific not. (kilo, mega, ...)

mutable struct HashTable{T}
    entries::Vector{T}
    size::UInt64
    hashKeyMask::UInt64
    function HashTable{T}(n::UInt64) where T
        @assert checkPow2(n) "Spec-mismatch: size is not a power of 2 (n = $n)."
        en = Vector{T}(undef, n)
        new(en,n,n-1)
    end
end
function length(ht::HashTable) return ht.size end
function checkPow2(v) return ( (v & (v-1))==0 && v>0 ) end
function hashKey2Index(ht::HashTable, hashkey) return (hashkey & ht.hashKeyMask) + 1 end
@propagate_inbounds function putForced(ht::HashTable{T}, hashkey::UInt64, entry::T) where T
    # no checking for collisions - potential overwrite...
    ht.entries[ hashKey2Index(ht, hashkey) ] = entry
    nothing
end
function get(ht::HashTable{T}, hashkey::UInt64) where T
    index = hashKey2Index(ht, hashkey)
    @inbounds return ht.entries[index]::T
end
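A quick sanity check of the above behaves as expected (arbitrary values, just to illustrate the key-to-index mapping):

ht = HashTable{UInt64}(UInt64(8))     # 8 slots, hashKeyMask = 7
key = 0x0000000000000005
putForced(ht, key, UInt64(42))        # lands in slot (5 & 7) + 1 = 6
@assert ht.entries[6] == 42
@assert get(ht, key) == 42            # the get defined above, not Base.get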

…so far so good…
Then I made a function setupTable(...), which actually creates a (preinitialized) hashtable for me.
(Note: For now I’ve fallen back to just using UInt64s as my ht-entries, to keep things simpler and to avoid having another struct inside the array of the mutable struct HashTable.
…also because I don’t really understand whether those entries should then be mutable or immutable - I’ve read so much now, but it still doesn’t make sense to me how to change immutable elements.)

function setupTable(sizeInBits, default)
    tableSize = 1 << sizeInBits
    entrySize = sizeof(UInt64)
    nEntries = (tableSize ÷ entrySize) % UInt64
    table = HashTable{UInt64}(nEntries::UInt64)
    for i = 1 : nEntries
        # init just with dummy-data, for getting it to work...
        putForced(table, i, (default + i)%UInt64)
    end
    tableSize = nEntries * entrySize
    @assert nEntries > 0 "Table size too small => 0 elements fit."
    printfmt("size of table {1}: {2} elements [{3} bytes @ {4} B/el.]\n"
            , SI(tableSize), fD(nEntries), fD(tableSize), fD(entrySize) )
    return table
end

…but even with all these simplifications - which are still quite a bit away from the problem I tried to understand earlier - I can’t even get this simple variation to work when I try to create a bunch of these HashTables, put them into a StaticArray, and perform some benchmarks and comparisons on them (at different sizes)…

This snippet…

# maxSize: In bits -> 20=1MB, 30=1GB, etc.
function tables1(maxSize, default::UInt64)
    tmp = Vector{HashTable{UInt64}}(undef, maxSize)
    for i = 12 : maxSize
        tmp[i] = setupTable(i, default)
    end
    # construct SVector, explicitly from the 21 HashTables in tmp
    tables = SVector{maxSize, HashTable{UInt64}}(tmp)
    return tables
end

…was supposed to produce 21 HashTables for benchmarking, but it yields an error, and I don’t understand why…

julia> test = tables1(30,0x80000000000)
size of table 4.096k: 512 elements [4,096 bytes @ 8 B/el.]
size of table 8.192k: 1,024 elements [8,192 bytes @ 8 B/el.]
size of table 16.38k: 2,048 elements [16,384 bytes @ 8 B/el.]
size of table 32.77k: 4,096 elements [32,768 bytes @ 8 B/el.]
size of table 65.54k: 8,192 elements [65,536 bytes @ 8 B/el.]
size of table 131.1k: 16,384 elements [131,072 bytes @ 8 B/el.]
size of table 262.1k: 32,768 elements [262,144 bytes @ 8 B/el.]
size of table 524.3k: 65,536 elements [524,288 bytes @ 8 B/el.]
size of table 1.049M: 131,072 elements [1,048,576 bytes @ 8 B/el.]
size of table 2.097M: 262,144 elements [2,097,152 bytes @ 8 B/el.]
size of table 4.194M: 524,288 elements [4,194,304 bytes @ 8 B/el.]
size of table 8.389M: 1,048,576 elements [8,388,608 bytes @ 8 B/el.]
size of table 16.78M: 2,097,152 elements [16,777,216 bytes @ 8 B/el.]
size of table 33.55M: 4,194,304 elements [33,554,432 bytes @ 8 B/el.]
size of table 67.11M: 8,388,608 elements [67,108,864 bytes @ 8 B/el.]
size of table 134.2M: 16,777,216 elements [134,217,728 bytes @ 8 B/el.]
size of table 268.4M: 33,554,432 elements [268,435,456 bytes @ 8 B/el.]
size of table 536.9M: 67,108,864 elements [536,870,912 bytes @ 8 B/el.]
size of table 1.074G: 134,217,728 elements [1,073,741,824 bytes @ 8 B/el.]
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex
   @ .\array.jl:924 [inlined]
 [2] macro expansion
   @ C:\Users\rober\.julia\packages\StaticArrays\PUoe1\src\convert.jl:212 [inlined]
 [3] unroll_tuple
   @ C:\Users\rober\.julia\packages\StaticArrays\PUoe1\src\convert.jl:207 [inlined]
 [4] convert
   @ C:\Users\rober\.julia\packages\StaticArrays\PUoe1\src\convert.jl:199 [inlined]
 [5] SVector{30, Vector{HashTable{UInt64}}}(a::Vector{HashTable{UInt64}})
   @ StaticArrays C:\Users\rober\.julia\packages\StaticArrays\PUoe1\src\convert.jl:169
 [6] tables1(maxSize::Int64, default::UInt64)
   @ Main .\Untitled-2:58
 [7] top-level scope

…creation of the hash tables is effortless, even at large sizes, but I don’t understand what kind of undefined reference the static array is complaining about.
The constructor signature seems to match what I’ve found elsewhere in a (toy) example:

v2 = SVector{3,Float64}(1, 2, 3) # length 3, eltype Float64

Since it didn’t work, I also tried the macro variant of creating an SVector, like this…

function tables2(maxSize, default::UInt64)
    # try with a macro
    tables = @SVector [ setupTable(i, default) for i = 10 : maxSize ]
    return tables
end

…which gives me a compile-time error:

ERROR: LoadError: UndefVarError: maxSize not defined
Stacktrace:
 [1] top-level scope
   @ none:1
in expression starting at Untitled-2:64

Interpolating maxSize with a $ then gives this error…

ERROR: LoadError: syntax: "$" expression outside quote
Stacktrace:
 [1] top-level scope
   @ none:1
in expression starting at Untitled-2:64

:grimacing::man_facepalming:

Edit: Admittedly, I’ve resorted to just guessing now - but I feel like I have no clue what is going on, when all I wanted was to allocate a simple block of memory for an extremely simple array of primitive data types. :thinking:

In my really humble opinion, you are overcomplicating things that should be simple, perhaps because you are trying to translate an implementation from another language. I would create your original list of tables like this:

julia> struct Entry
           data0::UInt64
           data1::UInt16
       end

julia> hash_table_list = [ Vector{Entry}(undef,10) for _ in 1:21 ];

julia> hash_table_list[1][1] = Entry(1,1) # set some entry
Entry(0x0000000000000001, 0x0001)

julia> hash_table_list[1][1] = Entry(1,2) # replace entry
Entry(0x0000000000000001, 0x0002)

where the “10” there is the length of the tables.

That data structure is compact. The immutable part is the entries; the rest (the containers) can be mutable and won’t affect performance much.

1 Like

Perhaps you can explain a bit more about the particular use these 1D arrays will require - then you can dispatch on them based on specific properties you want them to have.

If you want your Entry to represent successive chessboards, maybe a run-length-encoding technique could work.
If your Entrys are historical pro chess moves, you don’t need to consider the entire universe of possible chess boards, but a reduced subset of them, which may permit a more compact encoding.
If you only need to process the data0 fields on their own, putting them in a StructOfArrays.jl-style vector could yield good benefits.

But these all depend on what you specifically need. Fortunately, a custom array type and dispatching on it is the Julia creed.

Thx for your response.
Much as I understand your curiosity and willingness to think about the content of the data entries in the arrays, there is literally nothing uncertain about that part for me. I am merely trying to implement efficient containers to store and retrieve individual data elements, and I’m happy to open another thread explaining all the details of the content - but for the sake of the questions I posted here, and for solving the issues I’m having, the content is literally irrelevant. :wink:

I’m happy to give some background, though, just to motivate why I even need these containers.

  • One container / hashtable (the largest) will be a transposition table, storing some of the positions encountered in the search tree of what is essentially a (much refined) alpha-beta search. Since a few million nodes are visited during a search within short time spans (seconds), only a portion of those positions will make it into the hashtable (those which belong to the currently best-evaluated line for either side, and which have the highest chance of enabling pruning of the search tree in other parts that traverse the same position). The encoding of a “position” is achieved via Zobrist hashing, which can be computed much faster than any run-length encoding etc., and at the same time serves as the hash key (index into the table). You can find a rough sketch of how entries in the transposition table will look in this post. Overall, the hashtable is a global structure whose entries persist even across moves, as positions are sometimes precalculated at great depth and their evaluation may stay relevant, or they may even reoccur (due to move ordering, especially in phases of the game with few captures).

  • The above-mentioned Zobrist hashing is computed from ~750 pre-computed (random) numbers, one for each piece on every square: 64 squares x 2 colors x 6 piece types = 768, plus castling rights and a few other details. When a move is made (simulated in the search!), the current position’s Zobrist hash key is simply XOR-ed with 2…3 of those numbers: one for the piece leaving a square, one for the same piece arriving on another square, plus potentially one more for the target square and a captured piece, in case it is a capturing move. Etc. (a rough sketch of this follows after this list).
    These numbers are also stored in a lookup table. In this case it is a very small table (~750 x 8 bytes = ~6 KB) - kind of a lookup table for the building blocks of hash keys into the first table. :wink:

  • Then there are other hashing functions in use, especially for the move generation of sliding pieces (rooks, bishops, queens), which are (like the Zobrist hash keys) pre-computed and looked up during the game-tree search for quick move generation. See here, especially, for a good intro to fancy magic bitboards, which can be understood in terms of minimal perfect hash functions. Those tables are still much smaller than the transposition table, but sadly they do exceed the L1 and L2 caches, even on modern machines (~820 KB).

  • Those are probably the most prominent tables, but then there are a bunch of others, acting as parameter lookups for positional bonuses of pieces (depending on their board position) or even whole arrangements of piece ensembles (like pawn structures, etc.). This data is more flexible: it can either be computed in code or looked up as pre-computed data. It is essentially always a tradeoff between the cost of the calculation and the cost of the alternative lookup in a precomputed table. Interestingly, this tradeoff itself is usually a parameter which can be tuned according to certain criteria. For example, a transposition table can often speed up search by almost a factor of 4…5x in the endgame, but it is only marginally useful in the opening, etc.

  • A whole other story are endgame tablebases (Endgame tablebase - Wikipedia). I think they are still working on 8-piece endgame tablebases at the moment, and this might get solved in the foreseeable future, but that is not what I’m trying to do here. I will eventually support endgame tablebases, and I’m sure I’ll also provide a cache for some of that data, but they will have to reside on disk, as even the (“only”) 6-piece tables are ~150 GB large - and that is with some very efficient encoding already. :wink:
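For concreteness, the incremental Zobrist update mentioned above looks roughly like this (a simplified sketch - the actual index mapping and the extra terms for castling rights, en passant etc. are omitted):

const ZOBRIST = rand(UInt64, 64 * 2 * 6)        # one random number per (square, color, piece-type)

# hypothetical index helper: square ∈ 1:64, color ∈ 1:2, piece ∈ 1:6
zidx(sq, color, piece) = sq + 64*(color - 1) + 128*(piece - 1)

# a quiet (non-capturing) move only XORs two of the precomputed numbers into the running key
function updateKey(key::UInt64, from, to, color, piece)
    key ⊻= ZOBRIST[zidx(from, color, piece)]    # piece leaves the from-square
    key ⊻= ZOBRIST[zidx(to, color, piece)]      # the same piece appears on the to-square
    return key
end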

…I hope that helps motivate my efforts a bit.

For the questions I’m having regarding a general structure for such tables in Julia: ALL access to those tables will only ever be random, i.e. no iteration over entries in any way, and no calculations (like matrix multiplication, broadcasts, or whatnot) - just plain access to individual entries of 1D tables via their unique index. There is really not much more to say about it: only read access in some cases, read-write in others.

1 Like

You’re absolutely right, I somehow made it much more complicated than it had to be.
It’s not really because of porting from another language - the allocation in Java is basically a (trivial) one-liner - but all that mutable / immutable / StaticArray hocus-pocus just got me confused, I think, even more so since I knew that all I want is to just allocate a block of 1D memory! :sweat_smile:
After your example, though, I have a pretty simple version running (a bit more involved, as I now already have a hashtable struct with some logic, but that all works very well).
I have run some test benchmarks over 1M mutations of entries with 0 allocations. :+1:
I just started a (batch) benchmark over a couple of those tables (at different sizes and with linear vs. random access) to see if I can eyeball-validate that the structures work in memory as expected, i.e. whether I can see them performing worse as soon as their sizes exceed the cache limits of the various levels… Will likely post a picture here later.
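In case it helps to judge the overhead: the random-access part of the benchmark is roughly shaped like this (a simplified sketch, using BenchmarkTools together with the setupTable / get from my earlier post):

using BenchmarkTools

function randomAccess(ht, keys)
    s = UInt64(0)
    @inbounds for k in keys
        s ⊻= get(ht, k)      # pure lookups; accumulate so nothing gets optimized away
    end
    return s
end

ht   = setupTable(20, 0x80000000000)   # 1 MB table
keys = rand(UInt64, 1_000_000)         # uncorrelated keys -> cache-unfriendly access pattern
@btime randomAccess($ht, $keys)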

3 Likes

@fins

This may help:

struct Foo
   a::Int64
end

mutable struct Foo2
    a::Int64
end

isbits(Foo(1)) ## true
isbits(Foo2(1)) ## false

So if you make an array of a struct that is isbits, it’ll just be one big block of memory, and that’s what you want.

If it’s an array of mutable structs… it’ll be a big block of memory containing pointers… and the actual data is stored on the heap somewhere.

That’s so that you can do something like

a = myarray[123]
push!(myarray,MyStruct(...))

a.myfield = 1

the push! might cause the array to allocate a new, bigger block of memory and copy over the old contents… if the structs were stored inline and a were a “pointer into the old location”, then mutating a would mutate stale data. Because the array only stores references to the heap-allocated structs, a keeps pointing at the same object.

If you can’t mutate the struct, then a = myarray[123] can basically just copy the entry at 123 onto the stack.

Of course, although the structs are immutable, that doesn’t mean you can’t change what’s in the array! You can change the entries of the array, so

myarray[123] = MyStruct(...)

is fine, because you’re replacing the struct at 123 with a new struct. Under the hood the compiler is just changing the bits.
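One way to see this difference directly, with the Foo / Foo2 from above (Base.summarysize walks referenced objects, unlike sizeof, which only measures the array’s own buffer):

v1 = [Foo(i)  for i in 1:4]    # isbits elements stored inline: one flat 32-byte buffer
v2 = [Foo2(i) for i in 1:4]    # buffer holds 4 pointers; each Foo2 lives on the heap

sizeof(v1), sizeof(v2)                       # (32, 32) on a 64-bit machine - both buffers are 4 x 8 bytes
Base.summarysize(v1), Base.summarysize(v2)   # summarysize follows the pointers, so v2 reports more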

1 Like

Thx! That really helps - I remember looking at some objects with isbits(...) a while ago and not quite understanding what it was really telling me (despite a clear description in the docs). It might be because, at the time, I didn’t realize the consequences of some object (e.g. a mutable struct) yielding false. It likely got buried under so many layers of abstract vocabulary that I discarded it as meaningless - since literally everything is bits on a computer (including pointers).

2 Likes

Yes, everything is made of bits, but isbits(x) means it can be stored inline in an array or stuck on the stack. If it’s not isbits(x), then it’ll get allocated on the heap and tracked via pointers… that’s more or less my understanding… remember, I was in the same place you are in around 2019-2020, so I’m trying to pull you up along with me, but we are both probably making mistakes. :wink:

Yeah, the name isbits isn’t great, but it basically means that there aren’t any pointers hierarchically in the struct.

1 Like

No worries - names are there to be learned, and as long as the semantics fit and are relevant, I’ll learn them. :sweat_smile:

…kind of funny, as I produced an SVG output with Gadfly, but had to make a JPG screenshot to upload it.
…but being able to quickly do stuff like this was definitely one of the reasons why I went for Julia. :wink:

What’s interesting is that one can see the capacity of the L3 cache quite clearly (I have 4 MB of L3), but the other two (L1 & L2) don’t really show any distinct drop in performance when the tables exceed them. Not sure if I overlooked something, but it is still interesting.
My L1 is 32 KB (data cache only) and L2 is 512 KB (each per core, while L3 is shared between all cores).

What’s your benchmarking code? If it has too much overhead, it might be hard to see the difference.