Question about how to fix some efficient memory layout of large 1d-structures?

I’m going to try posting the relevant portions. It’s all a bit hacked together, after all the testing with datastructures, etc. And there are still a lot of test-outputs, that can be switched on, by calling the functions with some value verbose>0, but 0 was used, for producing the graphic.

datastructure…

mutable struct HashTable{T}
    entries::Vector{T}
    size::UInt64
    hashKeyMask::UInt64
    function HashTable{T}(n::UInt64) where T
        if !checkPow2(n) printfmt("Houston... size is not a pow of 2 (size= ", n) end
        en = Vector{T}(undef, n)
        new(en,n,n-1)
    end
end
function checkPow2(v) return ( (v & (v-1))==0 && v>0 ) end
function hashKey2Index(ht::HashTable, hashkey) return (hashkey & ht.hashKeyMask) + 1 end
@propagate_inbounds function putForced(ht::HashTable{T}, hashkey::UInt64, entry::T) where T
    # no checking for collisions - potential overwrite.
    ht.entries[ hashKey2Index(ht, hashkey) ] = entry
    nothing
end
function get(ht::HashTable{T}, hashkey::UInt64) where T
    index = hashKey2Index(ht, hashkey)
    @inbounds return ht.entries[index]::T
end
struct HTEntry
    d0::UInt64  # use this, for now (1.)
end

Outer loops, controlling the calls to the actual function, doing the fetches of the HashTables…

function testEverything(maxSize, iter=100_000_000, metaIter=20, verbose=0)
    # ...initialization of all tables, ONCE.
        # bunch of loops and rng-stuff to randomize the order of tables in the benchmark-calls...

        # eventually this was getting called for each table (of different size)
            # for a total of 26x for each table (2x warmup + 24x recorded)
            # ...of each 2 of those calls, one was for type = linear (-1) & one for type = random (0)...

                sleep(.1)
                b = @benchmark $dummy ⊻= accessData($table, $iter, $type, $verbose) seconds = 1
                printfmt("memory= {1}, allocs= {2} => ", SI( b.memory ), b.allocs )
                if !warmup
                    df = DataFrame( access=access, prefetch="no", tabSize=SI(sizeTable), tsize=sizeTable, time=b.times, mem=b.memory, allocs=b.allocs )
                    append!(ret,df)
                end
        # loop-logic
    printfmt("dummy= {1}", fH(dummy) )
    return ret
end

Actual output of the batch, that was produced for the data in the graphic…

julia> df = testEverything(30, 100_000_000, 13, 0)
Iteration #1, warmup= true.
Counters of the 19 different sizes: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]tableSize= 262.1k, access= random  => memory= 0.000, allocs= 0 => Trial(446.196 ms)
tableSize= 262.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(334.162 ms)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.411 s)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(906.360 ms)
tableSize= 4.096k, access= linear  => memory= 0.000, allocs= 0 => Trial(375.776 ms)
tableSize= 4.096k, access= random  => memory= 0.000, allocs= 0 => Trial(440.447 ms)
tableSize= 524.3k, access= random  => memory= 0.000, allocs= 0 => Trial(516.041 ms)
tableSize= 524.3k, access= linear  => memory= 0.000, allocs= 0 => Trial(363.881 ms)
tableSize= 32.77k, access= linear  => memory= 0.000, allocs= 0 => Trial(352.729 ms)
tableSize= 32.77k, access= random  => memory= 0.000, allocs= 0 => Trial(516.605 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(399.582 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(427.079 ms)
tableSize= 1.049M, access= random  => memory= 0.000, allocs= 0 => Trial(568.007 ms)
tableSize= 1.049M, access= linear  => memory= 0.000, allocs= 0 => Trial(390.836 ms)
tableSize= 8.192k, access= linear  => memory= 0.000, allocs= 0 => Trial(360.025 ms)
tableSize= 8.192k, access= random  => memory= 0.000, allocs= 0 => Trial(540.340 ms)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(483.857 ms)
tableSize= 131.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(417.620 ms)
tableSize= 268.4M, access= linear  => memory= 0.000, allocs= 0 => Trial(928.717 ms)
tableSize= 268.4M, access= random  => memory= 0.000, allocs= 0 => Trial(3.280 s)
tableSize= 67.11M, access= random  => memory= 0.000, allocs= 0 => Trial(2.978 s)
tableSize= 67.11M, access= linear  => memory= 0.000, allocs= 0 => Trial(935.133 ms)
tableSize= 16.38k, access= random  => memory= 0.000, allocs= 0 => Trial(488.913 ms)
tableSize= 16.38k, access= linear  => memory= 0.000, allocs= 0 => Trial(419.545 ms)
tableSize= 134.2M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.057 s)
tableSize= 134.2M, access= random  => memory= 0.000, allocs= 0 => Trial(4.120 s)
tableSize= 536.9M, access= random  => memory= 0.000, allocs= 0 => Trial(3.900 s)
tableSize= 536.9M, access= linear  => memory= 0.000, allocs= 0 => Trial(929.721 ms)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(1.185 s)
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(6.127 s)
tableSize= 4.194M, access= random  => memory= 0.000, allocs= 0 => Trial(1.230 s)
tableSize= 4.194M, access= linear  => memory= 0.000, allocs= 0 => Trial(551.794 ms)
tableSize= 16.78M, access= random  => memory= 0.000, allocs= 0 => Trial(3.100 s)
tableSize= 16.78M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.037 s)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(508.653 ms)
tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(887.637 ms)
tableSize= 33.55M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.121 s)
tableSize= 33.55M, access= random  => memory= 0.000, allocs= 0 => Trial(3.715 s)
Iteration #2, warmup= false.
Counters of the 19 different sizes: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]tableSize= 8.192k, access= random  => memory= 0.000, allocs= 0 => Trial(544.064 ms)
tableSize= 8.192k, access= linear  => memory= 0.000, allocs= 0 => Trial(487.343 ms)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(465.791 ms)
tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(827.870 ms)
tableSize= 32.77k, access= linear  => memory= 0.000, allocs= 0 => Trial(424.061 ms)
tableSize= 32.77k, access= random  => memory= 0.000, allocs= 0 => Trial(589.995 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(561.615 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(588.125 ms)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(955.785 ms)
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(5.179 s)
tableSize= 268.4M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.147 s)
tableSize= 268.4M, access= random  => memory= 0.000, allocs= 0 => Trial(3.920 s)
tableSize= 262.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(532.428 ms)
tableSize= 262.1k, access= random  => memory= 0.000, allocs= 0 => Trial(744.403 ms)
tableSize= 524.3k, access= random  => memory= 0.000, allocs= 0 => Trial(663.747 ms)
tableSize= 524.3k, access= linear  => memory= 0.000, allocs= 0 => Trial(633.909 ms)
tableSize= 16.38k, access= linear  => memory= 0.000, allocs= 0 => Trial(615.999 ms)
tableSize= 16.38k, access= random  => memory= 0.000, allocs= 0 => Trial(782.442 ms)
tableSize= 1.049M, access= linear  => memory= 0.000, allocs= 0 => Trial(567.768 ms)
tableSize= 1.049M, access= random  => memory= 0.000, allocs= 0 => Trial(587.174 ms)
tableSize= 33.55M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.005 s)
tableSize= 33.55M, access= random  => memory= 0.000, allocs= 0 => Trial(3.436 s)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.899 s)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.248 s)
tableSize= 134.2M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.155 s)
tableSize= 134.2M, access= random  => memory= 0.000, allocs= 0 => Trial(3.541 s)
tableSize= 67.11M, access= random  => memory= 0.000, allocs= 0 => Trial(3.156 s)
tableSize= 67.11M, access= linear  => memory= 0.000, allocs= 0 => Trial(986.694 ms)
tableSize= 16.78M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.110 s)
tableSize= 16.78M, access= random  => memory= 0.000, allocs= 0 => Trial(3.364 s)
tableSize= 536.9M, access= random  => memory= 0.000, allocs= 0 => Trial(3.842 s)
tableSize= 536.9M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.238 s)
tableSize= 4.096k, access= linear  => memory= 0.000, allocs= 0 => Trial(569.840 ms)
tableSize= 4.096k, access= random  => memory= 0.000, allocs= 0 => Trial(586.638 ms)
tableSize= 4.194M, access= linear  => memory= 0.000, allocs= 0 => Trial(682.322 ms)
tableSize= 4.194M, access= random  => memory= 0.000, allocs= 0 => Trial(1.371 s)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(643.775 ms)
tableSize= 131.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(457.866 ms)
Iteration #3, warmup= false.
Counters of the 19 different sizes: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]tableSize= 16.78M, access= random  => memory= 0.000, allocs= 0 => Trial(2.964 s)
tableSize= 16.78M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.130 s)
tableSize= 67.11M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.070 s)
tableSize= 67.11M, access= random  => memory= 0.000, allocs= 0 => Trial(3.829 s)
tableSize= 4.096k, access= random  => memory= 0.000, allocs= 0 => Trial(675.305 ms)
tableSize= 4.096k, access= linear  => memory= 0.000, allocs= 0 => Trial(580.119 ms)
tableSize= 268.4M, access= random  => memory= 0.000, allocs= 0 => Trial(3.280 s)
tableSize= 268.4M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.287 s)
tableSize= 262.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(622.292 ms)
tableSize= 262.1k, access= random  => memory= 0.000, allocs= 0 => Trial(658.215 ms)
tableSize= 16.38k, access= linear  => memory= 0.000, allocs= 0 => Trial(453.156 ms)
tableSize= 16.38k, access= random  => memory= 0.000, allocs= 0 => Trial(602.407 ms)
tableSize= 131.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(603.467 ms)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(662.592 ms)
tableSize= 8.192k, access= random  => memory= 0.000, allocs= 0 => Trial(700.595 ms)
tableSize= 8.192k, access= linear  => memory= 0.000, allocs= 0 => Trial(437.117 ms)
tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(718.578 ms)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(487.956 ms)
tableSize= 4.194M, access= linear  => memory= 0.000, allocs= 0 => Trial(638.939 ms)
tableSize= 4.194M, access= random  => memory= 0.000, allocs= 0 => Trial(1.535 s)
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(4.352 s)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(991.550 ms)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.037 s)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.625 s)
tableSize= 536.9M, access= random  => memory= 0.000, allocs= 0 => Trial(3.853 s)
tableSize= 536.9M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.056 s)
tableSize= 33.55M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.017 s)
tableSize= 33.55M, access= random  => memory= 0.000, allocs= 0 => Trial(3.118 s)
tableSize= 32.77k, access= random  => memory= 0.000, allocs= 0 => Trial(662.053 ms)
tableSize= 32.77k, access= linear  => memory= 0.000, allocs= 0 => Trial(712.738 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(545.191 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(487.400 ms)
tableSize= 1.049M, access= random  => memory= 0.000, allocs= 0 => Trial(727.315 ms)
tableSize= 1.049M, access= linear  => memory= 0.000, allocs= 0 => Trial(540.514 ms)
tableSize= 134.2M, access= random  => memory= 0.000, allocs= 0 => Trial(3.268 s)
tableSize= 134.2M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.091 s)
tableSize= 524.3k, access= linear  => memory= 0.000, allocs= 0 => Trial(564.646 ms)
tableSize= 524.3k, access= random  => memory= 0.000, allocs= 0 => Trial(877.778 ms)
Iteration #4, warmup= false.
Counters of the 19 different sizes: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(941.583 ms)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(615.029 ms)
tableSize= 4.194M, access= random  => memory= 0.000, allocs= 0 => Trial(1.442 s)
tableSize= 4.194M, access= linear  => memory= 0.000, allocs= 0 => Trial(588.756 ms)
tableSize= 536.9M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.049 s)
tableSize= 536.9M, access= random  => memory= 0.000, allocs= 0 => Trial(3.760 s)
tableSize= 16.78M, access= linear  => memory= 0.000, allocs= 0 => Trial(963.186 ms)
tableSize= 16.78M, access= random  => memory= 0.000, allocs= 0 => Trial(2.983 s)
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(4.620 s)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(1.004 s)
tableSize= 134.2M, access= random  => memory= 0.000, allocs= 0 => Trial(3.291 s)
tableSize= 134.2M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.078 s)
tableSize= 4.096k, access= linear  => memory= 0.000, allocs= 0 => Trial(585.948 ms)
tableSize= 4.096k, access= random  => memory= 0.000, allocs= 0 => Trial(679.346 ms)
tableSize= 67.11M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.104 s)
tableSize= 67.11M, access= random  => memory= 0.000, allocs= 0 => Trial(3.268 s)
tableSize= 268.4M, access= random  => memory= 0.000, allocs= 0 => Trial(3.409 s)
tableSize= 268.4M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.156 s)
tableSize= 262.1k, access= random  => memory= 0.000, allocs= 0 => Trial(671.724 ms)
tableSize= 262.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(704.944 ms)
tableSize= 524.3k, access= random  => memory= 0.000, allocs= 0 => Trial(914.110 ms)
tableSize= 524.3k, access= linear  => memory= 0.000, allocs= 0 => Trial(555.201 ms)
tableSize= 32.77k, access= linear  => memory= 0.000, allocs= 0 => Trial(488.503 ms)
tableSize= 32.77k, access= random  => memory= 0.000, allocs= 0 => Trial(587.351 ms)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(989.530 ms)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.595 s)
tableSize= 8.192k, access= linear  => memory= 0.000, allocs= 0 => Trial(634.589 ms)
tableSize= 8.192k, access= random  => memory= 0.000, allocs= 0 => Trial(632.075 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(652.192 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(624.299 ms)
tableSize= 1.049M, access= random  => memory= 0.000, allocs= 0 => Trial(877.501 ms)
tableSize= 1.049M, access= linear  => memory= 0.000, allocs= 0 => Trial(467.886 ms)
tableSize= 131.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(499.422 ms)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(605.829 ms)
tableSize= 16.38k, access= random  => memory= 0.000, allocs= 0 => Trial(605.889 ms)
tableSize= 16.38k, access= linear  => memory= 0.000, allocs= 0 => Trial(433.698 ms)
tableSize= 33.55M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.443 s)
tableSize= 33.55M, access= random  => memory= 0.000, allocs= 0 => Trial(3.536 s)
Iteration #5, warmup= false.
Counters of the 19 different sizes: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(497.417 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(891.902 ms)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(680.470 ms)
...
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(4.935 s)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(1.102 s)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(876.078 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(648.372 ms)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.715 s)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.019 s)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(620.247 ms)
tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(957.809 ms)
Iteration #13, warmup= false.
Counters of the 19 different sizes: [12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]tableSize= 8.192k, access= linear  => memory= 0.000, allocs= 0 => Trial(640.841 ms)
tableSize= 8.192k, access= random  => memory= 0.000, allocs= 0 => Trial(652.101 ms)
tableSize= 65.54k, access= random  => memory= 0.000, allocs= 0 => Trial(675.442 ms)
tableSize= 65.54k, access= linear  => memory= 0.000, allocs= 0 => Trial(504.926 ms)
tableSize= 32.77k, access= linear  => memory= 0.000, allocs= 0 => Trial(488.360 ms)
tableSize= 32.77k, access= random  => memory= 0.000, allocs= 0 => Trial(707.832 ms)
tableSize= 8.389M, access= random  => memory= 0.000, allocs= 0 => Trial(2.535 s)
tableSize= 8.389M, access= linear  => memory= 0.000, allocs= 0 => Trial(986.555 ms)
tableSize= 131.1k, access= random  => memory= 0.000, allocs= 0 => Trial(641.357 ms)
tableSize= 131.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(476.607 ms)
tableSize= 16.38k, access= linear  => memory= 0.000, allocs= 0 => Trial(537.568 ms)
tableSize= 16.38k, access= random  => memory= 0.000, allocs= 0 => Trial(720.313 ms)
tableSize= 1.074G, access= linear  => memory= 0.000, allocs= 0 => Trial(1.022 s)
tableSize= 1.074G, access= random  => memory= 0.000, allocs= 0 => Trial(4.885 s)
tableSize= 2.097M, access= random  => memory= 0.000, allocs= 0 => Trial(931.812 ms)
tableSize= 2.097M, access= linear  => memory= 0.000, allocs= 0 => Trial(454.713 ms)
tableSize= 4.096k, access= random  => memory= 0.000, allocs= 0 => Trial(597.527 ms)
tableSize= 4.096k, access= linear  => memory= 0.000, allocs= 0 => Trial(587.220 ms)
tableSize= 262.1k, access= linear  => memory= 0.000, allocs= 0 => Trial(631.912 ms)
tableSize= 262.1k, access= random  => memory= 0.000, allocs= 0 => Trial(715.864 ms)
tableSize= 524.3k, access= linear  => memory= 0.000, allocs= 0 => Trial(627.888 ms)
tableSize= 524.3k, access= random  => memory= 0.000, allocs= 0 => Trial(779.896 ms)
tableSize= 16.78M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.052 s)
tableSize= 16.78M, access= random  => memory= 0.000, allocs= 0 => Trial(3.279 s)
tableSize= 536.9M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.190 s)
tableSize= 536.9M, access= random  => memory= 0.000, allocs= 0 => Trial(4.666 s)
tableSize= 1.049M, access= linear  => memory= 0.000, allocs= 0 => Trial(685.274 ms)
tableSize= 1.049M, access= random  => memory= 0.000, allocs= 0 => Trial(911.146 ms)
tableSize= 268.4M, access= random  => memory= 0.000, allocs= 0 => Trial(3.482 s)
tableSize= 268.4M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.084 s)
tableSize= 67.11M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.065 s)
tableSize= 67.11M, access= random  => memory= 0.000, allocs= 0 => Trial(3.231 s)
tableSize= 4.194M, access= linear  => memory= 0.000, allocs= 0 => Trial(614.948 ms)
tableSize= 4.194M, access= random  => memory= 0.000, allocs= 0 => Trial(1.524 s)
tableSize= 134.2M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.076 s)
tableSize= 134.2M, access= random  => memory= 0.000, allocs= 0 => Trial(3.655 s)
tableSize= 33.55M, access= linear  => memory= 0.000, allocs= 0 => Trial(1.119 s)
tableSize= 33.55M, access= random  => memory= 0.000, allocs= 0 => Trial(3.792 s)

Finally, this was the actual function, doing the fetches on the tables
I’m not sure, whether or not, all those conditionals with output got dead-code-eliminated or not - I didn’t care for it, all that much, for a first test-run, but might try removing it, for running again…

# function being called by @benchmark...
function accessData( table, iter, type, verbose=0 )
    entrySize = sizeof(HTEntry)
    nEntries = length(table)

    # make sure, to waste a whole cacheline (64 bytes), with every fetch
    cacheLineStride = 64 ÷ entrySize
    # not using indexMask anymore, just forgot to remove these lines...
    # HashTable is masking the index, such that it suits the size of the table, now.
    if type == -1 indexMask = (cacheLineStride - 1)%UInt64 # linear access
    else indexMask = (nEntries - 1)%UInt64                 # random access
    end
    random = RNG()
    # a bunch of test-output, when verbose>0

    result = 0
    index = 0
    # ACTUAL LOOP, THAT GOT CALLED with iter = 100_000_000
    for i = 1:iter

        if type==-1 index += cacheLineStride % UInt64 # linear
        elseif type==0 index = next(random) % UInt64  # random
        # extend for prefetching with different prefetch-strides
        end

        # bunch of conditional outputs, which may or may not have gotten dead-code-eliminated...
        entry = get(table,index)
        if verbose>2 printfmt("fetched: table[{1}]= {2}\n", fD(index), fH(entry) ) end
        newEntry = HTEntry(entry.d0+1)
        putForced(table, index, newEntry)
        result ⊻= newEntry.d0
        if verbose>2 
            printfmt("changed: table[{1}] to: {2}\n", fD(index), fH( get(table,index) ) ) 
            printfmt("result for compare= {1}\n", fH(result) )
        elseif verbose>1 printfmt("fetched & changed: table[{1}]\n", fD(index) )
        end
    end
    if verbose>0 printfmt("result for compare= {1}\n", fH(result) ) end
    return result
end

RNG:

mutable struct RNG
    state::UInt64
    mask::UInt64
end
RNG() = RNG( time_ns() , typemax(UInt64) )
RNG(seed::UInt64) = RNG( seed, typemax(UInt64) )
function next(r::RNG) return r.state = _rng(r.state)  end
function _rng(x::UInt64) # (2003) Marsaglia 64-bit-xorshift, period = 2^64-1
    x ⊻= (x <<  21)
    x ⊻= (x >>> 35)
    x ⊻= (x <<   4)
    return x % UInt64
end

I removed all those outputs & also split the general loop into 2 specific ones (1x for random, 1x for linear access), which yields another conditional-elimination. I’ll let it run, overnight and see, what that changes… :yum:


…now there’s clear evidence of the 2 steps: L1=>L2, L2=>L3 & L3=>insufficient cache-capacities. :blush:
Also, with all conditionals removed (that cost a ton, obviously), I had to adjust the scaling from MB/sec. to GB/sec. The linear-access - times gained ~x5 in performance, in the L1-domain (up to 32Kbyte, on my Laptop). Overall, the linear max. throughput (with hardware-prefetches) give me a good (rough!) orientation, for reasoning, in which domains of tablesizes, it might make sense, to consider prefetching, when randomly accessing my tables, but only in cases, ofc, where the program-logic even allows for an early prefetch, do something else, and only then fetch the relevant entry for further processing.

But also, the random vs. linear throughput cannot be compared, directly, ofc, as more logic and computation is involved, for even randomly accessing (which clearly, pollutes my measurements)…

3 Likes


Finally came around, playing with the SW-prefetch, that is available through the llvm-calls (I used the VectorizationBase-wrapper of it). Really pleased - so nice, being able to do this low-level - stuff, on a platform, where I can just plot the results in a few lines, and then looking so pretty. I like julia more and more. :yum:
Note: The abs. values between the data is not directly comparable, across colors, as those implementations are inherently different. Cyan was linear access, yellow+green are both random-access and green has quite a bit of logic added in the time-critical loop to be able to test different strides (i.e. I’m filling a circular buffer with random indices, having the actual fetch-index lag k steps behind the prefetch-index, while moving both +1 per iteration, in the circular buffer).

3 Likes

Glad you’re liking it! Those benchmarks look great!

1 Like