Large amount of RAM for double for loop for dictionaries

I have a function that computes a float value from two dictionaries (basically by ‘multiplying’ the dictionaries, with a constraint that the keys (tuples) summed pairwise are in another set), but it uses a huge amount of RAM at runtime. The individual dicts (left, right) have around 200k entries each (and patterns has around 5M entries), and collectively occupy <1GB according to Base.summarysize(). Typically the tuples are size m=16.
When I run this function, it is using sometimes over 15GB of RAM (according to top). I’d like to understand this and ideally reduce the memory requirements (i’m doing this in an HPC environment and need to run this function hundreds of times, so the memory is limiting what I can process right now).

I generate the dictionaries by loading data from file inside a main() function that is called when I launch the script, and then call dict_mult. I haven’t done a detailed profiling of the code, but I was wondering if there were any julia wizards out there for whom something obvious leaps out here as bad practice or could give GC a difficult time etc.

Thanks for any help at all! (p.s., on julia 1.8.2, linux x86).

function dict_mult(left:: Dict{NTuple{m, UInt8}, ComplexF64}, right:: Dict{NTuple{m, UInt8}, ComplexF64}, patterns:: Set{NTuple{m, UInt8}}) where m
    P:: Float64 = 0.

    for (s1, a1) in left
        p1 = abs(a1)^2
        for (s2, a2) in right
            if s1 .+ s2 in patterns
                p2 = abs(a2)^2
                p12 = p1 * p2
                P += p12
            end
        end
    end
    return P
end

I think your function is non-allocating (i.e. should hardly use any RAM beyond what the dictionaries already use before this function is run), as long as you make sure that the arguments supplied, left, right and patterns are const global variables or local variables of another function calling this function.

1 Like

not related to allocations but fyi you can use abs2(x) rather than abs(x)^2

I think the s1 .+ s2 can allocate above m > 10

2 Likes

Thanks for the reply! OK, good to know i’m not missing something totally obvious. This is what i’d have thought also. The dicts are all unchanging, but I had not specifically set them as const type (didn’t know this was an option). Maybe that will help…

1 Like

Thanks, I will update that!

And actually I forgot to say what the typical m value is (will update description). Indeed it’s usually a larger than 10, up to at most m=16.

Julia version 1.9+ supports the --heap-size-hint command line option to control the aggressiveness of GC. For example, use it with --heap-size-hint=4G. Since each of your allocation, assuming that it’s from s1.+s2, is very small (albeit in a tight loop), making the GC more aggressive should quickly bring down the memory usage.

Oh that’s a great idea! I had also not known about this option. I will update my julia version and try this!

[Update]: marking this as solution. After making this simple change, I find the code is now (mostly) running within the desired memory limits.

You should be able to compute this lazily, by defining a new type that computes the hash without explicitly forming the sum. e.g.

struct SummedPair{N,T}
    s1::NTuple{N,T}
    s2::NTuple{N,T}
end
function Base.hash(p::SummedPair{N}, h::UInt=zero(UInt)) where {N}
    # equivalent to hash(p.s1 .+ p.s2, h):
    h += Base.tuplehash_seed
    @inbounds for i = N:-1:1
        h = hash(p.s1[i] + p.s2[i], h)
    end
    return h
end
function Base.:(==)(p1::SummedPair{N}, p2::SummedPair{N}) where {N}
    @inbounds for i = 1:N
        p1.s1[i] + p1.s2[i] == p2.s1[i] + p2.s2[i]  || return false
    end
    return true
end
function Base.:(==)(p::SummedPair{N}, sum::NTuple{N}) where {N}
    @inbounds for i = 1:N
        p.s1[i] + p.s2[i] == sum[i] || return false
    end
    return true
end

Then you can simply do SummedPair(s1, s2) in patterns and it should be allocation-free regardless of the tuple length.

(You can also do a similar trick with ordinary array/vector keys, which might be better than tuples anyway once the length gets long. e.g. passing a long vector around between functions and data structures just involves copying a pointer.)

Perhaps the allocation profiler could help you identify how much is being allocated, and where. I have never tried it, but it should be simple enough to google some documentation

Edit: there is a simple example in the readme of PProf.jl

Wow thanks for this! I certainly wouldn’t have come up with this method! I’ll see if I can get this working in my code.

1 Like

Ah good idea, yes it would be good to check first it’s the line s1 .+ s2 that is causing the issue. Thank you!

1 Like