Where is my memory usage coming from?

While developing a package, I noticed that a simulation that used to run without problems now takes a whopping 30+ GB. The script is rather simple: I load some packages and then execute a let block. There should be no variables leaked into global scope, as everything is enclosed within the let block. Yet at the end of the simulation, Julia keeps holding on to this amount of memory. The usual way to analyze allocations is to track them, but I’m more interested in why they aren’t being garbage collected at the end of the simulation (even after GC.gc()).

The packages it depends on do use some kind of caching, but the summarysizes of those caches add up to only about 600 megabytes.

Here is the script in question (I have simplified the simulation so it is runnable on my laptop, but it still uses far too much memory). To run it, one needs:

this branch of MPSKit

this fork of TensorKit

I’m primarily interested in the memory usage after the simulation is done, since at the very least I’d expect that to be manageable.

I think I know what the memory is being used for, but I don’t understand why summarysize does not show the proper amount of memory used:

julia> total = 0; for (k, v) in SUNRepresentations.CGCCACHE[(3, Float64)]
           total += Base.summarysize(k) + Base.summarysize(v)   # per-entry sizes
       end

julia> total

julia> Base.summarysize(SUNRepresentations.CGCCACHE[(3, Float64)])

There are no duplicates in that cache; every key has a unique value counterpart.

The cache is a non-constant global Dict{Any,Any} - basically everything about this is bound to be slow and hard to clean up. The compiler has to check both the type and the location of that global every time it is accessed, since it may change at any time. Accessing non-constant global variables may also allocate:

julia> a = rand(1000);
julia> f(x) = sum(a) * x
f (generic function with 1 method)
julia> @time f(5)
  0.000008 seconds (2 allocations: 32 bytes)
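For contrast, a hedged sketch of the usual fix (the names b and g here are illustrative, not from the script above): marking the global const fixes its type, so a function that reads it no longer allocates.

```julia
# Sketch: with `const`, the compiler knows the global's type at compile time,
# so the access is resolved statically and the call is allocation-free.
const b = rand(1000)   # type is now fixed: Vector{Float64}
g(x) = sum(b) * x      # `b` no longer needs a runtime type check

g(5)  # first call compiles; subsequent calls allocate nothing
```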

What kind of data are you storing in that Dict that requires it to be typed Any?


This cache is actually a dict of dicts. The problem is that you may want to cache the Clebsch-Gordan coefficients of SU(N) for multiple N’s, and then you may want this object at multiple levels of precision (so different eltypes). The dictionary is therefore (element_type, N) => dictionary of (SU(N) sectors) => the Clebsch-Gordan coefficients.

Right. So the type of the keys of CGCCACHE seems to be Tuple{DataType,Int} and the values are of type Dict{NTuple{nSUN, SUNIrrep{N}}, SparseArray{T,4}} where {N, nSUN, T}.

Neither seems to require Dict{Any,Any}, nor a global at all - why not have a constructCache() function that gives the user a cache to manage themselves? That would prevent unwanted memory growth after execution, since the user can call constructCache() inside a function just as well, which would allow GC to clean up after the function has exited (or the user can keep the cache around for longer by returning it).
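A minimal sketch of that suggestion (constructCache and the concrete key/value types here are hypothetical stand-ins, not the package’s real API):

```julia
# Hypothetical sketch: the caller builds and owns a concretely-typed cache.
constructCache(::Type{T}) where {T} = Dict{NTuple{3, Int}, Array{T, 4}}()

function run_simulation()
    cache = constructCache(Float64)
    # ... fill and query `cache` during the simulation ...
    cache[(1, 1, 1)] = zeros(Float64, 1, 1, 1, 1)
    return nothing   # `cache` becomes unreachable here, so GC can reclaim it
end

run_simulation()
```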

Looking at the code that seems to fill CGCCACHE, it looks like this type is just

Dict{NTuple{3, SUNIrrep{N}}, SparseArray{T,4}} where {N, T}


Right, I mistakenly thought typeof(Float64) == Type{Float64}. I’m not entirely sure how to make a user-managed cache play nice with TensorKit, with which this package is primarily meant to be used.

I guess my confusion was that varinfo(all=true, imported=true, recursive=true) just crashes, and if you then fix this, the summarysizes only add up to a fraction of the total cost (apparently this is now fixed in v1.7.0 beta 4: Base.summarysize() incorrect for Array of structs · Issue #41941 · JuliaLang/julia · GitHub).

There are often cases where I’d like to cache these things to disk instead of keeping old references in ram. Is there some canonical mechanism for this, like how you’re supposed to use the artifact system when working with binary dependencies?

I don’t know how TensorKit caches things (if at all, a quick search on github for “cache” in that repo didn’t turn up anything), so I’m not sure what you mean by that. A cursory look through TensorKit didn’t help in figuring out where the memory is spent - the library is a tad too big to just jump in without an explicit (somewhat short) reproducible example of high memory usage.

The artifact system can be used for anything that’s a blob of data to be served to users - it’s not limited to libraries and code. See here for more info. How you want to store the data exactly is up to you though and depending on what exactly you want to do/save, you’ll have to choose different serialization strategies.
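For the disk idea, the Serialization standard library is one possible mechanism. A minimal sketch (cached_compute and the file layout are my own invention, not an established convention like artifacts; a real cache would also key the file name on something like (N, eltype)):

```julia
using Serialization

# Hypothetical sketch: recompute only when no serialized result exists on disk.
function cached_compute(path::AbstractString, compute)
    isfile(path) && return deserialize(path)  # cache hit: read from disk
    result = compute()
    serialize(path, result)                   # cache miss: compute and persist
    return result
end

path = joinpath(mktempdir(), "cgc.jls")
a = cached_compute(path, () -> rand(4, 4))   # computes and writes the file
b = cached_compute(path, () -> error("not called: read from disk instead"))
```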

Personally, I’m not sure why you’d want to have one cache for all element types. Intuitively it would be better to stay in one element type when doing a calculation/operation.

May I ask, do you want to introduce a cache because TensorKit is highly optimized already, but still too slow, or are you trying to speed up your own calculation by reusing results, in spite of possibly greater gains being made from improving TensorKit?

Nevermind, I found them:

$ rg "cache ="
479:const transposecache = LRU{Any, Any}(; maxsize = 10^5)
480:const usetransposecache = Ref{Bool}(true)
962:const braidcache = LRU{Any, Any}(; maxsize = 10^5)

yeah… those caches have the exact same problem of being globals with Any keys and Any values. That will kill overall performance and make it much harder to profile…

I should elaborate. My own calculations require tensor contractions of SU(3)-symmetric tensors. TensorKit works independently of the particular symmetry: you just define what kind of irreps exist for your group along with some properties (one of which is the f-symbols), and off you go.

TensorKit in turn queries certain f-symbols for your irreps, which in this case can be calculated as a function of Clebsch-Gordan coefficients (CGCs). Some f-symbols re-use the same CGCs, so it makes sense to store them in case they have to be re-used.

TensorKit on its own is fast (those caches may be Any,Any, but all the practical cost sits in matrix-matrix multiplications), as is SUNRepresentations (it’s much faster than the original C++ code from the paper it’s based on), but the combination of the two would be very slow if it had to re-calculate all relevant CGCs every time a new f-symbol is required. This calculation is expensive, much more expensive than a query from a global type-unstable cache.

That’s why a user-managed cache also doesn’t entirely fit. As an example, this is the usual way you’d want to use the two:

tensor = TensorMap(rand,         #= initializer =#
    ComplexF64,                  #= element type of the tensor =#
    Rep[SU{3}]((0, 0, 0) => 1),  #= codomain =#
    Rep[SU{3}]((0, 0, 0) => 1)); #= domain =#
# ... calculations with tensor

There seems to me to be no clean way to use a user-managed cache, and I don’t think you’d want to either. I do agree that you really don’t want some kind of object that can grow indefinitely, though (so I should use some kind of LRU cache).
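A self-contained sketch of a bounded cache using only Base (a real implementation would use LRUCache.jl, as TensorKit’s transposecache does; the eviction here is a crude placeholder, not true LRU, and the key/value types are illustrative):

```julia
const MAXSIZE = 1000
const CGC_CACHE = Dict{NTuple{3, Int}, Array{Float64, 4}}()

function get_cgc(key::NTuple{3, Int})
    get!(CGC_CACHE, key) do
        # crude eviction: wipe everything when the bound is hit (not real LRU)
        length(CGC_CACHE) >= MAXSIZE && empty!(CGC_CACHE)
        zeros(Float64, 2, 2, 2, 2)   # stand-in for the expensive CGC computation
    end
end
```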

In any case, the original problem seems solved: the summarysize of my caches was reported incorrectly, and this is fixed in the latest Julia versions.

How big are these matrices usually? As far as I can tell, TensorKit doesn’t have matrices (as in, multidimensional arrays of values) but operates on the abstract transformations represented by them. Can you point me in the right direction of the code? I’d like to investigate where the high memory use is coming from in the first place.

Well, your question about summarysize is solved, but the other part (“why is there so much memory used in the first place”) still stands imo.

In my experience, having a type unstable frontend can mask problems further down the stack, because the type instability can prevent some optimizations from taking place if it propagates far enough (I haven’t checked with @code_warntype though).

I think I know why so much memory is suddenly used. The answer is a bit technical, but the point is that some symmetries don’t allow you to “braid” different spaces. In array terminology, you are no longer allowed to do permutedims(A, (2,1,3,4,5)); only cyclic permutations are allowed. I rewrote MPSKit so that it indeed only uses those kinds of operations, but it turns out that this translates to many more f-symbols being required. The dimension of an irrep very quickly blows up if you use SU(3) (the dimension of the SU(3) irrep (15,7,0) is 612, and the CGCs are maps from (irrep, irrep) to irrep).
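In plain-array terms, the restriction looks like this (the arrays are illustrative; in TensorKit the objects are TensorMaps, not Arrays):

```julia
A = rand(2, 3, 4)

# A cyclic permutation of the indices keeps their cyclic order intact:
B = permutedims(A, (2, 3, 1))   # allowed even without a braiding
size(B)                          # (3, 4, 2)

# A transposition like (2, 1, 3) breaks the cyclic order; for symmetries
# without a braiding, such a permutation is simply not defined.
```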

TensorKit’s array operations start out in the TensorMap type: TensorKit.jl/tensor.jl at master · Jutho/TensorKit.jl · GitHub.