I tried using a dict of dicts to imitate the storage of a sparse matrix. The structure is Dict{Int64, Dict{Int64, Float64}}, where the outer dict’s key is the column index of the sparse matrix, the inner dict’s key is the row index, and the inner dict’s value is the value of the sparse matrix at that (col, row).

I created thousands of these dicts of dicts, and added key-value pairs to the outer dicts in a for loop (which also does other heavy calculation). During the loop, I found my memory increased dramatically when I added the key-value pairs to the outer dicts, while the actual memory of these objects is much smaller. I don’t know why.

And someone told me a dict of dicts will cause a memory leak; I don’t know if that’s true. If it is, is there a better way to imitate a sparse matrix? I didn’t use a sparse matrix directly because I only need specific columns, not all of them, and new columns will be needed from time to time, so I need to update the sparse matrix occasionally.

Thanks in advance!

A memory leak would be memory not being freed when it’s supposed to be, but it’s not clear if you need to remove entries that are “done”. I’m not sure what you mean by actual memory, but `Dict`s deliberately use more memory than their actual entries (lower “load factor”) to maintain performance. At minimum, a `Dict` will have space for 16 entries, even if you only put 1 entry in it.
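You can see this overhead yourself with `Base.summarysize` (a rough sketch; the exact byte counts depend on your Julia version, the point is just the fixed overhead):

```julia
# A Dict with a single entry: the payload is tiny, but the Dict
# pre-allocates slots for at least 16 entries plus bookkeeping.
d = Dict{Int64, Float64}()
d[1] = 1.0

payload = sizeof(Int64) + sizeof(Float64)   # 16 bytes of actual data
total = Base.summarysize(d)                 # total bytes the Dict occupies

println("payload = $payload bytes, Dict uses $total bytes")
```

On my understanding, `total` will come out many times larger than `payload`, which is expected behavior rather than a leak.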

Julia does come with a sparse matrix implementation, `SparseMatrixCSC{Tv, Ti}`. That stores the nonzero values, their row indices, and their column indices in 3 vectors, no dictionaries. If you really want to use a dictionary, you don’t need to nest 1-entry dictionaries; you can make a `Dict{Tuple{Int64, Int64}, Float64}` holding `(col, row) => value`.
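For example, a minimal sketch of that flat-dictionary layout (the variable names are just illustrative):

```julia
# One Dict keyed by (col, row) instead of nested single-entry Dicts.
S = Dict{Tuple{Int64, Int64}, Float64}()

S[(3, 7)] = 2.5     # value 2.5 at column 3, row 7
S[(3, 9)] = -1.0

# Look up with a default of zero for structurally absent entries:
get(S, (3, 7), 0.0)   # 2.5
get(S, (1, 1), 0.0)   # 0.0
```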

I won’t remove entries throughout the program, but I will frequently add some.

I get the actual memory of the objects I store with Base.summarysize. When running the program, I found the RSS and max RSS of my program are much larger than the actual memory, so I wondered whether a memory leak had happened.

I use Dict{Int64, Dict{Int64, Float64}} because I need to apply the matrix to a sparse vector. This structure is convenient because I only need to read out the columns that exist in the sparse vector (the inner dicts), multiply each by the corresponding coefficient in the sparse vector, and then add these dicts together to get the result dict. It’s more convenient than (col, row) => value, I guess.
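The multiply described above can be sketched like this (my reading of the approach, with a toy `dictmatvec` function and a sparse vector also stored as a Dict):

```julia
# A is a Dict of columns; x is a sparse vector as (index => coefficient).
# y = A*x only touches the columns that actually appear in x.
function dictmatvec(A::Dict{Int64, Dict{Int64, Float64}},
                    x::Dict{Int64, Float64})
    y = Dict{Int64, Float64}()
    for (j, coeff) in x
        haskey(A, j) || continue          # column j is entirely zero
        for (i, v) in A[j]                # accumulate coeff * A[:, j]
            y[i] = get(y, i, 0.0) + coeff * v
        end
    end
    return y
end

A = Dict(1 => Dict(2 => 5.0), 2 => Dict(2 => 1.0, 3 => 4.0))
x = Dict(1 => 2.0, 2 => 3.0)
dictmatvec(A, x)    # Dict(2 => 13.0, 3 => 12.0)
```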

I didn’t use a sparse matrix in Julia, because I don’t store the whole sparse matrix; I only store the needed columns, inferred from the sparse vector it will be applied to. So as the number of the sparse vector’s components increases, I need to calculate new columns and create a new sparse matrix. I guessed that would be inefficient, so I didn’t try it.

Honestly I’m not sure exactly what you need; it does sound like you need constant-time access to a value with the column as the primary key, so your original data structure makes sense for that. Unfortunately a dictionary must allocate empty space for good performance.

FYI you can add more entries to a `SparseMatrixCSC{Tv, Ti}`, at least until you fill its size completely. The `CSC` part is “compressed sparse columns”, which means it’s designed for efficient column retrieval. I misspoke earlier: it doesn’t store column indices straightforwardly.
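To make that concrete, here’s a quick look at the three internal vectors (a sketch; the field names `colptr`, `rowval`, `nzval` are the current `SparseArrays` internals, not a formally documented public API):

```julia
using SparseArrays

# A 4×3 matrix with entries at (2,1) and (3,2), to peek at the CSC layout.
A = sparse([2, 3], [1, 2], [5.0, 7.0], 4, 3)

A.nzval    # [5.0, 7.0]     nonzero values, column-major order
A.rowval   # [2, 3]         row index of each nonzero
A.colptr   # [1, 2, 3, 3]   column j's nonzeros live at colptr[j]:colptr[j+1]-1
```

So instead of one column index per nonzero, there is one “column pointer” per column, delimiting where each column’s run of nonzeros starts.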

I have an operator pool, and every operator has a matrix representation; I have a vector that the operators will act on. At first the vector has only one entry, so I only need one column of the operator (corresponding to that entry) to perform the matrix-vector multiply. Then I add more entries to the vector, so next time I need more columns of the operators. The dimension of my operators is big, about 2^30, so I can’t build the whole matrix up front.

Actually my structure is the same idea as the CSC type, I guess.

Can I add entries to a SparseMatrixCSC? I didn’t find it in Julia’s documentation and I don’t know where to look. If there is an efficient way to do this, that would be best; please teach me!

Yeah, you can use sparse matrices just like regular ones (although adding to them is a bunch slower, since you have to shuffle everything around).

It’s a subtype of `AbstractArray`, so it has `setindex!` like most of the others.

```
julia> using SparseArrays
julia> x = spzeros(Int8, 4, 3) # Int8 type, 4 rows, 3 columns
4×3 SparseMatrixCSC{Int8,Int64} with 0 stored entries
julia> x[2, 1] = Int8(5) # setindex!
5
julia> x
4×3 SparseMatrixCSC{Int8,Int64} with 1 stored entry:
[2, 1] = 5
julia> x[:,1] # getindex column (efficient)
4-element SparseVector{Int8,Int64} with 1 stored entry:
[2] = 5
```

Whether it’s appropriate for your case, you’ll have to try and see. There are many other possible implementations of sparse matrices that aren’t in the `SparseArrays` library. I know of `SparseMatricesCSR.jl` for compressed sparse rows, but I’m not familiar with other Julia implementations.

```
julia> Base.summarysize(spzeros(ComplexF64, 2^30, 2^30))/2^20
8192.000160217285
```

Zero stored entries, but about 8 GiB of memory.

And I need a pool of about a thousand matrices of that dimension, which would cost about ten thousand GiB of memory from the beginning.

I’m confused: is there something wrong with my code?

That’s how the column pointer information is stored; it’s about as large as the number of columns. With `Ti = Int64`, the column pointer vector has `2^30 + 1` entries of 8 bytes each, which is the ~8 GiB you measured. I don’t really understand why, but some CSC implementations are like that, while others store far fewer column pointers than there are nonzero values.

A `SparseMatrixCS_` is not a good storage format if your matrix will have many more columns (for CSC, or transposed CSR) or rows (for CSR, or transposed CSC) than nonzeros. However, matrices with many fewer nonzeros than both rows *and* columns are rare in practice (most columns and rows would be entirely zero, so why are they even there?).

But if that’s the regime you find yourself in, you might do well with a `Dict{Int,SparseVector{Float64,Int}}` - a `Dict` of `SparseVector`s where the `Dict` key indexes the column (or row) and the `SparseVector` holds the corresponding column (or row). This allows you to have ultra-sparse columns (rows) but then to use efficient `SparseVector` storage and operations for each column (row).
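A minimal sketch of that layout, applied to the matrix-vector multiply from earlier in the thread (the function and variable names are just illustrative):

```julia
using SparseArrays

# Each stored column is a full-dimension SparseVector; the outer Dict
# holds only the columns computed so far.
function apply(cols::Dict{Int, SparseVector{Float64, Int}},
               x::Dict{Int, Float64}, n::Int)
    y = spzeros(Float64, n)
    for (j, coeff) in x
        haskey(cols, j) && (y += coeff * cols[j])   # y += coeff * A[:, j]
    end
    return y
end

n = 2^30                                      # huge dimension, tiny storage
cols = Dict{Int, SparseVector{Float64, Int}}()
cols[5] = sparsevec([3, 12], [1.5, -2.0], n)  # column 5, two nonzeros

x = Dict(5 => 2.0)       # sparse input vector with one coefficient
y = apply(cols, x, n)
nnz(y)                   # 2
```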

Yes, I am trying a dict of SparseVector structs; I hope it won’t cause memory issues.

When I used the dict of dicts before, the biggest problem was that max_rss sometimes increased sharply. I tried reusing the dicts in the intermediate computation, which didn’t work. Then I stopped storing the matrix elements at all (no dict of dicts); my RSS and GC live stayed constant, and so did max RSS, but a long time later I still got a very big max RSS. I’m really confused.

```
5 RSS: 742.336 MiB GC live: 161.313 MiB Max. RSS: 2294.355 MiB
6 RSS: 1563.227 MiB GC live: 166.690 MiB Max. RSS: 72965.676 MiB
```

During these two steps I have a `scipy.optimize.minimize` call. (I also have thousands or even more small dict allocations here; maybe I need to reuse them too.)

I want to know if I can pre-allocate a big memory block for the data I need to keep throughout the program, so that the many little memory blocks allocated for intermediate computation won’t disturb that memory and can be freed easily.
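As far as I know, Julia doesn’t let you hand the GC a user-managed arena for boxed objects like Dicts, but you can at least reserve container capacity up front with `sizehint!` so long-lived containers allocate once instead of growing repeatedly in a hot loop. A small sketch of that idea:

```julia
# Reserve a Dict's capacity before the hot loop, so it doesn't
# repeatedly reallocate and rehash while entries are added.
results = Dict{Int64, Float64}()
sizehint!(results, 100_000)      # one big allocation for ~100k entries

for i in 1:100_000
    results[i] = sqrt(i)         # insertions reuse the reserved slots
end

length(results)    # 100000
```

This doesn’t stop temporary allocations elsewhere from raising RSS, but it does keep the long-lived structure from contributing to the churn.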