# Huge sparse array construction

Hello,
I have a huge sparse matrix (1 million × 1 million) that I am trying to construct efficiently. I am using the following code:

```julia
D_matrix = sparse(I, J, V)
dropzeros!(D_matrix, trim = true)
```

Here `I`, `J`, and `V` are very long vectors. This call uses the following resources:

```
  7.964829 seconds (56 allocations: 6.505 GiB, 19.91% gc time)
```

I was wondering if there is any way to reduce the memory consumption here. I thought of something like splitting the vectors in two, assigning the first half, freeing it, calling `GC.gc()`, then assigning the other half. Is there any other way to save some memory here?

How many entries does your sparse matrix have? Is there any reason to think that Julia is using much more memory than the space required to actually store your sparse matrix?


Is `V` full of zeros? If so, delete them first.


Not really. I tried that, no big difference.

OK, then how long are `I`, `J`, and `V`?

```julia
length(V) = 217778220
length(I) = 217778220
length(J) = 217778220
```

If you can generate the non-zero entries of the sparse matrix by columns you can easily beat the `sparse` method by constructing the internal fields of `SparseMatrixCSC` directly. I do that often, but be advised that such an approach is not guaranteed to remain valid if somebody refactors the `SparseMatrixCSC` implementation at some point.
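For illustration, here is a minimal sketch of that idea. Hedged: `colptr`, `rowval`, and `nzval` are internal fields of `SparseMatrixCSC`, and this layout is not a guaranteed public API.

```julia
using SparseArrays

# Build a 3×3 matrix with entries (1,1)=>1.0, (3,1)=>2.0, (2,2)=>3.0
# directly in CSC form, skipping the (I, J, V) triplet stage entirely.
m, n = 3, 3
colptr = [1, 3, 4, 4]      # column j's entries live at colptr[j]:colptr[j+1]-1
rowval = [1, 3, 2]         # row index of each stored entry, sorted within each column
nzval  = [1.0, 2.0, 3.0]   # the stored values
A = SparseMatrixCSC(m, n, colptr, rowval, nzval)
```

You are responsible for the CSC invariants that `sparse` would otherwise enforce: `colptr` has length `n + 1`, row indices are sorted and unique within each column, and `length(rowval) == length(nzval) == colptr[end] - 1`.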


If you have 217778220 nonzero entries, then just storing the values (as 64-bit floating-point numbers) and indices (as 64-bit integers) requires

```julia
3 * 217778220 * 8 / 2^30  # ≈ 4.87
```

GiB, i.e. about 4.87 GiB. So, the memory usage of 6.5 GiB is not that surprising (that is, it indicates a 34% overhead of allocated memory in the process of constructing the sparse-matrix data structure, which is not particularly large). You could save a bit of memory (33%) by using 32-bit integers for the indices, but fundamentally you will need several GiB for this matrix (unless there is some other structure you can exploit besides sparsity).
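For instance, `sparse` accepts any integer eltype for the index vectors, so converting `I` and `J` to `Int32` (safe here, since 10^6 is far below `typemax(Int32)`) yields a matrix whose index arrays are half the size. A small sketch with stand-in triplet data:

```julia
using SparseArrays

# Stand-in data; in the real problem these vectors have ~2.2e8 entries.
I = [1, 2, 1_000_000]
J = [1, 2, 1_000_000]
V = [1.0, 2.0, 3.0]

# Int32 indices halve the memory for the two index arrays.
A = sparse(Int32.(I), Int32.(J), V, 1_000_000, 1_000_000)
```

The result is a `SparseMatrixCSC{Float64, Int32}`, so the savings persist in the stored matrix, not just during construction.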

As @pablosanjose suggests, if you generate the CSC sparse format directly instead of `(I, J, V)` data, then you can avoid allocating the sparse data twice, saving a factor of 2 in memory (or a little more, since the overhead goes away), and it will be faster.


Could you show an example of how?
If it is so useful, maybe it should become official (at least to keep the option available, even if the technique changes over time).

By the way, what about representing `I` and `J` in a more elegant way? Some kind of iterator or concatenation of ranges, since they usually have some special form.
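If the pattern is regular enough (e.g. banded), you can sometimes avoid materializing `I` and `J` altogether; a sketch using `spdiagm` from SparseArrays for a tridiagonal pattern:

```julia
using SparseArrays

n = 1_000
main = ones(n)            # main diagonal values
off  = fill(0.5, n - 1)   # off-diagonal values

# Tridiagonal sparse matrix built from diagonals; no explicit I and J vectors.
A = spdiagm(0 => main, 1 => off, -1 => off)
```

For fully general patterns, though, `sparse` still needs concrete index vectors, so this only helps when the structure maps onto diagonals or similar building blocks.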


I have some iterators in my (WIP) Quantica.jl package that I use for such a thing (in src/iterators.jl). The idea is to use a builder: construct `s = SparseMatrixBuilder{T}()`, push an element `val::T` to a given `row` with subsequent calls to `pushtocolumn!(s, row, val)`, and when a column is done, jump to the next column with `finalizecolumn!(s)`. At the end you get your sparse matrix with `sparse(s)`. Internally, this just builds the `SparseMatrixCSC` struct incrementally. A small, slightly silly example:

```julia
julia> using Quantica, SparseArrays

julia> nnonzeros = 217778220; n = 10^6; V = rand(nnonzeros);

julia> function growmatrix(V::Vector{T}, nrows) where {T}
           s = Quantica.SparseMatrixBuilder{T}()
           row = 0
           Δrow = nrows ÷ 10
           for val in V
               row + Δrow > nrows && Quantica.finalizecolumn!(s)
               row = rem(row + Δrow - 1, nrows) + 1
               Quantica.pushtocolumn!(s, row, val)
           end
           return sparse(s)
       end

julia> @time m = growmatrix(V, n);
 22.052439 seconds (68.88 k allocations: 3.637 GiB, 2.92% gc time)
```

This particular example appears quite a bit slower than your direct `sparse` call, but it uses around half the memory. I'm not sure how useful this kind of approach is for the OP's case.

EDIT: the iterator above has itself some overhead because it sorts each column after it is pushed. If the input data is already column-sorted, that overhead could be reduced.
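If the triplets are not already in column order, one way to sort them column-major beforehand is a single `sortperm` pass (a sketch; note this allocates a permutation vector, so it only pays off if per-column sorting in the builder is the bottleneck):

```julia
# Stand-in triplet data, out of column order.
I = [2, 1, 3, 1]
J = [2, 1, 1, 2]
V = [4.0, 1.0, 2.0, 3.0]

# Sort triplets column-major: by column first, then by row within a column.
# Tuples compare lexicographically, so (J, I) pairs give the CSC visit order.
p = sortperm(collect(zip(J, I)))
Is, Js, Vs = I[p], J[p], V[p]
```

After this, `Js` is non-decreasing and `Is` is sorted within each column, which is exactly the order a CSC builder wants to consume.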
