Large matrix with many updates

I would like to calculate a count matrix (expected to be 65,000x65,000) and then calculate PMI and then calculate SVD on it.
Unfortunately many “large matrix” samples, tutorials start with rand(n,n) where the matrix is generated at once. In that case, HDF5 or Distributed works fine. But, I don’t see examples where matrix is frequently updated in which case solutions like HDF5 might be very slow or inefficient.
Below is a MWE. The sample code picks 21 random numbers and generates pairs and then the count matrix is updated for each pair (x,y). Edit: just a reminder; below is just a minimal example, should run without a problem, it will fit memory. But when I try to create a 65,000x65,000 matrix, Julia is killed, I need solution for 65,000x65,000 case. I annotated the code so that sample code does not mislead you, this is just a minimal example, the real-life scenario is much bigger. Edit2: count_mat is not a sparse matrix

using LinearAlgebra
using StatsBase
using IterTools: subsets

# naive approach

kmers = 1:4096;                       # actually 1:65_000
count_mat = zeros(Int32, 4096,4096);  # actually 65_000 x 65_000
                                      # count_mat is NOT sparse
# get random 21 numbers and generate pairs, total of 210 pairs
function get_pairs(data)
  my_sample = sample(data, 21)
  [i for i in subsets(my_sample, 2)]

## count pairs and update matrix 
for i in 1:1_000_000                 # actually 500_000_000
  for (x,y) in get_pairs(kmers)      # 210 pairs per loop cycle
    count_mat[x,y] += 1              # actually this process will be performed ~ 100 billion times

in actual case, the count_mat is expected to be (65,000x65,000) and the loop will go over 500 million pairs. so, a distributed and parallel solution is needed. How can I achieve this for much larger matrix?

I tried SharedArrays and Distributed for toy example (4096,4096) and it was fine but does not work with 65k matrix. I guess I need a disk-based (or chunked) solution which allows frequent and parallel updates.

Just for completeness, here’s rest of the code where PMI and SVD (truncated to 100 dimensions) are calculated.

## pmi and svd
pmi_mat = zeros(Float64, 4096, 4096);

mat_sum =  sum(count_mat);
mat_colsum =  sum(count_mat, dims=1); 
mat_rowsum =  sum(count_mat, dims=2); 

rows, cols = size(count_mat)
@time for row in 1:rows
  for col in 1:cols
    tmp = (count_mat[row,col] * mat_sum) / ( mat_rowsum[row] * mat_colsum[col] )
    tmp_res = tmp == 0 ? 0 : log2(tmp)
    pmi_mat[row, col] = tmp_res;

F= svd(pmi_mat);
U, S, V = F;
println(U[:, 1:100])

So, now we have two more problems: calculate another large matrix and calculate SVD from larger-than-memory matrix. I hope there are practical solutions for all of them.

Using a sparse matrix should help some. I added a progress bar using ProgressMeter.jl since this takes a while, I ran with smaller npairs for testing.

function count_pairs(; kmers = 1:4096, npairs=1_000_000)                        
    count_mat = spzeros(Int32, kmers.stop, kmers.stop)                          
    ## count pairs and update matrix                                            
    @showprogress 1 for i in 1:npairs                                           
      for (x,y) in get_pairs(kmers)                                             
        count_mat[x,y] += 1                                                     

This still takes quite a while, but it does fit in memory. You could parallelize this by keeping independent count matrices on each thread and then summing them after they had collected all their counts.

It is quite likely that mat_colsum and mat_rowsum will be zero, making the division undefined, you should probably add a +1 in there somewhere, maybe

tmp = (count_mat .* mat_sum) ./ (1 .+ mat_rowsum) ./ (1 .+ mat_colsum)

Then findnz should allow you to take the log of only the non-zero elements.

Once you get the pmi_mat defined, IterativeSolvers.jl can help with the SVD.

Hi @contradict, really appreciate your response. Since the 65,000x65,000 does not fit the memory, I just prepared an example for 4096x4096. I’ll edit the question and emphasize that.

Regarding sparse matrix, the count matrix might be sparse in the minimal working example but in real dataset, it’s not expected to be sparse.

Thanks for the tip about IterativeSolvers, let me check and see how it can handle svd for large matrices.

Is there some structure to the pairs? With with 500e6 uniformly random pairs, 500e6/(65e3^2)~11% of the entries will be non-zero. Probably much more since some will hopefully hit the same spot and make use of the sum.

I was able to run the sparse test on my laptop

julia> @btime count_pairs(;kmers=1:65000, npairs=1000);
Progress: 100%|███████████████████████████████████████████| Time: 0:00:05
Progress: 100%|███████████████████████████████████████████| Time: 0:00:05
Progress: 100%|███████████████████████████████████████████| Time: 0:00:05
Progress: 100%|█████████████████████████████████████████ld ██| Time: 0:00:05
  5.341 s (639080 allocations: 36.62 MiB)

Extrapolating to your full problem, that would take ~1 month on one core of my laptop. So with a 60 core machine, partition the set of pairs 60 ways, count each one, and sum up the resultant sparse matrices, it should take ~12hours. Which is still a long time.

It looks like most of the time is spent updating the sparse matrix, and I suspect that is because many updates cause a re-allocation of memory. It seems you should be able to give the sparse matrix constructor some hints about how many non-zero elements you expect to improve that, but I haven’t figured out how to do that yet.

Usually it’s faster to construct them in COO form. So you construct a vector I of row indices, J of column indices, and V of values, then do sparse(I, J, V, n, m) to create the matrix at the end. I don’t have much experience with such huge matrices so I’m not sure it applies here.

You could also look at using the Mmap.mmap to store a dense array on disk.

1 Like

Ah, perfect! The sparse constructor has a combine argument that defaults to +, so repeated initialization of the same entry sums.

function count_IJV(;kmers=1:65000, npairs=1_000_000)
    I = rand(kmers, npairs)
    J = rand(kmers, npairs)
    V = ones(Int32, npairs)                               
    sparse(I, J, V, kmers.stop, kmers.stop)                   

julia> @btime count_IJV();
  72.689 ms (25 allocations: 43.45 MiB)

julia> @btime cmat = count_IJV(;npairs=100_000_000);
  14.325 s (25 allocations: 4.09 GiB)

julia> cmat = count_IJV(;npairs=100_000_000);

julia> maximum(cmat)

julia> size(cmat.nzval)[1] / 65e3^2

That’s way better than a month! I only have 8GB, so I can’t do the full problem, but it looks like with 20-30GB this would work. Also, I was way off on my sparsity estimate before, I don’t know what I was thinking.

1 Like

Hi @ericphanson, thanks for your reply. I tried Mmap.mmap but not so in detail. The problem is that, I don’t know how performant it is to update memory mapped array 100 billion times. Does it write to disk each update? In that it might be a burden for the disk.

Hi, @contradict, thanks again for the help. I did second edit in the original post, count matrix is not sparse actually. In the second edit, I added some annotations on code, giving idea about problem size for real-life scenario.

The I, J and V approach is generating million random numbers and then binding them. That approach might introduce some sparsity but the actual approach is to:

  • pick 21 numbers
  • generate pairs (combinations of 2 out of 21, which is 210)

and this approach generates a pretty dense matrix.

One more point, svd() does not support sparse matrices as far as I tried.

ERROR: MethodError: no method matching svd(::SparseMatrixCSC{Int32,Int64})

You asked about, if there’s structure to pairs. Let me describe the case: this is actually skip-gram calculation from window of 21 words from which term-term co-occurrence is calculated. Normally, term-term matrices are sparse however, in my case the words are DNA sequences, thus corpus is limited to 65000 words, generating a dense matrix. As for pairs, there are some words with higher occurrence in genome, so some words will occur more often, they’re not uniformly distributed. I did weighted sampling and still it generates dense array.