I would like to calculate a count matrix (expected to be 65,000x65,000), then compute PMI on it, and then calculate an SVD of the result.
Unfortunately, many “large matrix” samples and tutorials start with
`rand(n, n)`, where the matrix is generated all at once. In that case, HDF5 or Distributed works fine. But I don’t see examples where the matrix is updated frequently, in which case solutions like HDF5 might be very slow or inefficient.
Below is a MWE. The sample code picks 21 random numbers, generates all pairs of them, and then updates the count matrix for each pair (x, y). Edit: just a reminder: the code below is only a minimal example; it should run without a problem and fits in memory. But when I try to create a 65,000x65,000 matrix, Julia is killed, so I need a solution for the 65,000x65,000 case. I annotated the code so that the sample does not mislead you; this is just a minimal example, and the real-life scenario is much bigger. Edit 2: count_mat is not a sparse matrix.
```julia
using LinearAlgebra
using StatsBase
using IterTools: subsets

# naive approach
kmers = 1:4096;                        # actually 1:65_000
count_mat = zeros(Int32, 4096, 4096);  # actually 65_000 x 65_000
# count_mat is NOT sparse

# get 21 random numbers and generate pairs, 210 pairs in total
function get_pairs(data)
    my_sample = sample(data, 21)
    [i for i in subsets(my_sample, 2)]
end

## count pairs and update matrix
for i in 1:1_000_000                   # actually 500_000_000
    for (x, y) in get_pairs(kmers)     # 210 pairs per loop cycle
        count_mat[x, y] += 1           # actually this process will be performed ~100 billion times
    end
end
```
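To make “disk-based” concrete, here is a memory-mapped variant of the loop above that I sketched with the `Mmap` standard library (the file name `counts.bin` is just a placeholder, and I replaced `sample`/`subsets` with stdlib `rand` plus a comprehension so the snippet is self-contained). The OS pages the file in and out on demand, so the full 65,000x65,000 array never has to fit in RAM; whether random writes over a ~17 GB mapping stay fast enough is exactly what I am unsure about:

```julia
using Mmap

n = 4096                                           # actually 65_000
io = open("counts.bin", "w+")                      # placeholder file name
count_mat = Mmap.mmap(io, Matrix{Int32}, (n, n))   # file-backed; pages load on demand
count_mat .= Int32(0)                              # initialize the backing file

# stdlib-only stand-in for sample/subsets: 21 random kmers, 210 pairs
function get_pairs(n)
    s = rand(1:n, 21)
    [(s[a], s[b]) for a in 1:20 for b in a+1:21]
end

for i in 1:1_000                                   # shortened for the sketch
    for (x, y) in get_pairs(n)
        count_mat[x, y] += Int32(1)                # in-place update hits the mapped pages
    end
end

Mmap.sync!(count_mat)                              # flush dirty pages to disk
close(io)
```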
In the actual case, count_mat is expected to be 65,000x65,000 and the loop will run 500 million times (about 100 billion pair updates), so a distributed and parallel solution is needed. How can I achieve this for the much larger matrix?
I tried Distributed for the toy example (4096x4096) and it was fine, but it does not work with the 65k matrix. I guess I need a disk-based (or chunked) solution that allows frequent, parallel updates.
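What I mean by “chunked” updates, roughly: instead of every worker writing into the big matrix directly, each thread accumulates its pairs into a small local `Dict` and the dictionaries are merged into the matrix in a single pass, so the big array sees one write per distinct pair rather than one per observation. This is only a sketch with threads rather than Distributed (`:static` scheduling so each task stays on its thread, and `rand` standing in for `sample`/`subsets` to keep it self-contained):

```julia
using Base.Threads

n = 4096
count_mat = zeros(Int32, n, n)          # stand-in for the disk-backed matrix

function get_pairs(n)
    s = rand(1:n, 21)
    [(s[a], s[b]) for a in 1:20 for b in a+1:21]
end

# one private accumulator per thread -> no locking on the hot path
locals = [Dict{Tuple{Int,Int},Int32}() for _ in 1:nthreads()]

@threads :static for i in 1:10_000
    d = locals[threadid()]              # safe with :static scheduling
    for (x, y) in get_pairs(n)
        d[(x, y)] = get(d, (x, y), Int32(0)) + Int32(1)
    end
end

# single-threaded merge: one matrix write per distinct pair
for d in locals, ((x, y), c) in d
    count_mat[x, y] += c
end
```

With the merge step serialized, the same idea should carry over to Distributed workers sending their dictionaries (or sorted triplet lists) back for a chunk-by-chunk flush into a disk-backed matrix.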
Just for completeness, here’s the rest of the code, where PMI and the SVD (truncated to 100 dimensions) are calculated.
```julia
## pmi and svd
pmi_mat = zeros(Float64, 4096, 4096);
mat_sum    = sum(count_mat);
mat_colsum = sum(count_mat, dims=1);
mat_rowsum = sum(count_mat, dims=2);
rows, cols = size(count_mat)

@time for row in 1:rows
    for col in 1:cols
        tmp = (count_mat[row, col] * mat_sum) / (mat_rowsum[row] * mat_colsum[col])
        tmp_res = tmp == 0 ? 0 : log2(tmp)
        pmi_mat[row, col] = tmp_res;
    end
end

F = svd(pmi_mat);
U, S, V = F;
println(U[:, 1:100])
```
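Since the PMI matrix only depends on one row of counts plus the precomputed row/column sums, I think it can be produced blockwise, with only one row block of the Float64 result in RAM at a time (in the real case, `pmi_mat` would be a second mmap or HDF5 dataset rather than an in-memory array). A vectorized sketch of that, with stand-in data so it runs on its own:

```julia
n = 4096
count_mat = rand(Int32(0):Int32(5), n, n)   # stand-in data for the sketch

mat_sum    = sum(count_mat)                 # sum() widens Int32 to Int
mat_colsum = vec(sum(count_mat, dims=1))
mat_rowsum = vec(sum(count_mat, dims=2))

pmi_mat = zeros(Float64, n, n)              # real case: disk-backed output

bs = 512                                    # rows per block
for r0 in 1:bs:n
    r = r0:min(r0 + bs - 1, n)
    # broadcast over the whole block; zero counts map to PMI 0, as in the loop version
    block = (count_mat[r, :] .* mat_sum) ./ (mat_rowsum[r] .* mat_colsum')
    pmi_mat[r, :] .= ifelse.(block .== 0, 0.0, log2.(block))
end
```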
So now we have two more problems: calculating another large matrix (the PMI matrix) and calculating an SVD of a larger-than-memory matrix. I hope there are practical solutions for all of them.
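For the truncated SVD, I believe packages such as Arpack.jl (`svds`) or TSVD.jl are the practical route. To show why the access pattern matters, here is a plain-`LinearAlgebra` randomized SVD sketch (in the style of Halko et al.): it touches the matrix only through products `A * X` and `A' * X`, which could in principle be evaluated block by block against a disk-backed matrix. This is a sketch, not production code:

```julia
using LinearAlgebra

# Randomized truncated SVD: only needs A*X and A'*X products.
function rsvd(A, k; oversample=10, power=2)
    m, n = size(A)
    l = min(n, k + oversample)
    Q = Matrix(qr(A * randn(n, l)).Q)       # sketch the range of A
    for _ in 1:power                        # power iterations sharpen the spectrum
        Q = Matrix(qr(A' * Q).Q)
        Q = Matrix(qr(A * Q).Q)
    end
    B = Q' * A                              # small l x n projected matrix
    F = svd(B)                              # cheap dense SVD of the small matrix
    U = Q * F.U
    return U[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

# toy usage on a small dense matrix standing in for pmi_mat
A = randn(300, 200)
U, S, V = rsvd(A, 20)
```

For the 65,000x65,000 case, `A * X` would be accumulated over row blocks read from disk, so the full Float64 matrix never needs to be resident; only the m x (k + oversample) factors do.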