Hi,
I would like to build a count matrix (expected to be 65,000x65,000), then calculate PMI on it, and then calculate an SVD.
Unfortunately, many "large matrix" samples and tutorials start with rand(n,n),
where the matrix is generated all at once. In that case, HDF5 or Distributed works fine. But I don't see examples where the matrix is updated frequently, in which case a solution like HDF5 might be very slow or inefficient.
Below is a MWE. The sample code picks 21 random numbers, generates pairs, and then updates the count matrix for each pair (x,y). Edit: just a reminder, the code below is only a minimal example; it should run without a problem and will fit in memory. But when I try to create a 65,000x65,000 matrix, Julia is killed, so I need a solution for the 65,000x65,000 case. I annotated the code so that the sample does not mislead you: this is just a minimal example, and the real-life scenario is much bigger. Edit2: count_mat is not a sparse matrix.
using LinearAlgebra
using StatsBase
using IterTools: subsets
# naive approach
kmers = 1:4096; # actually 1:65_000
count_mat = zeros(Int32, 4096,4096); # actually 65_000 x 65_000
# count_mat is NOT sparse
# get random 21 numbers and generate pairs, total of 210 pairs
function get_pairs(data)
    my_sample = sample(data, 21)
    collect(subsets(my_sample, 2))
end
## count pairs and update matrix
for i in 1:1_000_000                   # actually 500_000_000
    for (x, y) in get_pairs(kmers)     # 210 pairs per loop cycle
        count_mat[x, y] += 1           # in total this update runs ~100 billion times
    end
end
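For the record, here is the kind of parallel counting I have in mind for the toy size (a sketch, not tested at full scale): one accumulator per task, merged at the end, so no two tasks ever write to the same matrix. Note that at full size a per-task dense 65k x 65k copy would not fit in RAM, so there the accumulators would have to be dicts or disk-backed chunks instead.

```julia
using Base.Threads
using StatsBase
using IterTools: subsets

const N = 4096                               # toy size; actually 65_000

function get_pairs(data)
    my_sample = sample(data, 21)
    collect(subsets(my_sample, 2))           # 210 pairs
end

# One dense accumulator per task avoids data races without locks.
# Iterations are assigned by stride (t, t+nt, t+2nt, ...), so each
# accumulator is owned by exactly one task regardless of scheduling.
function count_pairs(kmers, niter)
    nt = nthreads()
    partials = [zeros(Int32, N, N) for _ in 1:nt]
    @threads for t in 1:nt
        local_mat = partials[t]
        for i in t:nt:niter
            for (x, y) in get_pairs(kmers)
                local_mat[x, y] += Int32(1)
            end
        end
    end
    reduce(+, partials)                      # merge the per-task counts
end

count_mat = count_pairs(1:N, 10_000)
```

Each iteration adds exactly 210 counts, so sum(count_mat) should equal 210 * niter regardless of thread count.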
In the actual case, count_mat is expected to be 65,000x65,000 and the loop will go over 500 million samples (~100 billion pair updates in total), so a distributed and parallel solution is needed. How can I achieve this for the much larger matrix?
I tried SharedArrays and Distributed for the toy example (4096x4096) and that was fine, but it does not work with the 65k matrix. I guess I need a disk-based (or chunked) solution that allows frequent and parallel updates.
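One stdlib direction I am considering for "disk-based with frequent updates" is Mmap: back the Int32 matrix with a file (65,000^2 x 4 bytes is about 16.9 GB) and let the OS page hot chunks in and out. A minimal sketch at toy size (the file name is made up, and parallel writers would still need coordination, e.g. one writer per row range):

```julia
using Mmap

n = 4096                          # actually 65_000 → ~16.9 GB of Int32 on disk
io = open("count_mat.bin", "w+")  # hypothetical file name
count_mat = Mmap.mmap(io, Matrix{Int32}, (n, n))  # file is grown and zero-filled

# Updates look exactly like the in-memory version; the OS keeps
# only the recently touched pages of the file in RAM.
count_mat[123, 456] += Int32(1)

Mmap.sync!(count_mat)             # flush dirty pages to disk
close(io)
```

Random access patterns like the pair updates above will thrash the page cache, so it may pay to buffer pairs and apply them sorted by row.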
Just for completeness, here's the rest of the code, where the PMI and the SVD (truncated to 100 dimensions) are calculated.
## pmi and svd
pmi_mat = zeros(Float64, 4096, 4096);
mat_sum = sum(count_mat);
mat_colsum = sum(count_mat, dims=1);
mat_rowsum = sum(count_mat, dims=2);
rows, cols = size(count_mat)
@time for row in 1:rows
    for col in 1:cols
        tmp = (count_mat[row, col] * mat_sum) / (mat_rowsum[row] * mat_colsum[col])
        pmi_mat[row, col] = tmp == 0 ? 0.0 : log2(tmp)
    end
end
F = svd(pmi_mat);
U, S, V = F;
println(U[:, 1:100])
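Since I only need 100 dimensions, a full dense svd should not be necessary. A randomized truncated SVD only needs products of the matrix with tall-skinny matrices, which could be streamed block-by-block from disk. Here is a stdlib-only sketch (Halko-style; the oversampling and power-iteration counts are arbitrary choices, not tuned):

```julia
using LinearAlgebra
using Random

# Randomized truncated SVD: A ≈ U * Diagonal(S) * V'.
# Only A * Ω and A' * Q are needed, so A can live out of core
# and be streamed in row blocks for those two products.
function rsvd(A, k::Int; oversample::Int = 10, power::Int = 2)
    n = size(A, 2)
    Ω = randn(n, k + oversample)          # random test matrix
    Q = Matrix(qr(A * Ω).Q)               # orthonormal basis for the range
    for _ in 1:power                      # power iterations sharpen the basis
        Q = Matrix(qr(A' * Q).Q)
        Q = Matrix(qr(A * Q).Q)
    end
    B = Q' * A                            # small (k+oversample) x n matrix
    F = svd(B)                            # cheap dense SVD of the small matrix
    return (Q * F.U)[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

# usage on the toy pmi_mat from above would be:
#   U, S, V = rsvd(pmi_mat, 100)
```

For an exactly low-rank matrix this recovers the factorization to near machine precision; for a full-rank PMI matrix it gives the usual randomized approximation of the top 100 components.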
So now we have two more problems: calculating another large matrix (the PMI matrix) and computing an SVD of a larger-than-memory matrix. I hope there are practical solutions for all of these.