Hi,

I would like to build a count matrix (expected to be 65,000x65,000), then compute PMI from it, and then compute an SVD of the PMI matrix.
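For scale, a quick back-of-the-envelope calculation of what the dense matrices alone would need:

```julia
n = 65_000
count_bytes = n^2 * sizeof(Int32)        # 4 bytes per entry
count_gib = count_bytes / 2^30           # ≈ 15.7 GiB for the counts alone
pmi_gib = n^2 * sizeof(Float64) / 2^30   # ≈ 31.5 GiB for a dense Float64 PMI matrix
```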

Unfortunately, most “large matrix” samples and tutorials start with `rand(n,n)`, where the whole matrix is generated at once. In that case HDF5 or Distributed works fine. But I don’t see examples where the matrix is updated frequently, in which case solutions like HDF5 might be very slow or inefficient.

Below is a MWE. The sample code picks 21 random numbers, generates all pairs, and then increments the count matrix for each pair (x, y). *Edit: just a reminder: the code below is only a minimal example; it runs without problems and fits in memory. But when I try to create a 65,000x65,000 matrix, Julia is killed, so I need a solution for the 65,000x65,000 case. I have annotated the code so that the sample does not mislead you; the real-life scenario is much bigger.* *Edit2: `count_mat` is not a sparse matrix.*

```julia
using LinearAlgebra
using StatsBase
using IterTools: subsets

# naive approach
kmers = 1:4096;                        # actually 1:65_000
count_mat = zeros(Int32, 4096, 4096);  # actually 65_000 x 65_000
# count_mat is NOT sparse

# sample 21 random kmers and generate all pairs, 210 pairs in total
function get_pairs(data)
    my_sample = sample(data, 21)
    collect(subsets(my_sample, 2))
end

## count pairs and update the matrix
for i in 1:1_000_000                   # actually 500_000_000
    for (x, y) in get_pairs(kmers)     # 210 pairs per loop cycle
        count_mat[x, y] += 1           # ~100 billion increments in the real case
    end
end
```

In the actual case, `count_mat` is expected to be 65,000x65,000 and the loop will run 500 million iterations (roughly 100 billion pair updates), so a distributed and parallel solution is needed. How can I achieve this for the much larger matrix?
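One direction I am considering is per-thread accumulators that are merged once at the end, so the hot loop needs no locks or atomics. A minimal, stdlib-only sketch with toy sizes (the structure here is my own guess at how this could be organized, not code I have run at scale):

```julia
using Base.Threads, Random

n = 4096       # toy size; real case is 65_000
iters = 1_000  # toy; real case is 500_000_000
# one private accumulator per thread, merged once at the end
locals = [zeros(Int32, n, n) for _ in 1:nthreads()]
@threads for t in 1:nthreads()
    acc = locals[t]
    rng = MersenneTwister(t)            # independent random stream per thread
    for _ in 1:cld(iters, nthreads())
        s = randperm(rng, n)[1:21]      # 21 distinct kmers
        for i in 1:20, j in i+1:21      # all 210 unordered pairs
            acc[s[i], s[j]] += 1
        end
    end
end
count_mat = reduce(+, locals)
```

The obvious catch is that the private copies multiply memory by `nthreads()`, so for the 65k case each accumulator would itself have to be restricted to a band of rows per worker, or disk-backed.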

I tried `SharedArrays` and `Distributed` for the toy example (4096x4096) and they worked fine, but they do not work with the 65k matrix. I guess I need a disk-based (or chunked) solution that allows frequent, parallel updates.

Just for completeness, here is the rest of the code, where the PMI matrix and the SVD (truncated to 100 dimensions) are computed.

```julia
## PMI and SVD
pmi_mat = zeros(Float64, 4096, 4096);  # actually 65_000 x 65_000
mat_sum = sum(count_mat);
mat_colsum = sum(count_mat, dims=1);   # 1×n
mat_rowsum = sum(count_mat, dims=2);   # n×1
rows, cols = size(count_mat)
@time for row in 1:rows
    for col in 1:cols
        tmp = (count_mat[row, col] * mat_sum) / (mat_rowsum[row] * mat_colsum[col])
        pmi_mat[row, col] = tmp == 0 ? 0.0 : log2(tmp)  # 0.0 keeps the element type stable
    end
end
U, S, V = svd(pmi_mat);
println(U[:, 1:100])                   # keep the first 100 dimensions
```
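As an aside, the PMI double loop can also be written with broadcasting. This produces the same matrix (toy sizes below), though it does not by itself solve the memory problem, since `pmi_mat` is still dense:

```julia
using Random

Random.seed!(0)
count_mat = rand(0:3, 256, 256)      # toy stand-in for the real counts
mat_sum = sum(count_mat)
mat_colsum = sum(count_mat, dims=1)  # 1×n row vector
mat_rowsum = sum(count_mat, dims=2)  # n×1 column vector
# broadcasting n×1 against 1×n yields the full n×n ratio matrix
ratio = (count_mat .* mat_sum) ./ (mat_rowsum .* mat_colsum)
pmi_mat = @. ifelse(ratio == 0, 0.0, log2(ratio))
```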

So now we have two more problems: building another large matrix (the dense PMI matrix) and computing an SVD of a larger-than-memory matrix. I hope there are practical solutions for all of these.
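For the truncated SVD specifically, since only the top 100 components are needed, I wonder whether a randomized sketch in the style of Halko, Martinsson & Tropp could avoid the full decomposition. A toy-sized sketch (all sizes and the oversampling of 10 are my arbitrary choices, and I use an exactly low-rank test matrix):

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, k = 500, 20                  # toy; real case n = 65_000, k = 100
A = randn(n, k) * randn(k, n)   # low-rank stand-in for pmi_mat
Ω = randn(n, k + 10)            # oversampled random test matrix
Y = A * Ω                       # range sketch: only needs products with A
Q = Matrix(qr(Y).Q)             # orthonormal basis for the range of Y
B = Q' * A                      # small (k+10)×n problem
F = svd(B)
U, S, V = Q * F.U, F.S, F.V     # approximate truncated SVD of A
```

The appeal for the larger-than-memory case is that `A` only enters through the products `A * Ω` and `Q' * A`, so both could be accumulated block-by-block from a disk-backed `pmi_mat`; `U[:, 1:k]`, `S[1:k]`, `V[:, 1:k]` then give the 100-dimensional truncation.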