How to efficiently construct a large SparseArray? Packages for this?

Problem

Does Julia have an efficient way of constructing a huge sparse matrix from a given list of entries u,v,w, where some entries can share a location u,v, in which case their weights w must be summed? That is, u,v,w are input vectors and I wish to create a sparse matrix that has value w[i] at position u[i],v[i], with duplicates accumulated. For instance, the Mathematica code

n=10^6;   m=500*n;   
u=RandomInteger[{1,n},m];   
v=RandomInteger[{1,n},m];   
w=RandomInteger[{-9, 9}, m];   AbsoluteTiming[
SetSystemOptions["SparseArrayOptions"->{"TreatRepeatedEntries"->1}]; 
a= SparseArray[{u,v}\[Transpose] -> w, {n,n}]; ]

requires 135sec and 60GB of RAM. The equivalent Python code

import scipy.sparse as sp   
import numpy as np
import time
def ti(): return time.perf_counter() 
n=10**6; m=500*n;
u=np.random.randint(0,n,size=m);            
v=np.random.randint(0,n,size=m);            
w=np.random.randint(-9,10,size=m); t0=ti(); 
a=sp.csr_matrix((w,(u,v)),shape=(n,n),dtype=int); t1=ti(); print(t1-t0)

needs 36sec and 20GB, but doesn’t support exact fractions (point 2 in the Background below). The equivalent Julia code

using SparseArrays;
m=n=10^6; r=500*n;   
u=rand(1:m,r);   
v=rand(1:n,r);   
w=rand(-9:9,r);
f(u, v, w, m, n) = sparse(u, v, w, m, n, +);
@time f(u, v, w, m, n)

needs 155sec and 26GB (the @time macro reports only 15GB, but it measures cumulative allocations rather than peak resident memory).
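One way to check the true peak (a small sketch; Sys.maxrss() returns the OS-reported high-water mark of the process’s resident memory):

f()  # after the construction
println("peak RSS: ", Sys.maxrss() / 2^30, " GiB")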

Is there a way to make this construction more efficient?
Are there any Julia packages that are better at this?

Background

I have created a package for linear algebra and homological algebra computations. I did it in Mathematica, since the SparseArray implementation there is

  1. very efficient (fast and low RAM usage),
  2. supports exact fractions as matrix entries,
  3. supports polynomials as matrix entries.

However, it is not

  1. parallelized,
  2. open source,
  3. GPU or cluster enabled.

I am considering migrating to a different programming language. I have tried Python’s scipy.sparse, but it doesn’t satisfy (2), and I’m not sure whether C++ libraries do. As a test, I sorted an array of 10^9 floats and compared the performance of Mathematica, Python numpy, and Julia ThreadsX; I was incredibly impressed by the latter (3x faster than numpy, 10x faster than Mathematica). I am considering migrating to Julia, but I first wish to make sure that my library would perform better than it does in Mathematica.

P.S. I also asked this question here.


Very interesting.

They must have used a very efficient algorithm. But in the above, where does the addition happen? In the docs, I didn’t see anything (I didn’t look very hard) that says it will sum the values when the u,v coordinates are duplicated.

If you didn’t have to sum up then the algorithm could be faster.

But I am looking at this problem now and thinking about it.

No, the documentation explicitly says that it will sum up: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html. Search for “duplicate” in the documentation and you get:

Duplicate entries are summed together:

No, I think they just use a standard algorithm followed by a reduction that sums duplicate items (check coo_tocsr, the workhorse of the algorithm, which comprises fewer than 100 lines).
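Roughly, the idea is a counting-sort COO→CSR pass. A minimal Julia sketch of that idea (not scipy’s actual code; scipy sorts within rows and sums adjacent duplicates in separate passes afterwards):

# Minimal sketch of the counting-sort COO->CSR conversion;
# u = row indices (1-based), v = column indices, w = values.
function coo_to_csr(u, v, w, nrows)
    nnz = length(w)
    rowptr = zeros(Int, nrows + 1)
    for i in 1:nnz                  # count entries per row
        rowptr[u[i] + 1] += 1
    end
    cumsum!(rowptr, rowptr)
    rowptr .+= 1                    # rowptr[r] = index of the first entry of row r
    colind = similar(v); vals = similar(w)
    next = rowptr[1:nrows]          # next free slot in each row
    for i in 1:nnz                  # scatter the entries into row order
        p = next[u[i]]
        colind[p] = v[i]; vals[p] = w[i]
        next[u[i]] += 1
    end
    rowptr, colind, vals
end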


I can’t reproduce this performance in my Python; I get closer to 200s. Hmm, and I think it’s quite variable, perhaps due to garbage collection in Python?

FWIW, if you change the types in Julia you might get better performance, e.g. change the indices from 64-bit to 32-bit integers.
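For example (one way to generate the test data with narrower types):

u = rand(Int32(1):Int32(m), r)   # Int32 indices: half the memory of Int64
v = rand(Int32(1):Int32(n), r)
w = rand(Int16(-9):Int16(9), r)  # the weights easily fit in Int16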

But I suspect the main bottleneck is the reduce step inside sparse.

(screenshot of the relevant source omitted)

Update: I just noticed a major error in my algorithm, but I think the general idea might still improve things. Fixed now.

I got 27s using my approach vs 133s for the sparse approach.

I think my approach will be faster, but it’s very hacky and only works if u, v, w together fit in a 64-bit value, which may not be true if you want to use fractions.

I think I made a mistake in the algorithm somewhere, but couldn’t be bothered to track down where the issue might be (that’s why I said I think it’s faster; I’m not sure it still will be once the bug is fixed, but I think it works). Hopefully, the general approach is clear.

Most of the time will be spent grouping by u,v and computing the colptr, row indices, and values for the sparse matrix.

But you can sort much faster using radix sort if you use a bit-twiddling technique to “fuse” the u, v, w values into a single 64-bit value; sorting that and unfusing to get back u, v, w is much faster than grouping by u, v and summing w directly.


using SparseArrays, SortingAlgorithms   # RadixSort comes from SortingAlgorithms.jl

m = n = 10^6; r = 500*n;
u = rand(Int32(1):Int32(m), r);
v = rand(Int32(1):Int32(n), r);
w = rand(Int16(-9):Int16(9), r);


# Bits needed to store indices up to m and n, respectively.
const udigits = ceil(Int, log2(m+1))
const vdigits = ceil(Int, log2(n+1))

# Pack (v, u, w) into one UInt64: v in the high bits, u in the middle, and
# w (offset by +9 so it is non-negative) in the low 8 bits. Sorting the keys
# then sorts by column v, then row u, with the value carried along.
function fuse(u, v, w)
    (((UInt64(v) << udigits) | UInt32(u)) << 8) | UInt8(w + 9)
end

function unfuse(vuw)
    w = vuw & typemax(UInt8)
    u = (vuw >> 8) & UInt32(2^udigits - 1)
    v = (vuw >> (8 + udigits)) & UInt32(2^vdigits - 1)
    Int32(u), Int32(v), Int8(w) - Int8(9)
end


@time vuw = fuse.(u, v, w);       # ~1.5s
@time sort!(vuw, alg=RadixSort);  # ~15s


function faster_sparse(u, v, w, m, n)
    vuw = fuse.(u, v, w)
    sort!(vuw, alg=RadixSort)       # sorted by column v, then row u

    colptr = zeros(Int32, n+1)      # one pointer per column, plus one
    l = m*10                        # initial capacity guess; doubled as needed
    rowind = zeros(Int32, l)
    vals = zeros(Int16, l)          # duplicate sums are assumed to fit in Int16

    last_u1, last_v1, w1 = unfuse(vuw[1])

    rowind[1] = last_u1
    vals[1] = w1
    colptr[1:last_v1] .= 1          # columns before the first entry are empty

    i = 1
    for vuw1 in Iterators.drop(vuw, 1)
        # the data is already sorted by v1, then u1
        u1, v1, w1 = unfuse(vuw1)

        if (last_u1 != u1) || (last_v1 != v1)
            # new (row, column) location: append an entry
            i += 1
            if i > l
                # double the buffers
                resize!(rowind, l*2)
                resize!(vals, l*2)
                l = l*2
            end
            rowind[i] = u1
            vals[i] = w1
            last_u1 = u1
        else
            # repeated location: accumulate the weight
            vals[i] += w1
        end

        # if the column also switched
        if last_v1 != v1
            # columns last_v1+1 .. v1 all start at entry i; filling the whole
            # range (not just last_v1+1) also handles empty columns
            colptr[(last_v1+1):v1] .= i
            last_v1 = v1
        end
    end
    resize!(rowind, i)
    resize!(vals, i)
    colptr[(last_v1+1):end] .= i+1  # close off the trailing columns

    SparseMatrixCSC{Int16, Int32}(m, n, colptr, rowind, vals)
end

@time a2 = faster_sparse(u, v, w, m, n)
@time a = sparse(u, v, w, m, n, + )
a2==a # true

I don’t think it needs to be so complicated. You can just follow the standard algorithm scipy uses and should get slightly better performance than scipy.

Hmm, OK, but I can’t replicate the scipy number that the OP mentioned.

Can you? I got closer to 100s. If that’s confirmed, then my solution is about 4x faster than scipy’s for this very specialized data example.

My computer has little RAM (8GB), so I can’t run the code efficiently here (the vectors u,v,w alone use more than 8GB)… Maybe the difference is caused by the hardware. I guess the OP has more RAM than you, and scipy consumes much more memory than Julia to get better performance. Once you use up all your RAM, performance drops.

Possibly. But perhaps the OP can try my algorithm and see if there are improvements.

Or a typo, with a leading 1 missing?

I tried comparing Julia (with Int32 now, as you suggested) and scipy; I get:
Python 3.8.10, scipy 1.5.4: 20.5GB, 36s
Julia 1.5.3: 13.2GB, 117s

This is on my laptop HP ZBook 17 G5 with 6-core i7-8850H CPU processor, 64GB DDR4 2666MT/s RAM, PCIe NVMe TLC SSD disk, and Linux Mint 20 Kernel 5.4.0-77-generic x86_64.

I am pleased with the reduced RAM consumption, but it is still slow. Next, I will try out your new algorithm and report…

I also have 64GB RAM but I can’t replicate your scipy numbers.

Here is the screenshot: (image omitted)

As for your new algorithm, I’m excited that you tried improving the current sparse function : ). Your code above returned:

2.015481 seconds (293.93 k allocations: 3.740 GiB, 0.98% gc time)
12.143673 seconds (1.42 M allocations: 3.798 GiB, 0.66% gc time)
21.388375 seconds (222.07 k allocations: 11.042 GiB, 3.87% gc time)
108.455555 seconds (594.07 k allocations: 5.628 GiB, 0.02% gc time)
true

After running only `faster_sparse`, I am delighted to report that it used only 13.2GB and 22s!!! : )

Since I have just begun learning Julia, I don’t yet know what the code is doing. Would it be possible to make faster_sparse not depend on the external udigits, vdigits (i.e. a complete substitute for the built-in sparse)? Does it accept fractions as entries?

What happens when the limits of Int32 or Int16 are exceeded? Does your function silently return a wrong result? Does it crash and return nothing? Does it just upgrade to Int64 or Int32, use more RAM, and return the correct result?

By any chance, would it be possible to improve it further by using multiprocessing/multithreading? Would that use significantly more RAM?

faster_sparse is a very quick and dirty implementation. I think it’s possible to make it handle large indices, because there’s always UInt128 to pack things into. udigits and vdigits can be generalized.
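For instance, the widths could be computed per call, switching the key type when 64 bits aren’t enough (a hypothetical, untested sketch):

# Hypothetical generalization: compute the bit widths per call and pick a
# key type wide enough for the packed (v, u, w) triple.
function make_fuser(m::Integer, n::Integer, wbits::Int)
    ubits = 8 * sizeof(m) - leading_zeros(m)    # bits needed for u in 1:m
    vbits = 8 * sizeof(n) - leading_zeros(n)    # bits needed for v in 1:n
    T = (ubits + vbits + wbits) <= 64 ? UInt64 : UInt128
    fuse(u, v, w) = (((T(v) << ubits) | T(u)) << wbits) | T(w)
    function unfuse(key)
        w = key & ((T(1) << wbits) - 1)
        u = (key >> wbits) & ((T(1) << ubits) - 1)
        v = key >> (wbits + ubits)
        Int(u), Int(v), Int(w)    # the caller still applies the value offset to w
    end
    return fuse, unfuse
end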

To accept fractions: what are your fractions represented with? If the numerator and denominator are both small values, then it can be done, but the code needs modification.

Now I am wondering why scipy is so fast without the tricks I use. Hmmm.

Indices u,v will never exceed 10^9, so Int32 will always be sufficient.
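(Indeed, a quick check:)

julia> typemax(Int32)   # comfortably above 10^9
2147483647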

As for values, the typical use-case in linear algebra and homological algebra (computing rank or the bases for kernel and cokernel) is the following.

I start with a very large (e.g. 10^8 x 10^8) matrix with very low density (e.g. 10 entries per column). Often, these initial entries are all just 1 or -1. Then, iteratively, I reduce the matrix to a smaller size (e.g. 10^6 x 10^6) with increased density. There are two options: working over the ring of integers or over a field of fractions. In the first case, the matrix values typically remain small; in the second, the fractions can become quite large.

It would be helpful if it were possible to increase the datatype during computations, when needed. Or if Julia even did that automatically, without burning unnecessary RAM.

Do you need more than Rational{Int16}? And how many times do you have to create the matrix? If a lot, then using the tricks I showed and customizing them for fractions would help, even if you need to use UInt128.
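For instance, if the entries fit in Rational{Int8}, the value could in principle be packed into 16 low bits of the sort key (a hypothetical sketch; overflow of such narrow rationals when summing duplicates is the obvious risk, so widening during accumulation may be safer):

# Hypothetical: pack a small rational's numerator and denominator into 16
# bits, to sit in the low bits of the fused sort key.
fuse_rat(q::Rational{Int8}) =
    (UInt16(reinterpret(UInt8, numerator(q))) << 8) |
    UInt16(reinterpret(UInt8, denominator(q)))

unfuse_rat(bits::UInt16) =
    reinterpret(Int8, UInt8(bits >> 8)) // reinterpret(Int8, UInt8(bits & 0xff))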

I wouldn’t think anything like this will happen automatically. You need to check yourself, I think.

In the beginning (starting with a 10^8 x 10^8 matrix of 1s and -1s), an iteration can remove 99% of rows/columns and construct a smaller matrix. In later stages, as the matrix becomes smaller/denser (10^6 x 10^6, 10^5 x 10^5, or lower), an iteration may remove only 10% of rows/columns. In total, there are typically between 10 and 100 iterations.

If I work over rationals, then sometimes, the numerators/denominators might exceed Int16, but this happens after a few iterations, when the matrix is smaller/denser.

It would be ideal if the datatype of the entries could be adjusted during calculations (even manually), so as not to burn through all the RAM at the start, and not to overflow/crash in later iterations.

Anyway, if you think the approach is promising for you, I can show you how to modify it for rationals.

But you might as well just switch to a larger rational size once the matrix is small enough that you are sure to have enough RAM. That way you don’t have to over-engineer this and worry about “automatic” promotions and such.

The approach is very promising, so yes, I would appreciate it if you could finish your function for rationals.

Do I convert to a wider datatype by running a=convert(Int32,a)? Hopefully this changes only the entries?

Also, is there any chance for further improvements with parallelization?

I can replicate your numbers if I increase the RAM for Linux. So that indeed seems to be the issue.

a = Int32.(a)

would work. But you might want to set up some function barriers to make sure your code is still efficient.
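Something like this is what I mean (a sketch; sum_entries is just a hypothetical stand-in for your real hot loop):

using SparseArrays

# Keep the hot loops in a function that specializes on the element type,
# and switch the eltype in the caller when values threaten to overflow.
function sum_entries(a::SparseMatrixCSC{T}) where {T}
    s = zero(T)
    for x in nonzeros(a)    # stand-in for the real reduction loop
        s += x
    end
    s
end

a = Int64.(a)       # widen the stored entries
sum_entries(a)      # Julia compiles a method specialized for the new eltype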

Can you have a crack at understanding it and see which bits you don’t understand? Because you might be able to replicate the approach yourself once you understand the tricks. Happy to answer any questions you have.

I will certainly try to comprehend your function, tweak it, and perhaps even try to parallelize it. But first I should get more comfortable with Julia; I’ve only used it for two weeks.

Hopefully, when I have questions regarding sparse (which is sure to happen), you will be able to reply. Please give me a few days/weeks and I’ll get back to you. Also, thank you very much for your help! : )
