# Shuffle values and sample indices of a sparse matrix

I need to shuffle entries in selected `columns` of a large sparse matrix `X` with dimensions `M x N`. I have written a function that does this in place, as I need to perform the shuffling repeatedly on the same data.

``````
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
    M, N = size(X)
    for col in columns
        colptr_l = X.colptr[col]
        colptr_u = X.colptr[col+1] - 1
        rowvals = @view X.rowval[colptr_l:colptr_u]
        nzvals = @view X.nzval[colptr_l:colptr_u]
        sample!(1:M, rowvals; replace=false, ordered=true)
        shuffle!(nzvals)
    end
end
``````

This is significantly faster than slicing the array with numpy/scipy, but falls behind scikit-learn's implementation, `shuffle`, as the size of `X` increases (even though `X` is automatically converted to CSR format there and manually converted back to CSC).

Below is a (hidden) MWE and some benchmark results.

Load packages, define functions and benchmark
``````
using BenchmarkTools
using PyCall
using Random
using SparseArrays
using StatsBase

"""
shuffle_columns!(X, cols)

Entries in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_columns!(X, cols)
for col in cols
shuffle!(@view X[:,col])
end
end

"""
shuffle_sparse_columns!(X::SparseMatrixCSC, columns)

The nonzero entires in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
M, N = size(X)
for col in columns
colptr_l = X.colptr[col]
colptr_u = X.colptr[col+1] - 1
rowvals = @view X.rowval[colptr_l:colptr_u]
nzvals = @view X.nzval[colptr_l:colptr_u]
sample!(1:M, rowvals; replace=false, ordered=true)
shuffle!(nzvals)
end
end

py"""
import numpy as np
from sklearn.utils import shuffle
from scipy.sparse import csc_matrix, csr_matrix

def sklearn_shuffle_and_convert_csc(X):
X_shuffled = shuffle(X)
return csc_matrix(X_shuffled)

def py_shuffle_col(X, columns):
(M, N) = X.shape
for col in columns:
colptr_l = X.indptr[col]
colptr_u = X.indptr[col+1]
X.indices[colptr_l:colptr_u] = np.random.choice(M, colptr_u - colptr_l, replace=False)
X.indices[colptr_l:colptr_u].sort()
np.random.shuffle(X.data[colptr_l:colptr_u])
return X
"""

function bench(M, N, p, cols=nothing)
    # Benchmark Julia
    Random.seed!(0);
    X = sprand(M, N, p)
    println("Shuffle each column of a CSC matrix in Julia")
    @btime shuffle_sparse_columns!($X, $cols)

    # Benchmark sklearn (CSC -> CSR -> CSC)
    Random.seed!(0);
    X = sprand(M, N, p)
    py_sklearn_shuffle_and_convert = py"sklearn_shuffle_and_convert_csc"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with sklearn (to CSR) and convert to CSC")
    @btime $py_sklearn_shuffle_and_convert($X_py_csc)

    # Benchmark sklearn (CSC -> CSR)
    py_sklearn_shuffle = py"shuffle"
    X_py_csc = py"csc_matrix"(X)  # Restore original matrix
    println("Shuffle CSC with sklearn")
    @btime $py_sklearn_shuffle($X_py_csc)

    # Benchmark sklearn (CSR)
    X_py_csr = py"csr_matrix"(X)
    println("Shuffle CSR with sklearn")
    @btime $py_sklearn_shuffle($X_py_csr)

    # Benchmark numpy slicing (CSC) of rowval and nzval with 0-indexed columns
    py_shuffle_col = py"py_shuffle_col"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with numpy slicing")
    @btime $py_shuffle_col($X_py_csc, $(cols .- 1))

    # Dense shuffle for comparison
    Xdense = Array(X)
    println("Dense shuffle")
    @btime shuffle_columns!($Xdense, $cols)
end
``````
Small `X`:
``````
M, N, p = 50, 10, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
444.879 ns (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
354.833 μs (3 allocations: 144 bytes)
Shuffle CSC with sklearn
251.791 μs (3 allocations: 144 bytes)
Shuffle CSR with sklearn
174.165 μs (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
203.838 μs (7 allocations: 448 bytes)
Dense shuffle
5.475 μs (0 allocations: 0 bytes)
``````

Large `X`:

``````
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
80.976 ms (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
12.371 ms (3 allocations: 144 bytes)
Shuffle CSC with sklearn
5.398 ms (3 allocations: 144 bytes)
Shuffle CSR with sklearn
1.432 ms (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
1.020 s (7 allocations: 448 bytes)
Dense shuffle
611.732 ms (0 allocations: 0 bytes)
``````

I have some questions:

1. Is there a faster way to `sample!` from the row indices `1:M` than is done in `shuffle_sparse_columns!`?
2. Is the CSR format more efficient for `M < N`? If so, are there any suggestions on how to adapt `shuffle_sparse_columns!` for a custom “`SparseMatrixCSR`”?

Slightly unrelated:

1. Is there a suitable way to benchmark memory usage/allocations with PyCall?

Any suggestions on how to make the performance comparable with scikit-learn's are appreciated.

First, I would try profiling the code to see where the bottleneck is; I find ProfileVega.jl helpful for this.

1 Like

`sample!` is the bottleneck for this matrix size.

``````
using ProfileVega
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
columns = 1:N
ProfileVega.@profview shuffle_sparse_columns!(X, columns)
``````

About 90% of the backtraces are in `seqsample_a!` in StatsBase. Most of the time is spent in the innermost `while` loop.

The smaller example ends up using `seqsample_c!`.

If sampling *with* replacement from `1:M` is acceptable, you can try `rand!(rowvals, 1:M)` and then order the result in place with `sort!`.

I’m not aware of any function in the standard library that samples without replacement.

``````
julia> Random.seed!(0); x = fill(0, 5); rand!(x, 1:6)
5-element Array{Int64,1}:
 5
 1
 3
 1
 6
``````
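
Continuing the example, `sort!` then orders it in place; note the duplicate `1`, which is exactly the with-replacement behavior mentioned above:

``````
julia> sort!(x)
5-element Array{Int64,1}:
 1
 1
 3
 5
 6
``````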

What is the typical length of `rowvals` (i.e. the typical number of nonzeros per column)? If it is very small compared to `M`, there might be a much faster sampling algorithm.

1 Like

Is it clear that `sklearn.utils.shuffle` is doing the same thing? The documentation seems rather vague.

Looking at the source code, it seems to be using a much simpler algorithm: a single shuffle of the rows.

3 Likes

Nice catch! `sklearn.utils.shuffle` is just shuffling the rows instead of shuffling the row values in each column.

``````
M = 5; N = 3;
X = zeros(Int, M, N);
for idx in eachindex(X)
    X[idx] = idx
end
X_py_csc = py"csc_matrix"(X)
``````
``````julia> py"shuffle"(X_py_csc).A
5×3 Array{Int64,2}:
5  10  15
3   8  13
1   6  11
4   9  14
2   7  12
``````

I’ve got a TF-IDF matrix of size 3500 x 44917 from a corpus of 3500 documents. Later, I may need to scale up to ~100,000 documents.

``````
Lengths of rowval:

50-th percentile: 1
51-th percentile: 2
75-th percentile: 7
90-th percentile: 50
95-th percentile: 152
99-th percentile: 805
Average: 37, density: 0.011
``````
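
For reference, a sketch of how such statistics can be computed directly from the CSC structure (assuming `X` is the TF-IDF matrix; this is not necessarily the exact script that produced the output above):

``````
using SparseArrays, Statistics

nnz_per_col = diff(X.colptr)  # number of stored entries per column

for q in (0.50, 0.51, 0.75, 0.90, 0.95, 0.99)
    println(round(Int, 100q), "-th percentile: ", round(Int, quantile(nnz_per_col, q)))
end
println("Average: ", round(Int, mean(nnz_per_col)), ", density: ", nnz(X) / length(X))
``````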

It may be overkill to call `sample!` when half of the columns only have a single non-zero value. What sampling algorithm did you have in mind? `seqsample_d!`?

I’m thinking of something much simpler. When you are sampling from a range (or any array with unique elements), and the fraction of samples is small, you can just sample uniformly and check that the samples are unique; the probability of a collision is quite low so you will only need to re-sample a few times.

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    n, k = length(a), length(x)
    k <= n || error("length(x) should not exceed length(a)")
    while true
        for i in 1:k; x[i] = rand(rng, a); end  # sample with replacement
        sort!(x)
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return x
    end
end
``````

(You can save the final `shuffle!` call in your code if you are willing to pass in a buffer array, or if you don’t care about the order. In your case, since you are calling this repeatedly, and only for lengths ≤ some threshold, I would just pass a preallocated buffer array and use it to cache the non-sorted values.)

1 Like

While the sparsity pattern of the matrix `X` is not constant, `X.colptr` is.
I have to evaluate `shuffle_sparse_columns!(X, columns)` repeatedly for a smaller and smaller subset of all the columns, so I wonder if the `for` loop might somehow benefit from memoization (see the sketch after the pseudocode below).

``````
# pseudocode
X = sprand(M, N, p)
Xcopy = copy(X)
function f(X, Xcopy)
    M, N = size(X)
    columns = 1:N
    for _ in 1:num_iters
        shuffle_sparse_columns!(Xcopy, columns)
        # Some more work with X and Xcopy
        columns = get_next_columns(X, Xcopy)
    end
    ...
end
f(X, Xcopy)
``````
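
For instance, since `X.colptr` never changes, the per-column ranges could be precomputed once and reused. A rough sketch (`shuffle_sparse_columns_cached!` is a made-up name, and since `SparseArrays.nzrange` already computes the same range in O(1), the savings may well be negligible):

``````
using Random, SparseArrays, StatsBase

# Each column's slice of rowval/nzval, computed once; valid as long as
# X.colptr never changes.
colranges = [X.colptr[col]:(X.colptr[col+1] - 1) for col in 1:size(X, 2)]

function shuffle_sparse_columns_cached!(X::SparseMatrixCSC, columns, colranges)
    M = size(X, 1)
    for col in columns
        r = colranges[col]
        sample!(1:M, view(X.rowval, r); replace=false, ordered=true)
        shuffle!(view(X.nzval, r))
    end
end
``````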

That is, with a preallocated buffer as suggested earlier:

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray, buffer::AbstractArray)
    k = length(x)
    k ≤ length(a) || error("length(x) should not exceed length(a)")
    k ≤ length(buffer) || throw(DimensionMismatch())
    while true
        for i in 1:k; x[i] = buffer[i] = rand(rng, a); end  # cache the unsorted draws
        sort!(x)                                            # sort only to detect duplicates
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return copyto!(x, 1, buffer, 1, k)      # restore the unsorted order
    end
end
``````

Note that the probability of a collision here is exactly the birthday problem. With `length(x) == k` and `length(a) == n`, the probability of succeeding on the first try is `≈ exp(-k^2/(2n))` for `k << n`. If you want this to be, say, `> 90%`, then you should use the trivial algorithm for `k < sqrt(-2n*log(0.9))`. For `n = 3500` as in your example above, this means using the trivial algorithm for `k < 28`.
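
As a quick sanity check of that threshold (plain arithmetic; `k_threshold` is just a throwaway helper):

``````
# Largest k for which the trivial sampler succeeds on the first try with
# probability > p, per the birthday-problem estimate exp(-k^2/(2n)) > p.
k_threshold(n, p) = sqrt(-2n * log(p))

k_threshold(3500, 0.9)  # ≈ 27.2, so use the trivial algorithm for k < 28
``````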

2 Likes

Note that StatsBase already special-cases sampling 1 or 2 elements.

It might be nice to add an optimized algorithm for small samples of ranges to StatsBase, similar to the above.
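
For illustration, the ordered special cases for `k ≤ 2` might look something like this (a sketch only, not StatsBase's actual code; `sample_ordered_small!` is a made-up name):

``````
using Random, StatsBase

function sample_ordered_small!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    k = length(x)
    if k == 1
        x[1] = rand(rng, a)
    elseif k == 2
        i, j = rand(rng, a), rand(rng, a)
        while i == j               # redraw until the two samples are distinct
            j = rand(rng, a)
        end
        x[1], x[2] = minmax(i, j)  # emit the pair in order
    else
        sample!(rng, a, x; replace=false, ordered=true)  # general fallback
    end
    return x
end
``````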

3 Likes

(`Random.randsubseq` and `Random.randsubseq!` sample without replacement, but they perform Bernoulli sampling, which doesn’t produce a fixed number of samples. They’re used in the implementation of `sprand`.)
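
For illustration, `randsubseq` takes an inclusion probability rather than a sample count, so the length of the result varies from call to call:

``````
using Random

# Each element of 1:20 is kept independently with probability 0.2, so the
# result is duplicate-free and ordered, but its length is Binomial(20, 0.2).
s = randsubseq(1:20, 0.2)
length(s)  # random; 4 on average
``````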

1 Like

I hadn’t considered those cases in `sample!`, as they are only reached when `ordered=false` is set.
Surely the same special-casing could apply to the ordered case as well.

Oh right; they could have easily included those in the ordered case, but apparently didn’t for some reason.

1 Like