# Shuffle values and sample indices of a sparse matrix

I need to shuffle entries in selected `columns` of a large sparse matrix `X` with dimensions `M x N`. I have written a function that does this in place, as I need to perform the shuffling repeatedly on the same data.

``````
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
    M, N = size(X)
    for col in columns
        colptr_l = X.colptr[col]
        colptr_u = X.colptr[col+1] - 1
        rowvals = @view X.rowval[colptr_l:colptr_u]
        nzvals = @view X.nzval[colptr_l:colptr_u]
        sample!(1:M, rowvals; replace=false, ordered=true)
        shuffle!(nzvals)
    end
end
``````

This is significantly faster than slicing the array with numpy/scipy, but falls behind scikit-learn's implementation, `shuffle`, as the size of `X` increases (even though `X` is automatically converted to CSR format there and manually converted back to CSC).

Below is a (hidden) MWE and some benchmark results.

Load packages, define functions and benchmark
``````
using BenchmarkTools
using PyCall
using Random
using SparseArrays
using StatsBase

"""
shuffle_columns!(X, cols)

Entries in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_columns!(X, cols)
for col in cols
shuffle!(@view X[:,col])
end
end

"""
shuffle_sparse_columns!(X::SparseMatrixCSC, columns)

The nonzero entires in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
M, N = size(X)
for col in columns
colptr_l = X.colptr[col]
colptr_u = X.colptr[col+1] - 1
rowvals = @view X.rowval[colptr_l:colptr_u]
nzvals = @view X.nzval[colptr_l:colptr_u]
sample!(1:M, rowvals; replace=false, ordered=true)
shuffle!(nzvals)
end
end

py"""
import numpy as np
from sklearn.utils import shuffle
from scipy.sparse import csc_matrix, csr_matrix

def sklearn_shuffle_and_convert_csc(X):
X_shuffled = shuffle(X)
return csc_matrix(X_shuffled)

def py_shuffle_col(X, columns):
(M, N) = X.shape
for col in columns:
colptr_l = X.indptr[col]
colptr_u = X.indptr[col+1]
X.indices[colptr_l:colptr_u] = np.random.choice(M, colptr_u - colptr_l, replace=False)
X.indices[colptr_l:colptr_u].sort()
np.random.shuffle(X.data[colptr_l:colptr_u])
return X
"""

function bench(M, N, p, cols=nothing)
    # Benchmark Julia
    Random.seed!(0);
    X = sprand(M, N, p)
    println("Shuffle each column of a CSC matrix in Julia")
    @btime shuffle_sparse_columns!($X, $cols)

    # Benchmark sklearn (CSC -> CSR -> CSC)
    Random.seed!(0);
    X = sprand(M, N, p)
    py_sklearn_shuffle_and_convert = py"sklearn_shuffle_and_convert_csc"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with sklearn (to CSR) and convert to CSC")
    @btime $py_sklearn_shuffle_and_convert($X_py_csc)

    # Benchmark sklearn (CSC -> CSR)
    py_sklearn_shuffle = py"shuffle"
    X_py_csc = py"csc_matrix"(X)  # Restore original matrix
    println("Shuffle CSC with sklearn")
    @btime $py_sklearn_shuffle($X_py_csc)

    # Benchmark sklearn (CSR)
    X_py_csr = py"csr_matrix"(X)
    println("Shuffle CSR with sklearn")
    @btime $py_sklearn_shuffle($X_py_csr)

    # Benchmark numpy slicing (CSC) of rowval and nzval with 0-indexed columns
    py_shuffle_col = py"py_shuffle_col"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with numpy slicing")
    @btime $py_shuffle_col($X_py_csc, $(cols .- 1))

    # Dense shuffle for comparison
    Xdense = Array(X)
    println("Dense shuffle")
    @btime shuffle_columns!($Xdense, $cols)
end
``````
Small `X`:
``````
M, N, p = 50, 10, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
444.879 ns (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
354.833 μs (3 allocations: 144 bytes)
Shuffle CSC with sklearn
251.791 μs (3 allocations: 144 bytes)
Shuffle CSR with sklearn
174.165 μs (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
203.838 μs (7 allocations: 448 bytes)
Dense shuffle
5.475 μs (0 allocations: 0 bytes)
``````

Large `X`:

``````
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
80.976 ms (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
12.371 ms (3 allocations: 144 bytes)
Shuffle CSC with sklearn
5.398 ms (3 allocations: 144 bytes)
Shuffle CSR with sklearn
1.432 ms (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
1.020 s (7 allocations: 448 bytes)
Dense shuffle
611.732 ms (0 allocations: 0 bytes)
``````

I have some questions:

1. Is there a faster way to `sample!` from the row indices `1:M` than is done in `shuffle_sparse_columns!`?
2. Is the CSR format more efficient for `M < N`? If so, are there any suggestions on how to adapt `shuffle_sparse_columns!` for a custom “`SparseMatrixCSR`”?

Slightly unrelated:

1. Is there a suitable way to benchmark memory usage/allocations with PyCall?

Any suggestions on how to make the performance comparable with scikit-learn's are appreciated.

First, I would try profiling the code to see where the bottleneck is; I find ProfileVega.jl helpful for this.

1 Like

`sample!` is the bottleneck for this matrix size.

``````
using ProfileVega
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
columns = 1:N
ProfileVega.@profview shuffle_sparse_columns!(X, columns)
``````

About 90% of the backtraces are in `seqsample_a!` in StatsBase. Most of the time is spent in the innermost `while` loop.

The smaller example ends up using `seqsample_c!`.

If sampling *with* replacement from `1:M` is acceptable, you can try `rand!(rowvals, 1:M)` and then order the result in place with `sort!`.

I’m not aware of any function in the standard library that samples without replacement.

``````
julia> Random.seed!(0); x = fill(0, 5); rand!(x, 1:6)
5-element Array{Int64,1}:
 5
 1
 3
 1
 6
``````
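
Continuing the example, `sort!` then orders it in place; note the duplicate `1`, which is exactly the with-replacement behavior mentioned above:

``````
julia> sort!(x)
5-element Array{Int64,1}:
 1
 1
 3
 5
 6
``````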

What is the typical length of `rowvals` (i.e. the typical number of nonzeros per column)? If it is very small compared to `M`, there might be a much faster sampling algorithm.

1 Like

Is it clear that `sklearn.utils.shuffle` is doing the same thing? The documentation seems rather vague.

Looking at the source code, it seems to be using a much simpler algorithm: a single shuffle of the rows.

3 Likes

Nice catch! `sklearn.utils.shuffle` is just shuffling the rows instead of shuffling the row values in each column.

``````
M = 5; N = 3;
X = zeros(Int, M, N);
for idx in eachindex(X)
    X[idx] = idx
end
X_py_csc = py"csc_matrix"(X)
``````
``````julia> py"shuffle"(X_py_csc).A
5×3 Array{Int64,2}:
5  10  15
3   8  13
1   6  11
4   9  14
2   7  12
``````

I’ve got a TF-IDF matrix of size 3500 x 44917 from a corpus of 3500 documents. Later, I may need to scale up to ~100,000 documents.

``````
Lengths of rowval:

50-th percentile: 1
51-th percentile: 2
75-th percentile: 7
90-th percentile: 50
95-th percentile: 152
99-th percentile: 805
Average: 37, density: 0.011
``````
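
For reference, a sketch of how such statistics can be computed directly from the CSC structure (assuming `X` is the TF-IDF matrix; this is not necessarily the exact script that produced the output above):

``````
using SparseArrays, Statistics

nnz_per_col = diff(X.colptr)  # number of stored entries per column

for q in (0.50, 0.51, 0.75, 0.90, 0.95, 0.99)
    println(round(Int, 100q), "-th percentile: ", round(Int, quantile(nnz_per_col, q)))
end
println("Average: ", round(Int, mean(nnz_per_col)), ", density: ", nnz(X) / length(X))
``````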

It may be overkill to call `sample!` when half of the columns only have a single non-zero value. What sampling algorithm did you have in mind? `seqsample_d!`?

I’m thinking of something much simpler. When you are sampling from a range (or any array with unique elements), and the fraction of samples is small, you can just sample uniformly and check that the samples are unique; the probability of a collision is quite low so you will only need to re-sample a few times.

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    n, k = length(a), length(x)
    k <= n || error("length(x) should not exceed length(a)")
    while true
        for i in 1:k; x[i] = rand(rng, a); end  # sample with replacement
        sort!(x)
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return x
    end
end
``````

(You can save the final `shuffle!` call in your code if you are willing to pass in a buffer array, or if you don’t care about the order. In your case, since you are calling this repeatedly, and only for lengths ≤ some threshold, I would just pass a preallocated buffer array and use it to cache the non-sorted values.)

1 Like

While the sparsity pattern of the matrix `X` is not constant, `X.colptr` is.
I have to evaluate `shuffle_sparse_columns!(X, columns)` repeatedly for a smaller and smaller subset of all the columns, so I wonder if the `for` loop might somehow benefit from memoization (see the sketch after the pseudocode below).

``````
# pseudocode
X = sprand(M, N, p)
Xcopy = copy(X)
function f(X, Xcopy)
    M, N = size(X)
    columns = 1:N
    for _ in 1:num_iters
        shuffle_sparse_columns!(Xcopy, columns)
        # Some more work with X and Xcopy
        columns = get_next_columns(X, Xcopy)
    end
    ...
end
f(X, Xcopy)
``````
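
For instance, since `X.colptr` never changes, the per-column ranges could be precomputed once and reused. A rough sketch (`shuffle_sparse_columns_cached!` is a made-up name, and since `SparseArrays.nzrange` already computes the same range in O(1), the savings may well be negligible):

``````
using Random, SparseArrays, StatsBase

# Each column's slice of rowval/nzval, computed once; valid as long as
# X.colptr never changes.
colranges = [X.colptr[col]:(X.colptr[col+1] - 1) for col in 1:size(X, 2)]

function shuffle_sparse_columns_cached!(X::SparseMatrixCSC, columns, colranges)
    M = size(X, 1)
    for col in columns
        r = colranges[col]
        sample!(1:M, view(X.rowval, r); replace=false, ordered=true)
        shuffle!(view(X.nzval, r))
    end
end
``````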

That is, with a preallocated buffer as suggested earlier:

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray, buffer::AbstractArray)
    k = length(x)
    k ≤ length(a) || error("length(x) should not exceed length(a)")
    k ≤ length(buffer) || throw(DimensionMismatch())
    while true
        for i in 1:k; x[i] = buffer[i] = rand(rng, a); end  # cache the unsorted draws
        sort!(x)                                            # sort only to detect duplicates
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return copyto!(x, 1, buffer, 1, k)      # restore the unsorted order
    end
end
``````

Note that the probability of a collision here is exactly the birthday problem. With `length(x) == k` and `length(a) == n`, the probability of succeeding on the first try is `≈ exp(-k^2/(2n))` for `k << n`. If you want this to be, say, `> 90%`, then you should use the trivial algorithm for `k < sqrt(-2n*log(0.9))`. For `n = 3500` as in your example above, this means using the trivial algorithm for `k < 28`.
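
As a quick sanity check of that threshold (plain arithmetic; `k_threshold` is just a throwaway helper):

``````
# Largest k for which the trivial sampler succeeds on the first try with
# probability > p, per the birthday-problem estimate exp(-k^2/(2n)) > p.
k_threshold(n, p) = sqrt(-2n * log(p))

k_threshold(3500, 0.9)  # ≈ 27.2, so use the trivial algorithm for k < 28
``````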

2 Likes

Note that StatsBase already special-cases sampling 1 or 2 elements.

It might be nice to add an optimized algorithm for small samples of ranges to StatsBase, similar to the above.
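
For illustration, the ordered special cases for `k ≤ 2` might look something like this (a sketch only, not StatsBase's actual code; `sample_ordered_small!` is a made-up name):

``````
using Random, StatsBase

function sample_ordered_small!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    k = length(x)
    if k == 1
        x[1] = rand(rng, a)
    elseif k == 2
        i, j = rand(rng, a), rand(rng, a)
        while i == j               # redraw until the two samples are distinct
            j = rand(rng, a)
        end
        x[1], x[2] = minmax(i, j)  # emit the pair in order
    else
        sample!(rng, a, x; replace=false, ordered=true)  # general fallback
    end
    return x
end
``````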

3 Likes

(`Random.randsubseq` and `Random.randsubseq!` sample without replacement, but they perform Bernoulli sampling, which doesn’t produce a fixed number of samples. They’re used in the implementation of `sprand`.)
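
For illustration, `randsubseq` takes an inclusion probability rather than a sample count, so the length of the result varies from call to call:

``````
using Random

# Each element of 1:20 is kept independently with probability 0.2, so the
# result is duplicate-free and ordered, but its length is Binomial(20, 0.2).
s = randsubseq(1:20, 0.2)
length(s)  # random; 4 on average
``````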

1 Like

I hadn’t considered those cases in `sample!`, as they are only reached when `ordered=false` is set.
Surely the same special-casing could apply to the ordered case as well.

Oh right; they could have easily included those in the ordered case, but apparently didn’t for some reason.

1 Like