# Shuffle values and sample indices of a sparse matrix

I need to shuffle entries in selected columns of a large, sparse matrix X with dimensions M x N . I have written a function that does this in-place, as I need to perform the shuffling repeatedly on the same data.

function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
M, N = size(X)
for col in columns
colptr_l = X.colptr[col]
colptr_u = X.colptr[col+1] - 1
rowvals = @view X.rowval[colptr_l:colptr_u]
nzvals = @view X.nzval[colptr_l:colptr_u]
sample!(1:M, rowvals; replace=false, ordered=true)
shuffle!(nzvals)
end
end

This is significantly faster than slicing the array with numpy/scipy, but trails behind scikit learnās implementation, shuffle, when the size of X increases (despite the fact that X is automatically converted to CSR format and manually converted back to CSC).

Below is a (hidden) MWE and some benchmark results.

Load packages, define functions and benchmark
using BenchmarkTools
using PyCall
using Random
using SparseArrays
using StatsBase

"""
shuffle_columns!(X, cols)

Entries in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_columns!(X, cols)
for col in cols
shuffle!(@view X[:,col])
end
end

"""
shuffle_sparse_columns!(X::SparseMatrixCSC, columns)

The nonzero entires in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
M, N = size(X)
for col in columns
colptr_l = X.colptr[col]
colptr_u = X.colptr[col+1] - 1
rowvals = @view X.rowval[colptr_l:colptr_u]
nzvals = @view X.nzval[colptr_l:colptr_u]
sample!(1:M, rowvals; replace=false, ordered=true)
shuffle!(nzvals)
end
end

py"""
import numpy as np
from sklearn.utils import shuffle
from scipy.sparse import csc_matrix, csr_matrix

def sklearn_shuffle_and_convert_csc(X):
X_shuffled = shuffle(X)
return csc_matrix(X_shuffled)

def py_shuffle_col(X, columns):
(M, N) = X.shape
for col in columns:
colptr_l = X.indptr[col]
colptr_u = X.indptr[col+1]
X.indices[colptr_l:colptr_u] = np.random.choice(M, colptr_u - colptr_l, replace=False)
X.indices[colptr_l:colptr_u].sort()
np.random.shuffle(X.data[colptr_l:colptr_u])
return X
"""

function bench(M, N, p, cols=nothing)
# Benchmark julia
Random.seed!(0);
X = sprand(M, N, p)
println("Shuffle each column of a CSC matrix in Julia")
@btime shuffle_sparse_columns!(\$X, \$cols)

# Benchmark sklearn (CSC -> CSR -> CSC)
Random.seed!(0);
X = sprand(M, N, p)
py_sklearn_shuffle_and_convert = py"sklearn_shuffle_and_convert_csc"
X_py_csc = py"csc_matrix"(X)
println("Shuffle CSC with sklearn (to CSR) and convert to CSC")
@btime \$py_sklearn_shuffle_and_convert(\$X_py_csc)

# Benchmark sklearn (CSC -> CSR)
py_sklearn_shuffle = py"shuffle"
X_py_csc = py"csc_matrix"(X)  # Restore original matrix
println("Shuffle CSC with sklearn")
@btime \$py_sklearn_shuffle(\$X_py_csc)

# Benchmark sklearn (CSR)
X_py_csr = py"csr_matrix"(X)
println("Shuffle CSR with sklearn")
@btime \$py_sklearn_shuffle(\$X_py_csr)

# Benchmark numpy slicing (CSC) of rowval and nzval with 0-indexed columns
py_shuffle_col = py"py_shuffle_col"
X_py_csc = py"csc_matrix"(X)
println("Shuffle CSC with numpy slicing")
@btime \$py_shuffle_col(\$X_py_csc, \$(cols .- 1))

# Dense shuffle for comparison
Xdense = Array(X)
println("Dense shuffle")
@btime shuffle_columns!(\$Xdense, \$cols)
end
Small `X`
M, N, p = 50, 10, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
444.879 ns (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
354.833 Ī¼s (3 allocations: 144 bytes)
Shuffle CSC with sklearn
251.791 Ī¼s (3 allocations: 144 bytes)
Shuffle CSR with sklearn
174.165 Ī¼s (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
203.838 Ī¼s (7 allocations: 448 bytes)
Dense shuffle
5.475 Ī¼s (0 allocations: 0 bytes)

Large X:

M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
80.976 ms (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
12.371 ms (3 allocations: 144 bytes)
Shuffle CSC with sklearn
5.398 ms (3 allocations: 144 bytes)
Shuffle CSR with sklearn
1.432 ms (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
1.020 s (7 allocations: 448 bytes)
Dense shuffle
611.732 ms (0 allocations: 0 bytes)

I have some questions:

1. Is there a faster way to sample! from the row indices 1:M than is done in shuffle_sparse_columns!?
2. Is the CSR format more efficient for M < N? If so, are there any suggestions on how to adapt shuffle_sparse_columns! for a custom āSparseMatrixCSRā?

Slightly unrelated:

1. Is there a suitable way to benchmark memory usage/allocations with PyCall?

Any suggestions on how to get the performance comparable with scikit learn are appreciated.

First, I would try profiling the code to see where the bottleneck is. I find the following helpful:

1 Like

sample! is the bottleneck for this matrix size.

using ProfileVega
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
columns = 1:N
ProfileVega.@profview shuffle_sparse_columns!(X, columns)

About 90% of the backtraces are in seqsample_a! in StatsBase. Most of it is spent in the innermost while loop.

The smaller example ends up using seqsample_c!.

If you are sampling without replacement from 1:M, you can try a rand!(rowvals, 1:M) and then order it in place with sort!.

Iām not aware of any function in the standard library that samples without replacement.

julia> Random.seed!(0); x = fill(0, 5); rand!(x, 1:6)
5-element Array{Int64,1}:
5
1
3
1
6

What is the typical length of rowvals (i.e. the typically number of nonzeros per column)? If it is very small compared to M, there might be a much faster sampling algorithm.

1 Like

Is it clear that sklearn.utils.shuffle is doing the same thing? The documentation seems rather vague.

Looking at the source code, it seems that itās using a much simpler algorithm with a single shuffle! call.

3 Likes

Nice catch! sklearn.utils.shuffle is just shuffling the rows instead of shuffling the row values in each column.

M = 5; N = 3;
X = zeros(Int,M,N);
for idx in eachindex(X)
X[idx] = idx
end
X_py_csc = py"csc_matrix"(X)
julia> py"shuffle"(X_py_csc).A
5Ć3 Array{Int64,2}:
5  10  15
3   8  13
1   6  11
4   9  14
2   7  12

Iāve got a TF-IDF matrix of size 3500 x 44917 from a corpus of 3500 documents. Later, I may need to scale up to ~100,000 documents.

Lengths of rowval:

50-th percentile: 1
51-th percentile: 2
75-th percentile: 7
90-th percentile: 50
95-th percentile: 152
99-th percentile: 805
Average: 37, density: 0.011

It may be overkill to call sample! when half of the columns only have a single non-zero value. What sampling algorithm did you have in mind? seqsample_d!?

Iām thinking of something much simpler. When you are sampling from a range (or any array with unique elements), and the fraction of samples is small, you can just sample uniformly and check that the samples are unique; the probability of a collision is quite low so you will only need to re-sample a few times.

function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
n, k = length(a), length(x)
k <= n || error("length(x) should not exceed length(a)")
while true
for i in 1:k; x[i] = rand(a); end
sort!(x)
allunique = true
for i = 2:k
if x[i] == x[i-1]
allunique = false
continue
end
end
allunique && return x
end
end

(You can save the final shuffle! call in your code if you are willing to pass in a buffer array, or if you donāt care about the order. In your case, since you are calling this repeatedly, and only for lengths ā¤ some threshold, I would just pass a preallocated buffer array and use it to cache the non-sorted values.)

1 Like

While the sparsity pattern of the matrix X is not constant, X.colptr is constant.
I have to evaluate shuffle_sparse_columns!(X, columns) repeatedly for a smaller and smaller subset of all the columns. I wonder if the for-loop might somehow benefit from memoization?

#pseduocode
X = sprand(M,N,p)
Xcopy = copy(X)
function f(X, Xcopy)
M,N = size(X)
columns = 1:N
for _ in range(num_iters)
shuffle_sparse_columns!(Xcopy, columns)
# Some more work with X and Xcopy
columns = get_next_columns(X, Xcopy)
end
...
end
f(X, Xcopy)

i.e.

function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray, buffer::AbstractArray)
k = length(x)
k ā¤ length(a) || error("length(x) should not exceed length(a)")
k ā¤ length(buffer) || throw(DimensionMismatch())
while true
for i in 1:k; x[i] = buffer[i] = rand(a); end
sort!(x)
allunique = true
for i = 2:k
if x[i] == x[i-1]
allunique = false
continue
end
end
allunique && return copyto!(x, 1, buffer, 1, k)
end
end

Note that the probability of a collision here is exactly the birthday problem. If length(x) == k and length(a) == n, the odds of succeeding on the first try are therefore ā exp(-k^2/2n) for k << n. If you want this to be, say, > 90%, then you should use the trivial algorithm for k < sqrt(-2n*log(0.9)). For n=3500 as in your example above, this means using the trivial algorithm for k < 28.

2 Likes

Note that StatsBase already special-cases sampling 1 or 2 elements.

It might be nice to add an optimized algorithm for small samples of ranges to StatsBase similar to above.

3 Likes

(Random.randsubseq and Random.randsubseq! sample without replacement, but they perform Bernoulli sampling, which doesnāt produce a fixed number of samples. Theyāre used in the implementation of sprand.)

1 Like

I didnāt consider those cases in sample! as they are only evaluated when setting ordered=false.
Surely this may apply to the ordered case as well.

Oh right; they could have easily included those in the ordered case, but apparently didnāt for some reason.

1 Like