# Shuffle values and sample indices of a sparse matrix

I need to shuffle entries in selected `columns` of a large sparse matrix `X` with dimensions `M x N`. I have written a function that does this in place, as I need to perform the shuffling repeatedly on the same data.

``````
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
    M, N = size(X)
    for col in columns
        colptr_l = X.colptr[col]
        colptr_u = X.colptr[col+1] - 1
        rowvals = @view X.rowval[colptr_l:colptr_u]
        nzvals = @view X.nzval[colptr_l:colptr_u]
        sample!(1:M, rowvals; replace=false, ordered=true)
        shuffle!(nzvals)
    end
end
``````

This is significantly faster than slicing the array with numpy/scipy, but falls behind scikit-learn's implementation, `shuffle`, as the size of `X` increases (even though `X` is automatically converted to CSR format there and manually converted back to CSC).

Below is a (hidden) MWE and some benchmark results.

Load packages, define functions and benchmark
``````
using BenchmarkTools
using PyCall
using Random
using SparseArrays
using StatsBase

"""
shuffle_columns!(X, cols)

Entries in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_columns!(X, cols)
for col in cols
shuffle!(@view X[:,col])
end
end

"""
shuffle_sparse_columns!(X::SparseMatrixCSC, columns)

The nonzero entires in each of the provided columns in the array are shuffled in-place.
"""
function shuffle_sparse_columns!(X::SparseMatrixCSC, columns)
M, N = size(X)
for col in columns
colptr_l = X.colptr[col]
colptr_u = X.colptr[col+1] - 1
rowvals = @view X.rowval[colptr_l:colptr_u]
nzvals = @view X.nzval[colptr_l:colptr_u]
sample!(1:M, rowvals; replace=false, ordered=true)
shuffle!(nzvals)
end
end

py"""
import numpy as np
from sklearn.utils import shuffle
from scipy.sparse import csc_matrix, csr_matrix

def sklearn_shuffle_and_convert_csc(X):
X_shuffled = shuffle(X)
return csc_matrix(X_shuffled)

def py_shuffle_col(X, columns):
(M, N) = X.shape
for col in columns:
colptr_l = X.indptr[col]
colptr_u = X.indptr[col+1]
X.indices[colptr_l:colptr_u] = np.random.choice(M, colptr_u - colptr_l, replace=False)
X.indices[colptr_l:colptr_u].sort()
np.random.shuffle(X.data[colptr_l:colptr_u])
return X
"""

function bench(M, N, p, cols=nothing)
    # Benchmark Julia
    Random.seed!(0);
    X = sprand(M, N, p)
    println("Shuffle each column of a CSC matrix in Julia")
    @btime shuffle_sparse_columns!($X, $cols)

    # Benchmark sklearn (CSC -> CSR -> CSC)
    Random.seed!(0);
    X = sprand(M, N, p)
    py_sklearn_shuffle_and_convert = py"sklearn_shuffle_and_convert_csc"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with sklearn (to CSR) and convert to CSC")
    @btime $py_sklearn_shuffle_and_convert($X_py_csc)

    # Benchmark sklearn (CSC -> CSR)
    py_sklearn_shuffle = py"shuffle"
    X_py_csc = py"csc_matrix"(X)  # Restore original matrix
    println("Shuffle CSC with sklearn")
    @btime $py_sklearn_shuffle($X_py_csc)

    # Benchmark sklearn (CSR)
    X_py_csr = py"csr_matrix"(X)
    println("Shuffle CSR with sklearn")
    @btime $py_sklearn_shuffle($X_py_csr)

    # Benchmark numpy slicing (CSC) of rowval and nzval with 0-indexed columns
    py_shuffle_col = py"py_shuffle_col"
    X_py_csc = py"csc_matrix"(X)
    println("Shuffle CSC with numpy slicing")
    @btime $py_shuffle_col($X_py_csc, $(cols .- 1))

    # Dense shuffle for comparison
    Xdense = Array(X)
    println("Dense shuffle")
    @btime shuffle_columns!($Xdense, $cols)
end
``````
Small `X`:
``````
M, N, p = 50, 10, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
444.879 ns (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
354.833 μs (3 allocations: 144 bytes)
Shuffle CSC with sklearn
251.791 μs (3 allocations: 144 bytes)
Shuffle CSR with sklearn
174.165 μs (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
203.838 μs (7 allocations: 448 bytes)
Dense shuffle
5.475 μs (0 allocations: 0 bytes)
``````

Large `X`:

``````
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
bench(M, N, p, [1:N...])

### Prints out
Shuffle each column of a CSC matrix in Julia
80.976 ms (0 allocations: 0 bytes)
Shuffle CSC with sklearn (to CSR) and convert to CSC
12.371 ms (3 allocations: 144 bytes)
Shuffle CSC with sklearn
5.398 ms (3 allocations: 144 bytes)
Shuffle CSR with sklearn
1.432 ms (3 allocations: 144 bytes)
Shuffle CSC with numpy slicing
1.020 s (7 allocations: 448 bytes)
Dense shuffle
611.732 ms (0 allocations: 0 bytes)
``````

I have some questions:

1. Is there a faster way to `sample!` from the row indices `1:M` than is done in `shuffle_sparse_columns!`?
2. Is the CSR format more efficient for `M < N`? If so, are there any suggestions on how to adapt `shuffle_sparse_columns!` for a custom “`SparseMatrixCSR`”?

Slightly unrelated:

1. Is there a suitable way to benchmark memory usage/allocations with PyCall?

Any suggestions on how to make the performance comparable with scikit-learn's are appreciated.

First, I would try profiling the code to see where the bottleneck is; I find ProfileVega.jl helpful for this.

1 Like

`sample!` is the bottleneck for this matrix size.

``````
using ProfileVega
M, N, p = 5000, 10000, 0.01;
Random.seed!(0);
X = sprand(M, N, p);
columns = 1:N
ProfileVega.@profview shuffle_sparse_columns!(X, columns)
``````

About 90% of the backtraces are in `seqsample_a!` in StatsBase. Most of the time is spent in the innermost `while` loop.

The smaller example ends up using `seqsample_c!`.

If sampling *with* replacement from `1:M` is acceptable, you can try `rand!(rowvals, 1:M)` and then order the result in place with `sort!`.

I’m not aware of any function in the standard library that samples without replacement.

``````
julia> Random.seed!(0); x = fill(0, 5); rand!(x, 1:6)
5-element Array{Int64,1}:
 5
 1
 3
 1
 6
``````
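
Continuing the example, `sort!` then orders it in place; note the duplicate `1`, which is exactly the with-replacement behavior mentioned above:

``````
julia> sort!(x)
5-element Array{Int64,1}:
 1
 1
 3
 5
 6
``````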

What is the typical length of `rowvals` (i.e. the typical number of nonzeros per column)? If it is very small compared to `M`, there might be a much faster sampling algorithm.

1 Like

Is it clear that `sklearn.utils.shuffle` is doing the same thing? The documentation seems rather vague.

Looking at the source code, it seems to be using a much simpler algorithm: a single shuffle of the rows.

3 Likes

Nice catch! `sklearn.utils.shuffle` is just shuffling the rows instead of shuffling the row values in each column.

``````
M = 5; N = 3;
X = zeros(Int, M, N);
for idx in eachindex(X)
    X[idx] = idx
end
X_py_csc = py"csc_matrix"(X)
``````
``````julia> py"shuffle"(X_py_csc).A
5×3 Array{Int64,2}:
5  10  15
3   8  13
1   6  11
4   9  14
2   7  12
``````

I’ve got a TF-IDF matrix of size 3500 x 44917 from a corpus of 3500 documents. Later, I may need to scale up to ~100,000 documents.

``````
Lengths of rowval:

50-th percentile: 1
51-th percentile: 2
75-th percentile: 7
90-th percentile: 50
95-th percentile: 152
99-th percentile: 805
Average: 37, density: 0.011
``````
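
For reference, a sketch of how such statistics can be computed directly from the CSC structure (assuming `X` is the TF-IDF matrix; this is not necessarily the exact script that produced the output above):

``````
using SparseArrays, Statistics

nnz_per_col = diff(X.colptr)  # number of stored entries per column

for q in (0.50, 0.51, 0.75, 0.90, 0.95, 0.99)
    println(round(Int, 100q), "-th percentile: ", round(Int, quantile(nnz_per_col, q)))
end
println("Average: ", round(Int, mean(nnz_per_col)), ", density: ", nnz(X) / length(X))
``````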

It may be overkill to call `sample!` when half of the columns only have a single non-zero value. What sampling algorithm did you have in mind? `seqsample_d!`?

I’m thinking of something much simpler. When you are sampling from a range (or any array with unique elements), and the fraction of samples is small, you can just sample uniformly and check that the samples are unique; the probability of a collision is quite low so you will only need to re-sample a few times.

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    n, k = length(a), length(x)
    k <= n || error("length(x) should not exceed length(a)")
    while true
        for i in 1:k; x[i] = rand(rng, a); end  # sample with replacement
        sort!(x)
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return x
    end
end
``````

(You can save the final `shuffle!` call in your code if you are willing to pass in a buffer array, or if you don’t care about the order. In your case, since you are calling this repeatedly, and only for lengths ≤ some threshold, I would just pass a preallocated buffer array and use it to cache the non-sorted values.)

1 Like

While the sparsity pattern of the matrix `X` is not constant, `X.colptr` is.
I have to evaluate `shuffle_sparse_columns!(X, columns)` repeatedly for a smaller and smaller subset of all the columns, so I wonder if the `for` loop might somehow benefit from memoization (see the sketch after the pseudocode below).

``````
# pseudocode
X = sprand(M, N, p)
Xcopy = copy(X)
function f(X, Xcopy)
    M, N = size(X)
    columns = 1:N
    for _ in 1:num_iters
        shuffle_sparse_columns!(Xcopy, columns)
        # Some more work with X and Xcopy
        columns = get_next_columns(X, Xcopy)
    end
    ...
end
f(X, Xcopy)
``````
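
For instance, since `X.colptr` never changes, the per-column ranges could be precomputed once and reused. A rough sketch (`shuffle_sparse_columns_cached!` is a made-up name, and since `SparseArrays.nzrange` already computes the same range in O(1), the savings may well be negligible):

``````
using Random, SparseArrays, StatsBase

# Each column's slice of rowval/nzval, computed once; valid as long as
# X.colptr never changes.
colranges = [X.colptr[col]:(X.colptr[col+1] - 1) for col in 1:size(X, 2)]

function shuffle_sparse_columns_cached!(X::SparseMatrixCSC, columns, colranges)
    M = size(X, 1)
    for col in columns
        r = colranges[col]
        sample!(1:M, view(X.rowval, r); replace=false, ordered=true)
        shuffle!(view(X.nzval, r))
    end
end
``````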

That is, with a preallocated buffer as suggested earlier:

``````
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray, buffer::AbstractArray)
    k = length(x)
    k ≤ length(a) || error("length(x) should not exceed length(a)")
    k ≤ length(buffer) || throw(DimensionMismatch())
    while true
        for i in 1:k; x[i] = buffer[i] = rand(rng, a); end  # cache the unsorted draws
        sort!(x)                                            # sort only to detect duplicates
        isunique = true
        for i = 2:k
            if x[i] == x[i-1]   # duplicate found; reject and re-sample
                isunique = false
                break
            end
        end
        isunique && return copyto!(x, 1, buffer, 1, k)      # restore the unsorted order
    end
end
``````

Note that the probability of a collision here is exactly the birthday problem. With `length(x) == k` and `length(a) == n`, the probability of succeeding on the first try is `≈ exp(-k^2/(2n))` for `k << n`. If you want this to be, say, `> 90%`, then you should use the trivial algorithm for `k < sqrt(-2n*log(0.9))`. For `n = 3500` as in your example above, this means using the trivial algorithm for `k < 28`.
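
As a quick sanity check of that threshold (plain arithmetic; `k_threshold` is just a throwaway helper):

``````
# Largest k for which the trivial sampler succeeds on the first try with
# probability > p, per the birthday-problem estimate exp(-k^2/(2n)) > p.
k_threshold(n, p) = sqrt(-2n * log(p))

k_threshold(3500, 0.9)  # ≈ 27.2, so use the trivial algorithm for k < 28
``````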

2 Likes

Note that StatsBase already special-cases sampling 1 or 2 elements.

It might be nice to add an optimized algorithm for small samples of ranges to StatsBase, similar to the above.
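
For illustration, the ordered special cases for `k ≤ 2` might look something like this (a sketch only, not StatsBase's actual code; `sample_ordered_small!` is a made-up name):

``````
using Random, StatsBase

function sample_ordered_small!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    k = length(x)
    if k == 1
        x[1] = rand(rng, a)
    elseif k == 2
        i, j = rand(rng, a), rand(rng, a)
        while i == j               # redraw until the two samples are distinct
            j = rand(rng, a)
        end
        x[1], x[2] = minmax(i, j)  # emit the pair in order
    else
        sample!(rng, a, x; replace=false, ordered=true)  # general fallback
    end
    return x
end
``````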

3 Likes

(`Random.randsubseq` and `Random.randsubseq!` sample without replacement, but they perform Bernoulli sampling, which doesn’t produce a fixed number of samples. They’re used in the implementation of `sprand`.)
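
For illustration, `randsubseq` takes an inclusion probability rather than a sample count, so the length of the result varies from call to call:

``````
using Random

# Each element of 1:20 is kept independently with probability 0.2, so the
# result is duplicate-free and ordered, but its length is Binomial(20, 0.2).
s = randsubseq(1:20, 0.2)
length(s)  # random; 4 on average
``````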

1 Like

I hadn’t considered those cases in `sample!`, as they are only reached when `ordered=false` is set.
Surely the same special-casing could apply to the ordered case as well.

Oh right; they could have easily included those in the ordered case, but apparently didn’t for some reason.

1 Like