Faster conversion from boolean `SparseMatrixCSC` to `BitMatrix`

I’m running into a bottleneck converting boolean `SparseMatrixCSC` objects into `BitMatrix` objects. I’m using a view of the matrix to select the columns that need to be converted into a `BitMatrix`, so all rows are included in the view.

I have a function, `fast_convert`, that already performs much better than the naive conversion, but I’m looking for guidance on how to make it even faster. I know the internal structure of a `BitMatrix` limits parallel processing, but are there any obvious ways to improve this?

using SparseArrays, BenchmarkTools  # BenchmarkTools for the @btime calls below

function fast_convert(S::SubArray{Bool, 2, <:SparseMatrixCSC{Bool}})::BitMatrix
    parent_mat = parent(S)
    _, col_range = parentindices(S)
    B = falses(size(S))
    # Walk only the stored entries of each selected column and set the
    # corresponding bits, instead of scanning every element of the view.
    @inbounds for (j, parent_col) in enumerate(col_range)
        for k in parent_mat.colptr[parent_col]:(parent_mat.colptr[parent_col+1]-1)
            parent_row = parent_mat.rowval[k]
            B[parent_row, j] = parent_mat.nzval[k]
        end
    end
    return B
end

m = sprand(Bool, 100_000, 100, 0.15)
v = view(m, :, 1:10)

@assert fast_convert(v) == BitMatrix(v)

@btime fast_convert(v);
@btime BitMatrix(v);

Can you comment on the context here? Why are you discarding the sparsity?

If you want to make it faster, you’ll want to work directly with the underlying array of 64-bit “chunks” rather than flipping one bit at a time.
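
To illustrate the layout (this pokes at the undocumented `chunks` field of `BitArray`, so treat it as a sketch rather than a stable API): the bits are packed column-major into a `Vector{UInt64}`, and a 0-based linear bit position `p` lives in chunk `p >> 6` at offset `p & 63`.

B = falses(100, 4)
p = 130                                     # 0-based linear position, i.e. element 131
B.chunks[(p >> 6) + 1] |= UInt64(1) << (p & 63)
@assert B[131]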

2 Likes

I’m developing a data mining package that uses matrices to implement frequent itemset mining algorithms.

I’m storing itemsets in a `SparseMatrixCSC` to begin with because itemset data is generally very sparse; storing the data as a `BitMatrix` from the start would take a huge amount of memory.

But when running these mining algorithms, we end up operating on a much smaller set of columns than the entire transactional dataset. I’m discarding sparsity because I need to do row-wise and column-wise `all` reductions on the boolean matrix to compute the set closures of many unique itemsets. The SparseMatrixCSC format is significantly slower and more memory-intensive for computing `all` than the bit-wise operations in a BitMatrix.

Here’s my closure function (altered to take an `AbstractMatrix` rather than a `BitMatrix`):

function closure(matrix::AbstractMatrix, itemset::Vector{Int})
    # rows where every item of the itemset is present
    rows = vec(all(view(matrix, :, itemset), dims=2))
    # columns that are true in all of those rows, i.e. the closure
    return findall(vec(all(view(matrix, rows, :), dims=1)))
end

And some benchmarks:

m = sprand(Bool, 100_000, 100, 0.15)
v = view(m, :, 1:30)
b = BitMatrix(v)
@btime closure(v, [1]);
  26.911 ms (19 allocations: 996.48 KiB)
@btime closure(b, [1]);
  510.625 μs (16 allocations: 129.22 KiB)

So essentially I need to find a way to translate between the coordinates in the `SparseMatrixCSC` `colptr` and `rowval` fields and the corresponding positions in the UInt64 chunks that make up the `BitMatrix`?
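
Something like this minimal single-threaded sketch (hypothetical `chunk_convert`; it assumes all stored entries are `true` and relies on the undocumented `chunks` field)?

function chunk_convert(S::SubArray{Bool, 2, <:SparseMatrixCSC{Bool}})
    A = parent(S)
    _, col_range = parentindices(S)
    m = size(S, 1)
    B = falses(size(S))
    ch = B.chunks                            # internal Vector{UInt64}
    @inbounds for (j, c) in enumerate(col_range)
        off = (j - 1) * m                    # linear offset of column j in B
        for k in nzrange(A, c)
            p = off + rowvals(A)[k] - 1      # 0-based linear bit position
            ch[(p >> 6) + 1] |= UInt64(1) << (p & 63)
        end
    end
    return B
end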

If I understand correctly, this can only help if the density in the selected columns is greater than about 1/64 ≈ 0.016: flipping bits one at a time costs one write per stored entry, while writing whole chunks costs m/64 writes per column, so the chunked approach only wins once a column has more than m/64 stored entries.

If you can assume that the stored entries in the sparse matrix are all `true` (otherwise you’d need to pass it through `dropzeros!` first, which adds cost), you can possibly gain something by manipulating the indices of the sparse matrix directly and avoiding the conversion altogether. Something like this (hoping I didn’t make any mistake in the logic of what your code was doing :sweat_smile:):


function closure2(m, cols, itemset)
    # stored row indices of column c (sorted, by CSC construction)
    col(c) = @view rowvals(m)[nzrange(m, c)]
    # rows shared by every column of the itemset
    rows = reduce(intersect, (Set(col(c)) for c in cols[itemset]))
    # keep the columns that are true in all of those rows
    [c for c in cols if all(insorted(x, col(c)) for x in rows)]
end
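
For reference, a hypothetical call mirroring the earlier benchmark setup (note that closure2 takes the parent matrix plus the selected columns instead of a view, and returns parent column indices):

closure2(m, 1:30, [1])
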
1 Like

Hello,

The problem with the BitMatrix approach is that the execution time saved in `closure` is just moved into the BitMatrix creation, right? You can run a fast closure while still using the SparseMatrixCSC.

If you do this, plus what I mentioned in the other post about quickly computing the sum over the columns, you should get a decent speedup. If BitMatrix isn’t used anywhere else, you can probably avoid going from SparseMatrixCSC to BitMatrix entirely. I can do a PR with both ideas to the package.

To make it easier to iterate on and improve, the package might benefit from a script containing automatically generated data and @btime/@benchmark runs of the important function calls.

using SparseArrays, BenchmarkTools

function closure(matrix::AbstractMatrix, itemset::Vector{Int})
    row_mask = vec(all(view(matrix,:, itemset), dims=2))
    return findall(vec(all(view(matrix, row_mask, :), dims=1)))
end

function find_complete_rows(X_csc, col_indices)
    # If given a view, work on the parent matrix directly. This assumes the
    # view selects all rows, so row indices need no translation.
    if hasproperty(X_csc, :parent)
        X_csc = X_csc.parent
    end

    n_rows, n_cols = size(X_csc, 1), length(col_indices)
    counts = zeros(Int, n_rows)

    # Count, for every row, how many of the selected columns store an entry
    # (this assumes the stored entries are all true).
    for c in col_indices
        start_ind = X_csc.colptr[c]
        end_ind = X_csc.colptr[c + 1] - 1
        for i in start_ind:end_ind
            counts[X_csc.rowval[i]] += 1
        end
    end

    # Rows hit by all selected columns are the "complete" ones.
    result = Int[]
    for (i, c) in enumerate(counts)
        if c == n_cols
            push!(result, i)
        end
    end

    return result
end

function closure2(m, cols, itemset)
    col(c) = @view rowvals(m)[nzrange(m, c)]
    rows = reduce(intersect, (Set(col(c)) for c in cols[itemset]))
    [c for c in cols if all(insorted(x, col(c)) for x in rows)]
end

function closure_optimized(matrix::AbstractMatrix, itemset::Vector{Int})
    # Replace the dense row reduction with the sparse counting pass above
    row_mask = find_complete_rows(matrix, itemset)
    return findall(vec(all(view(matrix, row_mask, :), dims=1)))
end

m = sprand(Bool, 100_000, 100, 0.15)
v = view(m, :, 1:30)
b = BitMatrix(v)
col_indices = [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12];

@assert closure_optimized(SparseMatrixCSC(v), col_indices) == closure(b, col_indices)
@assert closure_optimized(v, col_indices) == closure(b, col_indices)

println("\nTime closure with view of SparseMatrixCSC")
@btime closure(v, col_indices);

println("\nTime closure with view BitMatrix")
@btime b = BitMatrix(v);
@btime closure(b, col_indices)

println("\nTime closure2")
@btime closure2(m, 1:30, col_indices);

println("\nTime closure_optimized with SparseMatrixCSC")
# we want to avoid instantiating SparseMatrixCSC(v)
@btime closure_optimized(SparseMatrixCSC(v), col_indices); 

println("\nTime closure_optimized with view of SparseMatrixCSC")
@btime closure_optimized(v, col_indices); 

This will print

Time closure with view of SparseMatrixCSC
  97.138 ms (17 allocations: 880.62 KiB)

Time closure with view BitMatrix
  82.914 ms (3 allocations: 366.34 KiB)
  3.007 ms (14 allocations: 13.28 KiB)

Time closure2
  2.066 ms (151 allocations: 3.17 MiB)

Time closure_optimized with SparseMatrixCSC
  343.916 μs (16 allocations: 4.63 MiB)

Time closure_optimized with view of SparseMatrixCSC
  234.833 μs (10 allocations: 782.50 KiB)

This makes the closure much faster. It should go from ~85 ms (time to convert to BitMatrix plus time to call closure) down to 234.833 μs, which is roughly a 362x speedup!?

1 Like

The “best” approach to take will surely depend on the actual sparsity of the source matrix. You’re dealing with two very different classes of data structure, neither of which performs optimally with a straightforward eachindex-style iteration. And the optimal (internals-leaning) techniques take you in opposite directions!

A SparseMatrixCSC is fastest if you can just deal with the `true` indices themselves, which is what it looks like you’re doing. But a BitArray is fastest if you can gather up 64 dense bits at a time and write them all at once.
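
As a hedged illustration of that second direction (a sketch, not a full solution): gather 64 consecutive Bools into a UInt64 and store the whole chunk at once.

function pack_chunk(bits::AbstractVector{Bool})
    # assumes length(bits) == 64; bit b of the chunk holds bits[b + 1]
    c = UInt64(0)
    @inbounds for b in 0:63
        c |= UInt64(bits[b + 1]) << b
    end
    return c
end

# e.g. writing an entire 64-row column with a single store:
B = falses(64, 4)
B.chunks[1] = pack_chunk(rand(Bool, 64))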

Sorry, I don’t think I made it clear in my post. The closure function is called dozens, hundreds, or even thousands of times by the closed-rule mining algorithms that use it, whereas the BitMatrix conversion only happens once, during the preprocessing step.

I’m trying to optimize the conversion to BitMatrix because, in my testing, I’ve found that converting to a BitMatrix up front dramatically lowers the memory usage of the functions, and for all but one of the mining functions it provided massive increases in execution speed. I’ll take a look at your implementation, but those memory results are why I’m using the BitMatrix in the first place.

I actually was able to get a thread-safe multithreaded conversion working. It uses slightly more memory and more allocations (probably from the threading overhead), but it is significantly faster: I’m seeing a ~3x performance gain in my Julia environment with 4 threads.

using Base.Threads  # for nthreads() and @threads

function faster_convert(S::SubArray{Bool, 2, <:SparseMatrixCSC{Bool}})::BitMatrix
    parent_mat = parent(S)
    _, col_range = parentindices(S)

    m, n = size(S)
    B = falses(m, n)
    chunks = B.chunks  # internal Vector{UInt64} backing the BitMatrix
    total_chunks = length(chunks)

    # Partition the chunks (not the columns) across threads, so that each
    # 64-bit chunk is written by exactly one thread
    num_threads = nthreads()
    chunks_per_thread = cld(total_chunks, num_threads)

    @threads for thread_id in 1:num_threads
        # Chunk range owned by this thread
        chunk_start = (thread_id - 1) * chunks_per_thread + 1
        chunk_end = min(thread_id * chunks_per_thread, total_chunks)

        # 0-based bit range covered by those chunks
        bit_start = (chunk_start - 1) << 6
        bit_end = (chunk_end << 6) - 1

        # Columns whose bits intersect this bit range
        col_start = bit_start ÷ m + 1
        col_end = min(bit_end ÷ m + 1, n)

        # Process each column that intersects this thread's chunks
        @inbounds for (j, parent_col) in enumerate(col_range[col_start:col_end])
            col_offset = (col_start + j - 2) * m

            # Process each nonzero in the column
            for k in parent_mat.colptr[parent_col]:(parent_mat.colptr[parent_col+1]-1)
                row = parent_mat.rowval[k]
                abs_pos = col_offset + (row - 1)

                # Skip bits that belong to another thread's chunks; the owning
                # thread revisits this column and writes them itself
                (abs_pos < bit_start || abs_pos > bit_end) && continue

                chunk_idx = abs_pos >> 6
                bit_pos = abs_pos & 63
                chunks[chunk_idx + 1] |= UInt64(1) << bit_pos
            end
        end
    end
    return B
end

m = sprand(Bool, 100_000, 100, 0.15)
v = view(m, :, 1:50)
@btime BitMatrix(v);
  139.115 ms (3 allocations: 610.47 KiB)
@btime fast_convert(v);
  1.849 ms (3 allocations: 610.47 KiB)
@btime faster_convert(v);
  480.750 μs (25 allocations: 612.97 KiB)
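
For anyone reading along, the reason this is thread-safe: the work is partitioned by chunk rather than by column, so every UInt64 is written by exactly one thread and the |= updates never race. A column that straddles a chunk boundary is visited by both neighboring threads, each writing only the bits it owns.
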
1 Like

Still, depending on the density, working on the sparse matrix could be (much) faster and take less memory than working on the BitMatrix.

Just for reference, here’s a little improvement on closure2 which intersects sorted vectors instead of using Sets. In my (admittedly quick) benchmarking it beats the other implementations of closure in this thread (without taking the conversion step into account).

# In-place intersection of two sorted vectors; X is truncated to the result.
function sortedintersect!(X, Y)
    i = j = k = 1
    @inbounds while i <= length(X) && j <= length(Y)
        if X[i] == Y[j]
            X[k] = X[i]
            k += 1
            i += 1
            j += 1
        elseif X[i] < Y[j]
            i += 1
        else
            j += 1
        end
    end
    resize!(X, k - 1)
end

function closure3(m, cols, itemset)
    col(c) = @view rowvals(m)[nzrange(m, c)]
    # start from the rows of the first item, then intersect in the rest
    rows = rowvals(m)[nzrange(m, cols[itemset[1]])]
    foreach(c->sortedintersect!(rows, col(c)), cols[itemset[2:end]])
    [i for (i,c) in enumerate(cols) if all(insorted(x, col(c)) for x in rows)]
end
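
A quick illustrative example of the in-place helper (values arbitrary):

X = [1, 3, 5, 7]; Y = [3, 4, 5]
sortedintersect!(X, Y)
@assert X == [3, 5]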

EDIT: fixed a small bug

2 Likes

This is getting a bit out of hand :sweat_smile:, sorry! Implementing a simple issubset for sorted collections (I couldn’t find one in the standard library) and optimizing a bit gives a large further improvement. Long story short, the benchmarks by @davidbp give the results below, i.e. about a 200x speed improvement for closure with respect to your version (without even taking the BitMatrix conversion into account). And this is for density 0.15; the difference becomes larger for sparser matrices.

Time closure with view of SparseMatrixCSC
  195.608 ms (27 allocations: 881.69 KiB)

Time closure with view BitMatrix
  190.988 ms (4 allocations: 720.36 KiB)
  3.227 ms (14 allocations: 13.14 KiB)

Time closure4
  14.421 μs (7 allocations: 119.22 KiB)

Time closure_optimized with SparseMatrixCSC
  1.340 ms (27 allocations: 8.36 MiB)

Time closure_optimized with view of SparseMatrixCSC
  256.001 μs (18 allocations: 783.55 KiB)

Implementation of closure4:

# In-place intersection of two sorted vectors; X is truncated to the result.
function intersectsorted!(X, Y)
    if isempty(X) || isempty(Y)
        return empty!(X)
    end
    i = j = 1
    k = 0
    @inbounds while true
        if X[i] == Y[j]
            X[k += 1] = X[i]
            (i += 1) > length(X) && break
            (j += 1) > length(Y) && break
        elseif X[i] > Y[j]
            (j += 1) > length(Y) && break
        else
            (i += 1) > length(X) && break
        end
    end
    resize!(X, k)
end

# Check whether every element of sorted X also occurs in sorted Y.
function issubsetsorted(X, Y)
    isempty(X) && return true
    isempty(Y) && return false
    i = j = 1
    @inbounds while true
        if X[i] < Y[j]
            return false
        elseif X[i] > Y[j]
            (j += 1) > length(Y) && return false
        else
            (i += 1) > length(X) && return true
        end
    end
end


function closure4(m, cols, itemset)
    col(c) = @view rowvals(m)[nzrange(m, c)]
    rows = rowvals(m)[nzrange(m, cols[itemset[1]])]
    foreach(c->intersectsorted!(rows, col(c)), @views cols[itemset[2:end]]) 
    filter(i->issubsetsorted(rows, col(cols[i])), eachindex(cols))
end
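
As a quick (hedged) sanity check against the BitMatrix-based closure from earlier, assuming all stored entries are true (hence the dropzeros):

md = dropzeros(m)   # drop any stored false entries first
@assert closure4(md, 1:30, [1]) == closure(BitMatrix(view(md, :, 1:30)), [1])
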
2 Likes

It never gets out of hand! This Discourse is a goldmine to me because of exactly this sort of discussion: back and forth, new ideas, new implementations, easy-to-reproduce benchmarks… If there is ever an LLM that is great at generating Julia code, I am sure we are all contributing to it :D.

3 Likes