Multiplication of `BitMatrix`s for linear algebra modulo 2

Leo_I · February 25, 2024, 10:13pm

When dealing with matrices over \mathbb{Z}_2 that are enormours (e.g. of size 10^5\times10^5), it is desirable to use implementations that minimize the allocated memory. For instance:

julia> using FiniteFields
julia> sizeof(rand(FiniteFields.FFE{2}.(0:1),1000,1000))
4000000
julia> sizeof(bitrand(1000,1000))
125000

The latter offers a 32x saveup of space. Hence I would like to implement an efficient matrix-multiply function. My attempt:

using Random, Polyester, LoopVectorization, Tullio
function mult(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
    l, m = size(x)
    m_, n = size(y)
    @assert m==m_
    z = falses(l, n)
    for i=1:l 
        for k=1:n 
            for j=1:m  
                z[i,k] ⊻= x[i,j] & y[j,k] 
            end 
        end 
    end 
    return z 
end

julia> x, y = (bitrand(10^3,10^3) for _=1:2)
julia> @time z=mult(x,y);
 11.132636 seconds (3 allocations: 122.219 KiB)

Not very fast. Any way to improve?

I know that parallelizing on BitArrays is not thread-safe. This is written in its docstring: Due to its packed storage format, concurrent access to the elements of a BitArray where at least one of them is a write is not thread-safe.

When I tried to put @simd or LoopVectorization.@turbo before the for i=1:l loop, I get errors.

An alternative implementation is Matrix{Bool}, which requires 8x more space.

function mult(x ::Matrix{Bool}, y ::Matrix{Bool}) ::Matrix{Bool}
    l, m = size(x)
    m_, n = size(y)
    @assert m==m_
    z = fill(false, l, n)
    Polyester.@batch for i=1:l 
        for k=1:n 
            for j=1:m  
                z[i,k] ⊻= x[i,j] & y[j,k] 
            end 
        end 
    end 
    return z 
end

julia> x, y = (rand(Bool,10^3,10^3) for _=1:2)
julia> @time z=mult(x,y);
  0.439341 seconds (10 allocations: 976.922 KiB)

Here, using LoopVectorization.@tturbo for i= or Tullio.@tullio z[i,k] ⊻= x[i,j] & y[j,k] causes errors.

One would think that binary arithmetic is simplest and therefore fastest. Yet I am running out of ideas on how to speed this up. For comparison, multiplying Floats:

julia> x,y = (randn(n,n) for _=1:2);
julia> @time x*y;
  0.015831 seconds (2 allocations: 7.629 MiB)

P.S. Is there any hope that Octavian.jl would allow Bool entries with operations +=xor and *=&?

stevengj · February 25, 2024, 10:23pm

So many ways.

The key to operating efficiently on a BitArray is to work directly with the underlying packed storage in matrix.chunks, which lets you do operations on 64 bits at a time. (This will also let you parallelize safely.)

Once you implement the packed-optimized matrix multiply, then you can start thinking about all of the other matmult optimizations, from cache blocking and SIMD to parallelization.

Dan · February 26, 2024, 1:36am

A middle ground optimization can save some space and gain some speed:

julia> x, y = (rand((zero(UInt8),one(UInt8)),10^3,10^3) for _=1:2);

julia> sizeof(x)
1000000

julia> @btime ($x*$y .& true);
  65.374 ms (8 allocations: 1.94 MiB)

(your timing may vary depending on machine)

The idea is to use UInt8 which takes 8x the memory of a BitMatrix, but not 32-bits which the FiniteFields method uses.

The bit matrix multiply is calculated using the embedding of Z_2 as the least significant bit of UInt8 arithmetic.

It’s not a maximally optimized solution, but it looks pretty simple in code.

Oscar_Smith · February 26, 2024, 3:49am

Here’s a pretty bad implementation of the optimized BitMatrix version

function mult2(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
    l, m = size(x)
    m_, n = size(y)
    @assert m==m_
    z = falses(l, n)
    xt = BitMatrix(x')
    colsy = BitVector.(eachcol(y))
    colx = falses(m)
    @inbounds for i=1:l
	    colx .= @view xt[:,i]
	    for k in 1:n
	 	   zik = 0
	 	   for j in eachindex(colx.chunks)
	 		   zik ⊻= colsy[k].chunks[j] & colx.chunks[j]
	 	   end
	 	   z[i,k] = count_ones(zik) %2
	    end
    end 
    return z 
end

I benchmark this at roughly 300x faster than the @Leo_I’s BitMatrix mult, and ~6x faster than @Dan’s.

It would probably be a better to swap the loops and use BitVector.(eachrow(x)) rather than transposing x and BitVector.(eachcol(y)), but I think this makes the point.

foobar_lv2 · February 26, 2024, 10:45am

I’d do it the following way: First, you pad the relevant matrix sizes to multiples of 64. Then

julia> using LinearAlgebra, Random, BenchmarkTools

julia> function bitmulAndTranspose(A, B)
              res = falses((size(A,1), size(B,2)))
              AT = falses((size(A,2), size(A,1)))
              transpose!(AT, A)
              bitmul(res, AT, B)
              res
              end
bitmulAndTranspose (generic function with 1 method)

julia> function bitmul(res, AT, B)
                            (size(B,1) % 64 == 0 && size(AT,1) == size(B,1) && size(res) == (size(AT,2), size(B,2))) || begin 
                            @show size(AT), size(B), size(res)
                            throw("dim mismatch")
                            end
                            ATc = reshape(AT.chunks, (cld(size(AT, 1), 64), size(AT,2)))
                            Bc = reshape(B.chunks, (cld(size(B, 1), 64), size(B,2)))
                            #here you can do blocking / recursion / hilbert-curve / multithreading. Make it cache oblivious and make all cores busy.
                            @inbounds for i in 1:size(AT, 2)
                            for j in 1:size(B,2)
                            r=0
                            @simd for k = 1:size(ATc, 1)
                            r += count_ones(ATc[k, i] & Bc[k, j])
                            end
                            res[i, j] = isodd(r)
                            end
                            end
                            end
bitmul (generic function with 1 method)

julia> B0=bitrand(64, 3); A0=bitrand(4, 64); reference = isodd.(A0*B0); reference == bitmulAndTranspose(A0,B0)
true
julia> basemul(A,B)=isodd.(A*B);
julia> N=1<<10; B0=bitrand(N,N); A0=bitrand(N,N);
julia> @benchmark basemul($A0,$B0)
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  285.791 ms … 289.420 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     287.569 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   287.544 ms ±   1.063 ms  ┊ GC (mean ± σ):  0.02% ± 0.04%

  ▁      █ ▁  ▁   ▁       █ ▁      ▁▁    ▁▁▁     ▁▁    ▁      ▁  
  █▁▁▁▁▁▁█▁█▁▁█▁▁▁█▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁██▁▁▁▁███▁▁▁▁▁██▁▁▁▁█▁▁▁▁▁▁█ ▁
  286 ms           Histogram: frequency by time          289 ms <

 Memory estimate: 8.14 MiB, allocs estimate: 9.
julia> @benchmark bitmulAndTranspose($A0, $B0)
BenchmarkTools.Trial: 774 samples with 1 evaluation.
 Range (min … max):  6.095 ms …   9.673 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.431 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.460 ms ± 226.942 μs  ┊ GC (mean ± σ):  0.03% ± 0.36%

                  ▂█▃▄▄▄▆▅▃▂▁                                  
  ▂▁▁▂▁▁▂▃▃▅▄▃▅▄▆████████████▇▅▆▄▃▄▃▃▃▂▄▃▂▂▃▃▃▃▂▂▂▃▁▂▁▁▂▂▁▁▂▂ ▃
  6.1 ms          Histogram: frequency by time        7.01 ms <

 Memory estimate: 256.38 KiB, allocs estimate: 10.


julia> function nobitmulAndTranspose(A, B)
              res = falses((size(A,1), size(B,2)))
              AT = falses((size(A,2), size(A,1)))
              transpose!(AT, A)
              #bitmul(res, AT, B)
              res
              end
julia> @benchmark nobitmulAndTranspose($A0, $B0)
BenchmarkTools.Trial: 8794 samples with 1 evaluation.
 Range (min … max):  548.395 μs …  1.398 ms  ┊ GC (min … max): 0.00% … 32.02%
 Time  (median):     561.562 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   567.101 μs ± 35.237 μs  ┊ GC (mean ± σ):  0.35% ±  2.80%

  ▅▃▁  ██▆▅▄▃▂▂▁                                               ▂
  ███▇▆██████████▇▇▇▇▇█▇█▇▇▆▅▆▄▄▅██▇▇▆▆▄▅▆▅▆▅▄▄▅▁▃▃▄▁▁▄▃▁▄▁▁▄▃ █
  548 μs        Histogram: log(frequency) by time       678 μs <

 Memory estimate: 256.19 KiB, allocs estimate: 6.
julia> function mult2(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
           l, m = size(x)
           m_, n = size(y)
           @assert m==m_
           z = falses(l, n)
           xt = BitMatrix(x')
           colsy = BitVector.(eachcol(y))
           colx = falses(m)
           @inbounds for i=1:l
                   colx .= @view xt[:,i]
                   for k in 1:n
                           zik = 0
                           for j in eachindex(colx.chunks)
                                   zik ⊻= colsy[k].chunks[j] & colx.chunks[j]
                           end
                           z[i,k] = count_ones(zik) %2
                   end
           end 
           return z 
       end
mult2 (generic function with 1 method)

julia> @benchmark mult2($A0, $B0)
BenchmarkTools.Trial: 365 samples with 1 evaluation.
 Range (min … max):  13.225 ms …  15.297 ms  ┊ GC (min … max): 0.00% … 7.62%
 Time  (median):     13.599 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.721 ms ± 390.461 μs  ┊ GC (mean ± σ):  0.84% ± 2.31%

         ▁▆▂▂▃▄█▃▃                                              
  ▄▁▁▃▅▃▅██████████▆▆▇▆▅▄▄▂▃▂▃▂▃▁▁▂▁▁▁▁▁▁▁▁▂▁▂▁▁▃▂▂▁▂▂▃▃▂▆▄▂▃▄ ▃
  13.2 ms         Histogram: frequency by time         14.9 ms <

 Memory estimate: 4.66 MiB, allocs estimate: 3081.

The julia implementation of transpose! for bit-matrices is surprisingly good! I was very close to passing on this question because doing fast bitmatrix transpose is very annoying, but it’s batteries included for that.

PS. Oh, shiny trick in mult2: Accumulate with xor instead of +count_ones, only do the POPCNT at the end.

Alas, that’s slower on my machine. It should not be slower (popcount is hard to vectorize on x86, the built-in instruction only exists for 64bit words, not for big fat 256 or 512 bit registers). This is very very wrong and indicates that llvm messes up somewhere. You may need to write the inner loop by hand, using SIMD.jl

Leo_I · February 28, 2024, 6:39pm

Running the above solutions for the larger case of 16384\times16384 has timings:

julia> begin n=2^14
    x,y = (bitrand(n,n) for _=1:2);
    @time isodd.(x*y);
    @time mult2(x,y);
    @time bitmulAndTranspose(x,y);
    end
7856.312863 seconds (20 allocations: 2.031 GiB, 0.00% gc time)
166.081837 seconds (49.17 k allocations: 165.628 MiB, 0.07% gc time)
175.128970 seconds (19 allocations: 64.001 MiB)

So far, the most efficient solution seems to be Oscar_Smith’s. I’m leaving the thread open, perhaps there are even faster solutions. For comparison, multiplying matrices of same size consisting of Float64’s needs:

julia> n=2^14;   x,y = (randn(n,n) for _=1:2);   
julia> @time x*y;
 75.166939 seconds (2 allocations: 2.000 GiB, 0.01% gc time)

Oscar_Smith · February 28, 2024, 7:38pm

to get it actually fast for big matrices, you would have to implement blocking and multithreading. Multithreading should be pretty easy here. Blocking would be more annoying.

matthias314 · February 28, 2024, 7:49pm

I have a small package for linear algebra mod 2, called Modulo2.jl. With this package, matrix multiplication mod 2 is actually faster than with floats:

julia> n = 2^14; a = Modulo2.randomarray(n, n); b = Modulo2.randomarray(n, n); @time a * b;
 35.638093 seconds (3 allocations: 32.000 MiB)

julia> n = 2^14; a = rand(n, n); b = rand(n, n); @time a * b;
 43.566516 seconds (2 allocations: 2.000 GiB, 0.27% gc time)

This seems to be about 2.8 times faster than @Oscar_Smith’s (quickly written) solution.

The package provides a type ZZ2 for integers mod 2 and ZZ2Array (ZZ2Vector, ZZ2Matrix) for packed, padded arrays. Apart from basic matrix operations it has functions related to Gaussian elimination (rref, rank, inv, det) as well as some support for broadcasting.

The package is not yet in the registry, so one needs to install it via

pkg> add https://github.com/matthias314/Modulo2.jl

So far, the only documentation is via docstrings.

UPDATE: Modulo2.jl is now available from the registry.

Oscar_Smith · February 28, 2024, 8:46pm

This is very cool, but isn’t ZZ2Array just the same as BitArray? From what I can tell, the speed gain here comes from Modulo2.jl/src/Modulo2.jl at 9450574e928ff11ba6a6a64499ce6f4457688582 · matthias314/Modulo2.jl · GitHub which is manually vectorizing the colwise xor.

matthias314 · February 28, 2024, 8:57pm

I see two differences:

The elements of a ZZ2Array are of type ZZ2 instead of Bool, so that one gets the arithmetic operations via + and * instead of xor and &.
The size along the first axis of a ZZ2Array is padded to (currently) a multiple of 256 to make vectorized operations more efficient. There is no padding at all in a BitArray, not even to a byte boundary. For column operations (the internal function xor! you mentioned) this makes a big difference.

Of course, this is not rocket science, but the package has already been quite useful for me, and it might be for others, too.

Leo_I · February 29, 2024, 11:23am

With this package, matrix multiplication mod 2 is actually faster than with floats:

That is quite impressive!!! Thank you!

Topic		Replies	Views
Why mul! is so fast? General Usage question , linearalgebra	7	6881	November 26, 2019
Specialized matrix-matrix multiplication algorithm New to Julia question , performance , linearalgebra	5	420	July 9, 2024
[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays Package Announcements performance , blas	32	5785	July 5, 2020
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	2910	July 16, 2018
Parallel matrix col/row operations give incorrect results General Usage bug , parallel	8	220	July 29, 2024

Multiplication of `BitMatrix`s for linear algebra modulo 2

Related topics