# Multiplication of BitMatrixs for linear algebra modulo 2

When dealing with matrices over \mathbb{Z}_2 that are enormours (e.g. of size 10^5\times10^5), it is desirable to use implementations that minimize the allocated memory. For instance:

julia> using FiniteFields
julia> sizeof(rand(FiniteFields.FFE{2}.(0:1),1000,1000))
4000000
julia> sizeof(bitrand(1000,1000))
125000


The latter offers a 32x saveup of space. Hence I would like to implement an efficient matrix-multiply function. My attempt:

using Random, Polyester, LoopVectorization, Tullio
function mult(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
l, m = size(x)
m_, n = size(y)
@assert m==m_
z = falses(l, n)
for i=1:l
for k=1:n
for j=1:m
z[i,k] β»= x[i,j] & y[j,k]
end
end
end
return z
end

julia> x, y = (bitrand(10^3,10^3) for _=1:2)
julia> @time z=mult(x,y);
11.132636 seconds (3 allocations: 122.219 KiB)


Not very fast. Any way to improve?

I know that parallelizing on BitArrays is not thread-safe. This is written in its docstring:  Due to its packed storage format, concurrent access to the elements of a BitArray where at least one of them is a write is not thread-safe.

When I tried to put @simd or LoopVectorization.@turbo before the for i=1:l loop, I get errors.

An alternative implementation is Matrix{Bool}, which requires 8x more space.

function mult(x ::Matrix{Bool}, y ::Matrix{Bool}) ::Matrix{Bool}
l, m = size(x)
m_, n = size(y)
@assert m==m_
z = fill(false, l, n)
Polyester.@batch for i=1:l
for k=1:n
for j=1:m
z[i,k] β»= x[i,j] & y[j,k]
end
end
end
return z
end

julia> x, y = (rand(Bool,10^3,10^3) for _=1:2)
julia> @time z=mult(x,y);
0.439341 seconds (10 allocations: 976.922 KiB)


Here, using LoopVectorization.@tturbo for i= or Tullio.@tullio z[i,k] β»= x[i,j] & y[j,k] causes errors.

One would think that binary arithmetic is simplest and therefore fastest. Yet I am running out of ideas on how to speed this up. For comparison, multiplying Floats:

julia> x,y = (randn(n,n) for _=1:2);
julia> @time x*y;
0.015831 seconds (2 allocations: 7.629 MiB)


P.S. Is there any hope that Octavian.jl would allow Bool entries with operations +=xor and *=&?

So many ways.

The key to operating efficiently on a BitArray is to work directly with the underlying packed storage in matrix.chunks, which lets you do operations on 64 bits at a time. (This will also let you parallelize safely.)

Once you implement the packed-optimized matrix multiply, then you can start thinking about all of the other matmult optimizations, from cache blocking and SIMD to parallelization.

2 Likes

A middle ground optimization can save some space and gain some speed:

julia> x, y = (rand((zero(UInt8),one(UInt8)),10^3,10^3) for _=1:2);

julia> sizeof(x)
1000000

julia> @btime ($x*$y .& true);
65.374 ms (8 allocations: 1.94 MiB)


(your timing may vary depending on machine)

The idea is to use UInt8 which takes 8x the memory of a BitMatrix, but not 32-bits which the FiniteFields method uses.

The bit matrix multiply is calculated using the embedding of Z_2 as the least significant bit of UInt8 arithmetic.

Itβs not a maximally optimized solution, but it looks pretty simple in code.

1 Like

Hereβs a pretty bad implementation of the optimized BitMatrix version

function mult2(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
l, m = size(x)
m_, n = size(y)
@assert m==m_
z = falses(l, n)
xt = BitMatrix(x')
colsy = BitVector.(eachcol(y))
colx = falses(m)
@inbounds for i=1:l
colx .= @view xt[:,i]
for k in 1:n
zik = 0
for j in eachindex(colx.chunks)
zik β»= colsy[k].chunks[j] & colx.chunks[j]
end
z[i,k] = count_ones(zik) %2
end
end
return z
end


I benchmark this at roughly 300x faster than the @Leo_Iβs BitMatrix mult, and ~6x faster than @Danβs.

It would probably be a better to swap the loops and use BitVector.(eachrow(x)) rather than transposing x and BitVector.(eachcol(y)), but I think this makes the point.

3 Likes

Iβd do it the following way: First, you pad the relevant matrix sizes to multiples of 64. Then

julia> using LinearAlgebra, Random, BenchmarkTools

julia> function bitmulAndTranspose(A, B)
res = falses((size(A,1), size(B,2)))
AT = falses((size(A,2), size(A,1)))
transpose!(AT, A)
bitmul(res, AT, B)
res
end
bitmulAndTranspose (generic function with 1 method)

julia> function bitmul(res, AT, B)
(size(B,1) % 64 == 0 && size(AT,1) == size(B,1) && size(res) == (size(AT,2), size(B,2))) || begin
@show size(AT), size(B), size(res)
throw("dim mismatch")
end
ATc = reshape(AT.chunks, (cld(size(AT, 1), 64), size(AT,2)))
Bc = reshape(B.chunks, (cld(size(B, 1), 64), size(B,2)))
#here you can do blocking / recursion / hilbert-curve / multithreading. Make it cache oblivious and make all cores busy.
@inbounds for i in 1:size(AT, 2)
for j in 1:size(B,2)
r=0
@simd for k = 1:size(ATc, 1)
r += count_ones(ATc[k, i] & Bc[k, j])
end
res[i, j] = isodd(r)
end
end
end
bitmul (generic function with 1 method)

julia> B0=bitrand(64, 3); A0=bitrand(4, 64); reference = isodd.(A0*B0); reference == bitmulAndTranspose(A0,B0)
true
julia> basemul(A,B)=isodd.(A*B);
julia> N=1<<10; B0=bitrand(N,N); A0=bitrand(N,N);
julia> @benchmark basemul($A0,$B0)
BenchmarkTools.Trial: 18 samples with 1 evaluation.
Range (min β¦ max):  285.791 ms β¦ 289.420 ms  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     287.569 ms               β GC (median):    0.00%
Time  (mean Β± Ο):   287.544 ms Β±   1.063 ms  β GC (mean Β± Ο):  0.02% Β± 0.04%

β      β β  β   β       β β      ββ    βββ     ββ    β      β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
286 ms           Histogram: frequency by time          289 ms <

Memory estimate: 8.14 MiB, allocs estimate: 9.
julia> @benchmark bitmulAndTranspose($A0,$B0)
BenchmarkTools.Trial: 774 samples with 1 evaluation.
Range (min β¦ max):  6.095 ms β¦   9.673 ms  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     6.431 ms               β GC (median):    0.00%
Time  (mean Β± Ο):   6.460 ms Β± 226.942 ΞΌs  β GC (mean Β± Ο):  0.03% Β± 0.36%

βββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
6.1 ms          Histogram: frequency by time        7.01 ms <

Memory estimate: 256.38 KiB, allocs estimate: 10.

julia> function nobitmulAndTranspose(A, B)
res = falses((size(A,1), size(B,2)))
AT = falses((size(A,2), size(A,1)))
transpose!(AT, A)
#bitmul(res, AT, B)
res
end
julia> @benchmark nobitmulAndTranspose($A0,$B0)
BenchmarkTools.Trial: 8794 samples with 1 evaluation.
Range (min β¦ max):  548.395 ΞΌs β¦  1.398 ms  β GC (min β¦ max): 0.00% β¦ 32.02%
Time  (median):     561.562 ΞΌs              β GC (median):    0.00%
Time  (mean Β± Ο):   567.101 ΞΌs Β± 35.237 ΞΌs  β GC (mean Β± Ο):  0.35% Β±  2.80%

βββ  βββββββββ                                               β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
548 ΞΌs        Histogram: log(frequency) by time       678 ΞΌs <

Memory estimate: 256.19 KiB, allocs estimate: 6.
julia> function mult2(x ::BitMatrix, y ::BitMatrix) ::BitMatrix
l, m = size(x)
m_, n = size(y)
@assert m==m_
z = falses(l, n)
xt = BitMatrix(x')
colsy = BitVector.(eachcol(y))
colx = falses(m)
@inbounds for i=1:l
colx .= @view xt[:,i]
for k in 1:n
zik = 0
for j in eachindex(colx.chunks)
zik β»= colsy[k].chunks[j] & colx.chunks[j]
end
z[i,k] = count_ones(zik) %2
end
end
return z
end
mult2 (generic function with 1 method)

julia> @benchmark mult2($A0,$B0)
BenchmarkTools.Trial: 365 samples with 1 evaluation.
Range (min β¦ max):  13.225 ms β¦  15.297 ms  β GC (min β¦ max): 0.00% β¦ 7.62%
Time  (median):     13.599 ms               β GC (median):    0.00%
Time  (mean Β± Ο):   13.721 ms Β± 390.461 ΞΌs  β GC (mean Β± Ο):  0.84% Β± 2.31%

βββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
13.2 ms         Histogram: frequency by time         14.9 ms <

Memory estimate: 4.66 MiB, allocs estimate: 3081.



The julia implementation of transpose! for bit-matrices is surprisingly good! I was very close to passing on this question because doing fast bitmatrix transpose is very annoying, but itβs batteries included for that.

PS. Oh, shiny trick in mult2: Accumulate with xor instead of +count_ones, only do the POPCNT at the end.

Alas, thatβs slower on my machine. It should not be slower (popcount is hard to vectorize on x86, the built-in instruction only exists for 64bit words, not for big fat 256 or 512 bit registers). This is very very wrong and indicates that llvm messes up somewhere. You may need to write the inner loop by hand, using SIMD.jl

1 Like

Running the above solutions for the larger case of 16384\times16384 has timings:

julia> begin n=2^14
x,y = (bitrand(n,n) for _=1:2);
@time isodd.(x*y);
@time mult2(x,y);
@time bitmulAndTranspose(x,y);
end
7856.312863 seconds (20 allocations: 2.031 GiB, 0.00% gc time)
166.081837 seconds (49.17 k allocations: 165.628 MiB, 0.07% gc time)
175.128970 seconds (19 allocations: 64.001 MiB)


So far, the most efficient solution seems to be Oscar_Smithβs. Iβm leaving the thread open, perhaps there are even faster solutions. For comparison, multiplying matrices of same size consisting of Float64βs needs:

julia> n=2^14;   x,y = (randn(n,n) for _=1:2);
julia> @time x*y;
75.166939 seconds (2 allocations: 2.000 GiB, 0.01% gc time)


to get it actually fast for big matrices, you would have to implement blocking and multithreading. Multithreading should be pretty easy here. Blocking would be more annoying.

I have a small package for linear algebra mod 2, called Modulo2.jl. With this package, matrix multiplication mod 2 is actually faster than with floats:

julia> n = 2^14; a = Modulo2.randomarray(n, n); b = Modulo2.randomarray(n, n); @time a * b;
35.638093 seconds (3 allocations: 32.000 MiB)

julia> n = 2^14; a = rand(n, n); b = rand(n, n); @time a * b;
43.566516 seconds (2 allocations: 2.000 GiB, 0.27% gc time)


This seems to be about 2.8 times faster than @Oscar_Smithβs (quickly written) solution.

The package provides a type ZZ2 for integers mod 2 and ZZ2Array (ZZ2Vector, ZZ2Matrix) for packed, padded arrays. Apart from basic matrix operations it has functions related to Gaussian elimination (rref, rank, inv, det) as well as some support for broadcasting.

The package is not yet in the registry, so one needs to install it via

pkg> add https://github.com/matthias314/Modulo2.jl


So far, the only documentation is via docstrings.

UPDATE: Modulo2.jl is now available from the registry.

2 Likes

This is very cool, but isnβt ZZ2Array just the same as BitArray? From what I can tell, the speed gain here comes from Modulo2.jl/src/Modulo2.jl at 9450574e928ff11ba6a6a64499ce6f4457688582 Β· matthias314/Modulo2.jl Β· GitHub which is manually vectorizing the colwise xor.

I see two differences:

• The elements of a ZZ2Array are of type ZZ2 instead of Bool, so that one gets the arithmetic operations via + and * instead of xor and &.

• The size along the first axis of a ZZ2Array is padded to (currently) a multiple of 256 to make vectorized operations more efficient. There is no padding at all in a BitArray, not even to a byte boundary. For column operations (the internal function xor! you mentioned) this makes a big difference.

Of course, this is not rocket science, but the package has already been quite useful for me, and it might be for others, too.

1 Like

With this package, matrix multiplication mod 2 is actually faster than with floats:

That is quite impressive!!! Thank you!