Why is there such a performance difference between multiplying float matrices and other kinds of matrices? I understand float matrices hit BLAS which is heavily optimized. But is this a fundamental difference or or just that we are lacking a “BLAS for boolean matrices” for example?

Can we expect that in the future with a generic Julia implementation of BLAS this difference will disappear?

Is there a package one can use now to speedup boolean matrices?

As suggested by @antoine-levitt, it is interesting to look at the non-BLAS float matmul currently implemented in Julia.

Interesting this is also quite faster than the boolean matrix matmul. Is there some intrinsic difficulty in multiplying against boolean matrices?

Notice that a fast boolean matrix matmul could be useful in deep learning for one-hot represented data.
If the one-hot data could be represented as BitArray the memory usage would be quite smaller.

Are you interested in a Matmul with an addition where 1+1 = 1 or 1+1=0?
The performance difference comes from indexing into a BitArray which involves slicing to get Int with the correct value. It could be implement a lot faster by bit level boolean operation by and-ing the two vectors and then either checking whether it is non-zero (when 1+1=1) or checking the last bit of the intrinsic popcount (when you want 1+1=0).