Parallel computing with *

Definitely not. You only get BLAS acceleration for floating-point types; for integer element types, * falls back to Julia's generic (non-BLAS, single-threaded) matrix-multiplication routine.
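
One quick way to convince yourself of this (my sketch, not from the benchmark below): change the BLAS thread count and observe that only the floating-point timing responds.

using LinearAlgebra, BenchmarkTools

Mi = rand(0:2, 500, 500)   # integer matrix: generic fallback
Mf = Float64.(Mi)          # float matrix: BLAS gemm

BLAS.set_num_threads(1)
@btime $Mf * $Mf';         # single-threaded BLAS
BLAS.set_num_threads(Sys.CPU_THREADS)
@btime $Mf * $Mf';         # multithreaded BLAS, noticeably faster
@btime $Mi * $Mi';         # unaffected by the BLAS thread count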

I would just convert it to Matrix{Float32} (which is exact for integers up to 2^24 = 16777216) or Matrix{Float64} (which is exact for integers up to 2^53 = 9007199254740992). These will benefit from optimized multithreaded BLAS, SIMD, etcetera.
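
You don't have to memorize those thresholds, by the way; Base's maxintfloat reports them:

julia> maxintfloat(Float32)
1.6777216f7

julia> maxintfloat(Float64)
9.007199254740992e15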

For your matrix sizes, Float32 is almost 200x faster on my machine:

julia> M = rand(0:2, 2000,50000);

julia> using BenchmarkTools;

julia> @btime $M * $M';
  126.120 s (5 allocations: 30.55 MiB)

julia> A = Matrix{Float32}(M);

julia> M * M' == A * A'  # Float32 is exact for integers of this size
true

julia> @btime $A * $A';
  688.669 ms (2 allocations: 15.26 MiB)
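
If you need an integer matrix back at the end, the conversion is lossless here (a sketch; it assumes, as in this example, that every entry stays below 2^24 — each entry of M * M' is at most 4 × 50000 = 200000):

B = A * A'
Bi = Int.(B)   # exact round-trip; throws an InexactError if any entry were not an integer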