Why is the matrix multiplication with integer matrices much slower than with float ones?

The following is a simple function for the test.

using LinearAlgebra, BenchmarkTools

function vec_prod(n)
    vec = collect(1:n)                 # integer vector 1..n
    mat = Array{Float64}(undef, n, n)

    mat .= vec .* vec'                 # outer product

    matt1 = mat .* mat                 # elementwise product
    matt2 = mat * mat                  # matrix product

    return dot(matt1, matt2)
end

@btime vec_prod(50)
14.800 μs (7 allocations: 59.31 KiB)
1.1343603629882813e17

If I change it to mat = Array{Int64}(undef, n, n), I get
59.800 μs (10 allocations: 90.06 KiB)
113436036298828125

From my investigation, I found that the bottleneck in the {Int64} case is matt2 = mat * mat. If I remove the matrix multiplication, e.g. by replacing that line with matt2 = matt1, I get:
{Float64}
4.571 μs (5 allocations: 39.70 KiB)
4.312110892222225e15
{Int64}
3.786 μs (5 allocations: 39.70 KiB)
4312110892222225
Now the {Int64} case is faster.

I’ve also tested pre-allocating matt2 and/or using mul! from LinearAlgebra; the results were the same.
Why is matrix multiplication with integer matrices so much slower than with float ones? It looks like some kind of bug.

The reason is that floating-point matmul dispatches to BLAS, while the integer one uses a generic pure-Julia fallback method. The fallback could always be made faster, though.
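To get a feel for the two code paths, here is a minimal benchmark sketch. The naive_mul! loop below is my own illustration, not the actual fallback (LinearAlgebra's generic fallback is tiled and somewhat faster), but both are pure Julia, unlike the BLAS path:

using LinearAlgebra, BenchmarkTools

# A naive triple-loop matmul for scale: pure Julia, no BLAS.
function naive_mul!(C, A, B)
    @inbounds for j in axes(B, 2), i in axes(A, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end

n = 200
Af = rand(n, n);       Bf = rand(n, n);       Cf = similar(Af)
Ai = rand(1:10, n, n); Bi = rand(1:10, n, n); Ci = similar(Ai)

@btime mul!($Cf, $Af, $Bf)        # Float64: dispatches to BLAS gemm
@btime mul!($Ci, $Ai, $Bi)        # Int64: generic pure-Julia fallback
@btime naive_mul!($Ci, $Ai, $Bi)  # naive loop, for comparison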


Try Octavian or Tullio. I guess they’re faster than whatever’s in the stdlib.

https://juliahub.com/ui/Packages/Tullio/PIgzC/0.3.5

https://juliahub.com/ui/Packages/Octavian/RyOyv/0.3.24
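For reference, here's a minimal Tullio sketch of an integer matmul (this uses Tullio's standard einsum-style notation; loading LoopVectorization alongside it usually speeds up the generated kernels):

using Tullio

A = rand(1:10, 100, 100)
B = rand(1:10, 100, 100)

@tullio C[i, j] := A[i, k] * B[k, j]  # := allocates C; the repeated index k is summed over
C == A * B                            # true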


To add a bit more explanation: decades of work have gone into optimizing floating-point matrix multiplication, because this is the most important case in most applications, resulting in BLAS libraries that approach the theoretical peak performance of your CPU for large matrices. Julia and most other languages/libraries (e.g. NumPy in Python, R, Matlab) call one of these BLAS libraries for multiplying floating-point matrices (and as a consequence, operations on large dense floating-point matrices run at basically the same speed in all languages).

Matrix multiplication for other numeric types is not nearly so well optimized; this is true in both Julia and other languages (e.g. NumPy). There are native Julia libraries like Octavian (mentioned by @nsajko above) that can generate heavily optimized matrix products for a wider variety of numeric types.
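A minimal Octavian usage sketch (matmul allocates its result; the in-place matmul! shows up in the benchmarks further down):

using Octavian

A = rand(-1_000:1_000, 200, 200)
B = rand(-1_000:1_000, 200, 200)

C = Octavian.matmul(A, B)  # multithreaded, pure-Julia matmul; works for Int64
C == A * B                 # true: agrees with the generic fallback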

However, in most applications, you should probably simply use floating-point matrices for linear algebra. There are basically two cases:

  1. If all of the integers involved (including intermediate results) have magnitude < 2^53, then integer multiplications/additions are exact in Float64 arithmetic, so floating-point matrix multiplication will give exactly the same results as integer matrix multiplication, just more quickly.
  2. If you generate values > 2^53, then you risk integer overflow with Int and might get undesired results like (2^40)^2 == 0 (unless you switch to a wider integer type, which is even slower). Float64 quantities will incur roundoff errors in this case, but the errors are much smaller and better behaved than integer overflow. Both cases are illustrated in the snippet below.
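A quick REPL illustration of both regimes:

julia> 2.0^53 + 1 == 2.0^53  # beyond 2^53, Float64 cannot distinguish adjacent integers
true

julia> (2^40)^2              # Int64 silently wraps around: 2^80 mod 2^64 == 0
0

julia> (2.0^40)^2            # Float64 is exact here, since 2^80 is a power of two
1.2089258196146292e24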

I think integers are generally best for indexing, for when you want modular arithmetic, and for when you want to play with bits (like in a random number generator).

If you want to do arithmetic on integers, Float64 will often be a better choice. As said above, it is exact for all integer values of magnitude ≤ 2^53, so rounding isn’t a concern.

But what if you do want to check for it?
Credit to @c42f for much of this example:

# Floating-point exception flags from C's fenv.h
# (fegetexcept/feenableexcept are GNU extensions, so this is glibc/Linux-specific):
const FE_INVALID = 0x1
const FE_DIVBYZERO = 0x4
const FE_OVERFLOW = 0x8
const FE_UNDERFLOW = 0x10
const FE_INEXACT = 0x20

# Query which floating-point exceptions currently trap:
fpexceptions() = ccall(:fegetexcept, Cint, ())

# Run f with the given exceptions enabled, restoring the previous mask afterwards:
function setfpexceptions(f, mode)
  prev = ccall(:feenableexcept, Cint, (Cint,), mode)
  try
    f()
  finally
    ccall(:fedisableexcept, Cint, (Cint,), mode & ~prev)
  end
end

# Trap on any inexact (rounding) floating-point operation while running f:
fpinexactexceptions(f::F) where {F} = setfpexceptions(f, FE_INEXACT)

x::Float64 = 1e12
fpinexactexceptions() do
  (x + 65.0) - x
end
x = 1e18
fpinexactexceptions() do
  (x + 65.0) - x
end

I get

julia> x::Float64 = 1e12
1.0e12

julia> fpinexactexceptions() do
         (x + 65.0) - x
       end
65.0

julia> x = 1e18
1.0e18

julia> fpinexactexceptions() do
         (x + 65.0) - x
       end
ERROR: DivideError: integer division error
Stacktrace:
 [1] +(x::Float64, y::Float64)
   @ Base ./float.jl:409 [inlined]
 [2] (::var"#3#4")()
   @ Main ./REPL[12]:2 [inlined]
 [3] setfpexceptions(f::var"#3#4", mode::UInt8)
   @ Main ./REPL[7]:4
 [4] fpinexactexceptions(f::var"#3#4")
   @ Main ./REPL[8]:1
 [5] top-level scope
   @ REPL[12]:1

“DivideError: integer division error” isn’t exactly a good error message. I think integer division is the only exception enabled by default, so when any floating-point exception gets triggered (the kernel delivers SIGFPE), that’s the message we see.

But here, we know that the actual cause was

julia> (x + 65.0) - x # oops
128.0

which was cheaply and conveniently caught!

Turning on exceptions is cheap: you only have to do it once, around the computation that must be exact.
And that’s it; the computations themselves run at full speed. No exception means no overflows, and you can have some trust in the answers.
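For instance, here is a hypothetical guard around an integer-valued Float64 product like the ones discussed above (one caveat: the trap mask is per-thread, so a multithreaded BLAS call may not raise the exception in its worker threads):

A = Float64.(rand(-1_000:1_000, 50, 50))

C = fpinexactexceptions() do
    A * A  # throws if any intermediate result rounds, i.e. exceeds 2^53
end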


But as the author of Octavian, I may as well add a benchmark using this PR branch:

julia> @time using Octavian
  0.246463 seconds (398.32 k allocations: 20.927 MiB, 4.20% compilation time)

julia> A = rand(-1_000:1_000, 200, 200);

julia> B = rand(-1_000:1_000, 200, 200);

julia> C = similar(A);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 420 samples with 1 evaluation.
 Range (min … max):  2.340 ms …  2.555 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.394 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.378 ms ± 29.314 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃█▃▃▂                         ▂█▆▇▅▄
  ▄██████▅▅▅▃▂▂▂▃▁▂▃▁▁▁▁▁▁▁▁▁▂▂▃▃▅███████▅▄▄▂▁▃▂▁▁▄▃▁▁▁▁▁▁▁▃ ▄
  2.34 ms        Histogram: frequency by time        2.44 ms <

 Memory estimate: 30.77 KiB, allocs estimate: 4.

julia> @benchmark matmul!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  90.901 μs … 112.142 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.696 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.340 μs ±   1.490 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▁▆█▆▂
  ▂▂▂▃▄▅▅▄▃▂▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▃▅█████▇▄▃▂▂▂▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃ ▃
  90.9 μs         Histogram: frequency by time         97.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 2.369e-3 / 91.302e-6
25.946857681102276

julia> Af64 = Float64.(A); Bf64 = Float64.(B); Cf64 = similar(Af64);

julia> @benchmark mul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.839 μs … 34.865 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.278 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.407 μs ±  1.052 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▆█▅▁
  ▂▂▃▅█████▆▅▃▂▂▂▂▂▁▁▁▂▂▁▁▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  14.8 μs         Histogram: frequency by time        18.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.794 μs …  33.747 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.623 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.702 μs ± 534.346 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▃▄▆▇▇█▇▆▄▂▁
  ▂▂▂▂▃▄▄▅▇████████████▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  23.8 μs         Histogram: frequency by time           27 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Cf64 == C
true

Apparently Octavian ramps its thread usage up more slowly than MKL does. It does better single-threaded at this size:

julia> using LinearAlgebra; BLAS.set_num_threads(1);

julia> @benchmark mul!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 4972 samples with 1 evaluation.
 Range (min … max):  186.730 μs … 278.905 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.794 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   196.320 μs ±   8.450 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▅                                ▂▆▂
  ▅██▆▂▁▂▄▅▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃███▃▂▂▂▄▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  187 μs           Histogram: frequency by time          214 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul_serial!($Cf64, $Af64, $Bf64)
BenchmarkTools.Trial: 5877 samples with 1 evaluation.
 Range (min … max):  163.874 μs … 232.942 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     164.972 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   166.591 μs ±   5.089 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█▆▂  ▄▅▄▁                          ▂▂                       ▁
  █████▆█████▅▄▁▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▅███▇▅▅▅▇█▇▆▄▅▅▃▆▃▅▅▄▅▆▄▃▅ █
  164 μs        Histogram: log(frequency) by time        190 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

For integers, some of the performance benefit of course also comes from multithreading:

julia> @benchmark matmul_serial!($C, $A, $B)
BenchmarkTools.Trial: 1365 samples with 1 evaluation.
 Range (min … max):  672.498 μs … 839.838 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     738.069 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   728.982 μs ±  28.663 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▅                         ▆█▁
  ██▁▁▁▁▃▁▅▄▃▅▄▃▄▅▄▃▄▁▄▃▅▄▄▆▆███▆▆▅▆▇▇▅▃▄▃▃▄▃▅▃▃▄▅▃▁▃▄▃▁▅▅▆▃▃▅▅ █
  672 μs        Histogram: log(frequency) by time        814 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Does MKL’s multithreaded advantage for Float64 above simply come from it using more threads than Octavian? We can check with LinuxPerf:

julia> foreachf!(f::F, N, args::Vararg{Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> using LinuxPerf

julia> @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(mul!, 100000, Cf64, Af64, Bf64)
       end
  2.814530 seconds (33.44 k allocations: 2.648 MiB, 0.49% gc time, 0.69% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.38e+11   60.0%  #  2.8 cycles per ns
┌ instructions             2.37e+11   60.0%  #  1.7 insns per cycle
│ branch-instructions      1.67e+10   60.0%  #  7.1% of insns
└ branch-misses            1.36e+07   60.0%  #  0.1% of branch insns
┌ task-clock               4.96e+10  100.0%  # 49.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.91e+10   20.0%  # 21.0% of dcache loads
│ L1-dcache-loads          9.10e+10   20.0%
└ L1-icache-load-misses    7.22e+07   20.0%
┌ dTLB-load-misses         4.70e+03   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               9.09e+10   20.0%
┌ iTLB-load-misses         4.39e+05   40.0%  # 54.7% of iTLB loads
└ iTLB-loads               8.03e+05   40.0%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @time @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
       foreachf!(matmul!, 100000, Cf64, Af64, Bf64)
       end
  2.483556 seconds (33.44 k allocations: 2.648 MiB, 0.77% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.22e+11   60.0%  #  2.8 cycles per ns
┌ instructions             1.87e+11   60.0%  #  1.5 insns per cycle
│ branch-instructions      6.60e+09   60.0%  #  3.5% of insns
└ branch-misses            3.50e+07   60.0%  #  0.5% of branch insns
┌ task-clock               4.39e+10  100.0%  # 43.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.74e+10   20.0%  # 28.3% of dcache loads
│ L1-dcache-loads          6.14e+10   20.0%
└ L1-icache-load-misses    1.80e+06   20.0%
┌ dTLB-load-misses         5.85e+02   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.14e+10   20.0%
┌ iTLB-load-misses         4.31e+03   40.0%  # 32.9% of iTLB loads
└ iTLB-loads               1.31e+04   40.0%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Apparently not. Both used 18 threads, but Octavian required less wall time, less CPU time, and fewer instructions, though it did have about 2.5x as many branch misses.
For larger arrays, MKL does better; Octavian’s kernels have a problem that limits how well they scale to huge matrices.
