I think that for floats, BLAS is used, while for integers it is native Julia code. The latter could probably be made faster, but it is not a common use case so it is waiting for someone to do it.
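If you want to see the gap on your own machine, here is a minimal benchmark sketch (using BenchmarkTools; the sizes are arbitrary and the ratio will vary with CPU and BLAS build):

```julia
using BenchmarkTools, LinearAlgebra

n = 512
Af, Bf = rand(n, n), rand(n, n)              # Float64 matrices: * goes through BLAS (gemm)
Ai, Bi = rand(1:10, n, n), rand(1:10, n, n)  # Int matrices: * hits the generic Julia fallback

@btime $Af * $Bf;   # BLAS path
@btime $Ai * $Bi;   # generic fallback, typically several times slower
```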
Do you think it would be worth it for mixed-type matmul to convert the arguments before multiplying? I think that should speed things up a lot (with some memory downsides). The other big thing the fallback needs is better cache-aware looping.
No, I would not convert. First, integers in Julia have specific overflow semantics that differ from floats, so I am not sure what is intended and what isn’t.
Second (and more importantly), you really have to go out of your way to get a matrix with a non-concrete element type when writing idiomatic code, so I am not sure it is a common use case. I would leave it up to the user to promote if that is needed.
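For completeness, if someone does want the BLAS speed for an integer problem, the promotion is easy to do explicitly. A sketch (my own illustration, not anything the fallback does for you):

```julia
A = rand(1:10, 256, 256)   # Matrix{Int}
B = rand(1:10, 256, 256)

# Explicit promotion: pay for two Float64 copies, get the BLAS kernel in exchange.
C = float(A) * float(B)

# Round back if an integer result is wanted. Roughly speaking this is exact only
# while the values involved stay below 2^53 in magnitude; beyond that the float
# path and the wrap-around Int fallback can legitimately disagree, which is the
# overflow-semantics point above.
Cint = round.(Int, C)
```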
So Int is about 3.5x slower than Float for me, while it is 5.3x slower for you.
With integers, it uses the vpmullq instruction for integer multiplication. But this instruction appears to be slow, with a reciprocal throughput of around 1.5-3, while the vpaddq instruction is around 0.33 or 0.5.
The floating-point versions use fused multiply-add instructions, which combine the multiplication and the addition and have a reciprocal throughput of about 0.5.
You can think of “reciprocal throughput” as the number of clock cycles per completed instruction when a core is executing many of them simultaneously. Any single instruction generally takes quite a few more cycles to complete (e.g., 4 for the fma instructions), but because a core can work on many at once, the rate at which they are completed can be much higher.
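Just as a back-of-envelope check (my arithmetic, not a measurement): if the integer inner loop is limited by vpmullq and the float loop by fma, the quoted throughput numbers already roughly predict the slowdown:

```julia
# Reciprocal throughputs quoted above (cycles per instruction, sustained)
rt_vpmullq = (1.5, 3.0)   # integer multiply
rt_fma     = 0.5          # float fused multiply-add

# Crude throughput-limited estimate, ignoring vpaddq, loads, etc.
slowdown = rt_vpmullq ./ rt_fma
# => (3.0, 6.0), which brackets the observed 3.5x and 5.3x reasonably well
```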
Realize that implementing a highly optimized matrix–matrix multiplication is nontrivial. Optimized BLAS libraries typically involve tens of thousands of lines of code and painstaking performance tuning. While there is no theoretical reason why this cannot be replicated in Julia, it is a huge undertaking.
Thanks, interesting. And is the reason for this difference in the first place perhaps just that there’s more demand for vectorised floating point, which justifies spending a lot of silicon on it?