Flux vs PyTorch CPU performance

PyTorch doesn’t support running on (Intel) iGPUs or OpenCL devices anyhow (it does have ROCm support, but that’s Linux only)


00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (Whiskey Lake) (rev 02)

I don’t think PyTorch uses that.

Can it be that Torch does a sort of lazy calculation?

I guess you are right, sorry for the bad clue. Does the CPU usage look the same (all cores at 100%)?

I suspect that’s because tanh is much slower than the matrix multiplication here (quoting @d1cker’s benchmarks):

julia> @btime $ww * $xx;
  21.679 μs (2 allocations: 125.08 KiB)

julia> @btime tanh.($xx);
  1.256 ms (2 allocations: 250.08 KiB)

julia> @btime @avx tanh.($xx);
  142.913 μs (2 allocations: 250.08 KiB)

Julia is mostly single-threaded here, because @avx tanh.(x) is single-threaded. I’ll add a vmapnt to LoopVectorization soon to make it easier to run this multithreaded.
There is currently a vmapntt, which is threaded, but it uses non-temporal stores that are likely to hurt performance unless the batches are very large.
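In the meantime, here is a minimal sketch of threading the activation by hand (threaded_tanh! is a hypothetical helper, not a LoopVectorization function, and it assumes Julia was started with multiple threads):

using LoopVectorization, Base.Threads

# Hypothetical helper: split the tanh evaluation across threads by column,
# leaving the SIMD work within each column to @avx.
function threaded_tanh!(y::AbstractMatrix, x::AbstractMatrix)
    @threads for j in axes(x, 2)
        @avx for i in axes(x, 1)
            y[i, j] = tanh(x[i, j])
        end
    end
    return y
end

x = randn(Float32, 8, 1000);
y = similar(x);
threaded_tanh!(y, x);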

TL;DR: ignoring tanh and setting Julia’s BLAS thread count to the number of physical cores, Flux actually edges out PyTorch!

After testing on a processor of a similar vintage, I’m assuming @d1cker’s CPU has SMT enabled:

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

PyTorch uses 4 threads by default. I assume this is calculated using the number of physical cores:

In [3]: xx = torch.randn(1000, 8, dtype=torch.float32)                                                             

In [4]: nn = torch.nn.Sequential( 
  ...:     torch.nn.Linear(8, 64), 
  ...:     torch.nn.Linear(64, 32), 
  ...:     torch.nn.Linear(32, 2) 
  ...: )                                                                                                          

In [5]: %timeit nn(xx)[0,0]                                                                                        
115 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: torch.get_num_threads()                                                                                    
Out[6]: 4

In [7]: nn2 = torch.nn.Sequential( 
  ...:     torch.nn.Linear(8, 64), 
  ...:     torch.nn.Tanh(), 
  ...:     torch.nn.Linear(64, 32), 
  ...:     torch.nn.Tanh(), 
  ...:     torch.nn.Linear(32, 2), 
  ...:     torch.nn.Tanh() 
  ...: )                                                                                                          

In [8]: %timeit nn2(xx)[0,0]                                                                                       
239 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Julia uses 8 threads, which I assume comes from the logical core count (2x4):

julia> xx = randn(Float32, 8, 1000)
8×1000 Array{Float32,2}:
  0.503908   1.18022   -1.06559     1.84436    …  -0.46637   -0.876909   0.82367    0.706861   0.440678
  1.44368   -1.49167   -0.0692981   0.174324      -1.43919    1.38345    0.667294   1.19638   -0.097154
 -0.151144  -0.264827  -1.69988     1.00654        0.402308   1.49821    0.123561   1.09967   -1.00765
 -1.14175   -0.231032   0.110278    0.0363431      0.101058   0.372235  -0.511466   0.526048  -1.4937
  0.614555  -0.713141  -2.32049     0.608342       0.376743   1.15385   -0.235271  -0.220544   1.01734
  1.6772    -0.538957  -1.28689    -2.1525     …  -0.679366   0.269263  -0.902019  -1.71012   -1.95944
  0.318371  -0.838747  -0.158875    0.407624       0.682344   0.293501  -0.964282  -0.795304  -0.719654
 -0.831405  -0.36298   -0.341576   -1.46556        0.927488  -0.485381   0.170069  -1.17026   -0.617091

julia> const nn = Chain(Dense(8,64),
                       Dense(64,32),
                       Dense(32,2))
Chain(Dense(8, 64), Dense(64, 32), Dense(32, 2))

julia> @benchmark nn($xx)
BenchmarkTools.Trial: 
  memory estimate:  766.19 KiB
  allocs estimate:  10
  --------------
  minimum time:     124.091 μs (0.00% GC)
  median time:      152.492 μs (0.00% GC)
  mean time:        170.617 μs (9.51% GC)
  maximum time:     1.460 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> import LinearAlgebra.BLAS

julia> BLAS.set_num_threads(4)

julia> @benchmark nn($xx)
BenchmarkTools.Trial: 
  memory estimate:  766.19 KiB
  allocs estimate:  10
  --------------
  minimum time:     90.263 μs (0.00% GC)
  median time:      100.746 μs (0.00% GC)
  mean time:        112.331 μs (9.46% GC)
  maximum time:     836.423 μs (75.47% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> const nn2 = Chain(Dense(8,64,tanh),
                       Dense(64,32,tanh),
                       Dense(32,2,tanh))
Chain(Dense(8, 64, tanh), Dense(64, 32, tanh), Dense(32, 2, tanh))

julia> @benchmark nn2($xx)
BenchmarkTools.Trial: 
  memory estimate:  766.19 KiB
  allocs estimate:  10
  --------------
  minimum time:     1.575 ms (0.00% GC)
  median time:      1.589 ms (0.00% GC)
  mean time:        1.602 ms (0.68% GC)
  maximum time:     2.335 ms (27.41% GC)
  --------------
  samples:          3119
  evals/sample:     1

Not sure why SMT has such a large performance impact…


You could also try MKL.jl. MKL won’t use more threads than the number of physical cores, and performs better for small matrices in general.
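For reference, a rough sketch of what that looks like (assuming a recent MKL.jl, where simply loading the package swaps the BLAS backend; the MKL.jl releases of this thread’s vintage instead rebuilt the Julia system image when installed):

using MKL                     # recent MKL.jl: loading it swaps the BLAS backend to MKL
import LinearAlgebra.BLAS

BLAS.get_config()             # Julia 1.7+: should now report MKL (older Julia used BLAS.vendor())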


I set the number of threads that PyTorch uses to 1:

using PyCall
using Flux
using BenchmarkTools

torch = pyimport("torch")
torch.set_num_threads(1)

NN = torch.nn.Sequential(
    torch.nn.Linear(8, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),
    torch.nn.Tanh()
)

torch_nn(in) = NN(in)

Flux_nn = Chain(Dense(8,64,tanh),
                Dense(64,32,tanh),
                Dense(32,2,tanh))

for i in [1, 10, 100, 1000]
    println("Batch size: $i")
    torch_in = torch.rand(i,8)
    flux_in = rand(Float32,8,i)
    print("pytorch     :")
    @btime torch_nn($torch_in)
    print("flux        :")
    @btime Flux_nn($flux_in)    
end
Batch size: 1
pytorch     :  88.087 μs (6 allocations: 192 bytes)
flux        :  3.567 μs (6 allocations: 1.25 KiB)
Batch size: 10
pytorch     :  100.711 μs (6 allocations: 192 bytes)
flux        :  18.236 μs (6 allocations: 8.22 KiB)
Batch size: 100
pytorch     :  140.269 μs (6 allocations: 192 bytes)
flux        :  162.120 μs (8 allocations: 77.16 KiB)
Batch size: 1000
pytorch     :  465.119 μs (6 allocations: 192 bytes)
flux        :  4.485 ms (10 allocations: 766.19 KiB)
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> torch.get_num_threads()
1

The results are about the same, so I don’t think it’s only a threading issue. PyTorch probably has some other tricks. I will check MKL.jl later today; maybe that’s it.

About the SMT: I think it is disabled on my laptop, but I’m not sure I checked it right:

dicker@dicker-X1:~$ grep -o '^flags\b.*: .*\bht\b' /proc/cpuinfo | tail -1
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht

I will try that, thanks!

One general comment: consider switching the PyTorch models over to inference mode (specifically via torch.no_grad; see “What does model.eval() do in pytorch?” on Stack Overflow). It might not make too much of a difference in the PyTorch times, but it’s something to keep in mind.
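With the PyCall setup from the earlier benchmark script, that could look something like the following sketch (@pywith is PyCall’s interface to Python context managers; NN, torch_nn, and torch_in are as defined in that script):

NN.eval()                          # put layers such as dropout/batchnorm into eval mode
@pywith torch.no_grad() begin      # disable autograd bookkeeping during the benchmark
    @btime torch_nn($torch_in)
end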

SMT == ht or “hyper-threading” in your cpuinfo output. I agree with @Elrod that it would be worth testing out MKL.jl, as PyTorch uses it by default when installed through Conda (not sure about the pip wheel).
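To check which math libraries a given PyTorch build links against, you can print its compile-time configuration (a sketch using the PyCall handle from the earlier script; torch.__config__.show() includes the MKL / MKL-DNN flags):

println(torch.__config__.show())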

Good point, thankfully no_grad only has a minimal (~20µs) impact on the run time.

I was trying to trigger this while benchmarking yesterday but failed to do so. It seems like the custom broadcast definition is being ignored because of a fusion between + and σ?

julia> Base.broadcasted(::typeof(tanh), x) = typeof(x)

julia> Dense(8, 64, tanh)(xx)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}
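For context, a rough sketch of why the override sees a lazy Broadcasted argument rather than an Array (assuming Dense lowers to σ.(W*x .+ b), as it did in Flux at the time; run in a fresh session, without the debugging override above):

W = randn(Float32, 64, 8); b = randn(Float32, 64); xx = randn(Float32, 8, 1000);

# tanh.(W*xx .+ b) fuses into a single call tree:
#   materialize(broadcasted(tanh, broadcasted(+, W*xx, b)))
# so the tanh-level broadcast receives a lazy Broadcasted{+}, not an Array,
# and a method specialized on broadcasted(::typeof(tanh), ::Array) never fires.
inner = Base.Broadcast.broadcasted(+, W * xx, b)
outer = Base.Broadcast.broadcasted(tanh, inner)
typeof(inner)                                            # Broadcasted{…, typeof(+), …}
Base.Broadcast.materialize(outer) ≈ tanh.(W * xx .+ b)   # true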

Ahh, that’s a shame. I think it’s because Torch_jll only builds the CUDA version; you’d have to add the CPU version to it. If you feel up for it, this video might be useful; it explains how BinaryBuilder works.

SMT == ht or “hyper-threading”

I didn’t know that, thanks.

I installed MKL.jl; it caused some problems with torch through PyCall.
But the results are about the same:

Batch size: 1
flux        :  1.822 μs (6 allocations: 1.25 KiB)
Batch size: 10
flux        :  15.199 μs (6 allocations: 8.22 KiB)
Batch size: 100
flux        :  151.172 μs (8 allocations: 77.16 KiB)
Batch size: 1000
flux        :  1.540 ms (10 allocations: 766.19 KiB)

Still, PyTorch is about 4x faster at the largest batch size. I guess LoopVectorization will solve that.

I will set a meeting with the wizard

I have stumbled upon IntelVectorMath.jl, and it looks like that is the trick PyTorch uses (I guess).

using IntelVectorMath  # exports the alias IVM

IVM_nn = Chain(Dense(8, 64),
               IVM.tanh,
               Dense(64,32),
               IVM.tanh,
               Dense(32, 2),
               IVM.tanh)
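The IVM timings below presumably come from extending the earlier benchmark loop with an extra line along these lines:

print("IVM         :")
@btime IVM_nn($flux_in)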

results:

Batch size: 1
pytorch     :  80.571 μs (6 allocations: 192 bytes)
flux        :  3.076 μs (6 allocations: 1.25 KiB)
IVM         :  2.108 μs (9 allocations: 1.88 KiB)
Batch size: 10
pytorch     :  91.051 μs (6 allocations: 192 bytes)
flux        :  17.028 μs (6 allocations: 8.22 KiB)
IVM         :  5.418 μs (9 allocations: 12.33 KiB)
Batch size: 100
pytorch     :  117.251 μs (6 allocations: 192 bytes)
flux        :  164.212 μs (8 allocations: 77.16 KiB)
IVM         :  32.514 μs (12 allocations: 115.73 KiB)
Batch size: 1000
pytorch     :  343.641 μs (6 allocations: 192 bytes)
flux        :  2.755 ms (10 allocations: 766.19 KiB)
IVM         :  402.848 μs (15 allocations: 1.12 MiB)

LoopVectorization’s tanh will get faster in Julia 1.5+.

julia> using IntelVectorMath

julia> using LoopVectorization

julia> x = randn(Float32, 200);

julia> y = similar(x);

julia> @benchmark IntelVectorMath.tanh!($y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     120.267 ns (0.00% GC)
  median time:      121.149 ns (0.00% GC)
  mean time:        121.448 ns (0.00% GC)
  maximum time:     167.460 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     907

julia> @benchmark @avx $y .= tanh.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     107.539 ns (0.00% GC)
  median time:      107.754 ns (0.00% GC)
  mean time:        107.898 ns (0.00% GC)
  maximum time:     158.927 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     932

It will use xsimd’s implementation, via llvmcall. However, as LLVM IR isn’t particularly stable, this gives syntax errors with LLVM 8 or below. Julia 1.5 (which ships with LLVM 9) will hopefully be released soon.

I should also add a caveat that relative performance is system dependent. Performance may vary between AVX512, AVX2, and AVX implementations, and for some functions (like tanh_fast) by OS.
You can also try tanh_fast, which should do well on Linux:

julia> @benchmark @avx $y .= LoopVectorization.SLEEFPirates.tanh_fast.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     91.065 ns (0.00% GC)
  median time:      91.241 ns (0.00% GC)
  mean time:        91.347 ns (0.00% GC)
  maximum time:     136.881 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     955

Wow, mind blown.

Are there downsides to tanh_fast? It’s 3x faster for me.

It looks like this isn’t the lower-accuracy Sleef_tanh_u35 from the SLEEF math library reference (scalar), if I’m reading this right: https://github.com/chriselrod/SLEEFPirates.jl/blob/70baec2c9435c183abfa30e2c7c1ae6a479d273e/src/SLEEFPirates.jl#L213. But perhaps it’s still slightly less accurate?

I just ran a few tests (of the simple definition using exp), and it can be less accurate than I feared. In particular:

julia> tanh(0.0001)
9.999999966666668e-5

julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076)
9.999999966666668e-5

julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076) == tanh(0.0001)
true

2076 ulps is much greater than 3.5.
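The same gap can be counted directly (ulp_distance is a hypothetical helper; for two positive Float64 values like these, the difference of their bit patterns equals the number of prevfloat steps between them):

# number of representable Float64s separating the two results
ulp_distance(a::Float64, b::Float64) = abs(reinterpret(Int64, a) - reinterpret(Int64, b))

ulp_distance(SLEEFPirates.tanh_fast(0.0001), tanh(0.0001))  # 2076 on the machine above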