PyTorch doesn’t support running on (Intel) iGPUs or OpenCL devices anyway (it does have ROCm support, but that’s Linux-only).
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (Whiskey Lake) (rev 02)
I don’t think pytorch uses that
Could it be that Torch does some sort of lazy evaluation?
I guess you are right, sorry for the bad clue. Does the CPU usage look the same (all cores at 100%)?
I suspect that’s because tanh is much slower than the matrix multiplication here (quoting @d1cker’s benchmarks):
julia> @btime $ww * $xx;
21.679 μs (2 allocations: 125.08 KiB)
julia> @btime tanh.($xx);
1.256 ms (2 allocations: 250.08 KiB)
julia> @btime @avx tanh.($xx);
142.913 μs (2 allocations: 250.08 KiB)
Julia is mostly single-threaded here, because @avx tanh.(x) is single-threaded. I’ll add a vmapnt to LoopVectorization soon to make it easier to run this multithreaded.
There is currently a vmapntt which is threaded, but it uses non-temporal stores which are likely to hurt performance unless the batches are very large.
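For readers more at home in Python, the two operations being compared above can be sketched with NumPy; the shapes (ww: 32×64, xx: 64×1000) are my guess from the reported 125 KiB and 250 KiB allocations:

```python
import numpy as np

# Reproduce the two benchmarked operations: a small GEMM and an elementwise tanh.
# Shapes are inferred (assumption) from the allocation sizes in the Julia timings.
rng = np.random.default_rng(0)
ww = rng.standard_normal((32, 64)).astype(np.float32)
xx = rng.standard_normal((64, 1000)).astype(np.float32)

yy = ww @ xx        # matrix multiplication: handled by multithreaded BLAS
zz = np.tanh(xx)    # elementwise tanh: the expensive part in the Julia timings
```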
TL;DR: ignoring tanh and setting Julia’s BLAS thread count to the number of physical cores, Flux actually edges out PyTorch!
I’m assuming @d1cker’s CPU may have SMT enabled after testing on a processor of a similar vintage:
julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
PyTorch uses 4 threads by default. I assume this is calculated using the number of physical cores:
In [3]: xx = torch.randn(1000, 8, dtype=torch.float32)
In [4]: nn = torch.nn.Sequential(
...: torch.nn.Linear(8, 64),
...: torch.nn.Linear(64, 32),
...: torch.nn.Linear(32, 2)
...: )
In [5]: %timeit nn(xx)[0,0]
115 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: torch.get_num_threads()
Out[6]: 4
In [7]: nn2 = torch.nn.Sequential(
...: torch.nn.Linear(8, 64),
...: torch.nn.Tanh(),
...: torch.nn.Linear(64, 32),
...: torch.nn.Tanh(),
...: torch.nn.Linear(32, 2),
...: torch.nn.Tanh()
...: )
In [8]: %timeit nn2(xx)[0,0]
239 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Julia uses 8 threads, which I assume comes from the logical core count (2x4):
julia> xx = randn(Float32, 8, 1000)
8×1000 Array{Float32,2}:
 0.503908   1.18022  -1.06559    1.84436  …  -0.46637   -0.876909   0.82367   0.706861   0.440678
 ⋮                                        ⋱                                              ⋮
-0.831405  -0.36298  -0.341576  -1.46556  …   0.927488  -0.485381   0.170069  -1.17026   -0.617091
julia> const nn = Chain(Dense(8,64),
Dense(64,32),
Dense(32,2))
Chain(Dense(8, 64), Dense(64, 32), Dense(32, 2))
julia> @benchmark nn($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 124.091 μs (0.00% GC)
median time: 152.492 μs (0.00% GC)
mean time: 170.617 μs (9.51% GC)
maximum time: 1.460 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> import LinearAlgebra.BLAS
julia> BLAS.set_num_threads(4)
julia> @benchmark nn($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 90.263 μs (0.00% GC)
median time: 100.746 μs (0.00% GC)
mean time: 112.331 μs (9.46% GC)
maximum time: 836.423 μs (75.47% GC)
--------------
samples: 10000
evals/sample: 1
julia> const nn2 = Chain(Dense(8,64,tanh),
Dense(64,32,tanh),
Dense(32,2,tanh))
Chain(Dense(8, 64, tanh), Dense(64, 32, tanh), Dense(32, 2, tanh))
julia> @benchmark nn2($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 1.575 ms (0.00% GC)
median time: 1.589 ms (0.00% GC)
mean time: 1.602 ms (0.68% GC)
maximum time: 2.335 ms (27.41% GC)
--------------
samples: 3119
evals/sample: 1
Not sure why SMT has such a large performance impact…
You could also try MKL.jl. MKL won’t use more threads than the number of physical cores, and performs better for small matrices in general.
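For comparison, on the Python side the BLAS thread cap is usually set through environment variables; a minimal sketch (the cap of 4 is arbitrary, and the assumption is that MKL/OpenBLAS read these variables once, at load time):

```python
import os

# Must be set before the first `import numpy`, since the BLAS library
# reads these variables when it is loaded.
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np  # imported only after the thread caps are in place

a = np.ones((64, 64), dtype=np.float32)
b = a @ a  # this GEMM should now use at most 4 BLAS threads
```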
I set the number of threads that PyTorch uses to 1:
using PyCall
using Flux
using BenchmarkTools
torch = pyimport("torch")
torch.set_num_threads(1)
NN = torch.nn.Sequential(
torch.nn.Linear(8, 64),
torch.nn.Tanh(),
torch.nn.Linear(64, 32),
torch.nn.Tanh(),
torch.nn.Linear(32, 2),
torch.nn.Tanh()
)
torch_nn(in) = NN(in)
Flux_nn = Chain(Dense(8,64,tanh),
Dense(64,32,tanh),
Dense(32,2,tanh))
for i in [1, 10, 100, 1000]
println("Batch size: $i")
torch_in = torch.rand(i,8)
flux_in = rand(Float32,8,i)
print("pytorch :")
@btime torch_nn($torch_in)
print("flux :")
@btime Flux_nn($flux_in)
end
Batch size: 1
pytorch : 88.087 μs (6 allocations: 192 bytes)
flux : 3.567 μs (6 allocations: 1.25 KiB)
Batch size: 10
pytorch : 100.711 μs (6 allocations: 192 bytes)
flux : 18.236 μs (6 allocations: 8.22 KiB)
Batch size: 100
pytorch : 140.269 μs (6 allocations: 192 bytes)
flux : 162.120 μs (8 allocations: 77.16 KiB)
Batch size: 1000
pytorch : 465.119 μs (6 allocations: 192 bytes)
flux : 4.485 ms (10 allocations: 766.19 KiB)
[Julia 1.4.1 (2020-04-14) REPL startup banner]
julia> torch.get_num_threads()
1
The results are about the same, so I don’t think it’s only the threading. PyTorch probably has some other tricks. I will check MKL.jl later today; maybe that’s it.
About SMT: I think it is disabled on my laptop, but I’m not sure I checked it right.
dicker@dicker-X1:~$ grep -o '^flags\b.*: .*\bht\b' /proc/cpuinfo | tail -1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
I will try that, thanks!
One general comment: convert the PyTorch models to inference mode for the benchmarks (specifically via torch.no_grad; see “What does model.eval() do in pytorch?” on Stack Overflow). It might not make much of a difference in the PyTorch times, but it’s something to keep in mind.
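A minimal sketch of that suggestion (guarded with a try/except so the snippet still runs where torch is not installed):

```python
# Run inference under torch.no_grad() so no autograd graph is recorded,
# and switch the model to eval mode first.
try:
    import torch

    net = torch.nn.Sequential(
        torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
    )
    net.eval()                 # eval mode (matters for dropout/batchnorm layers)
    x = torch.randn(1000, 8)
    with torch.no_grad():      # disables gradient bookkeeping for this forward pass
        out = net(x)
    grad_tracked = out.requires_grad
except ImportError:
    grad_tracked = False       # torch unavailable; nothing was tracked
```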
SMT == ht or “hyper-threading” in your cpuinfo output. I agree with @Elrod that it would be worth testing out MKL.jl, as PyTorch uses it by default when installed through Conda (not sure about the pip wheel).
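To check whether SMT is doubling the logical count, one can compare logical and physical core counts; a best-effort stdlib sketch (the /proc/cpuinfo parsing is Linux-specific, with a fallback elsewhere):

```python
import os

def core_counts():
    # Logical CPUs as the OS sees them (includes SMT siblings).
    logical = os.cpu_count() or 1
    # Physical cores: count unique (physical id, core id) pairs on Linux;
    # fall back to the logical count where /proc/cpuinfo is unavailable.
    physical = logical
    try:
        cores = set()
        phys_id = "0"
        with open("/proc/cpuinfo") as f:
            for line in f:
                key, _, val = line.partition(":")
                key, val = key.strip(), val.strip()
                if key == "physical id":
                    phys_id = val
                elif key == "core id":
                    cores.add((phys_id, val))
        if cores:
            physical = len(cores)
    except OSError:
        pass
    return logical, physical

logical, physical = core_counts()
# With SMT ("ht" in the cpuinfo flags) enabled, logical is typically 2x physical.
```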
Good point, thankfully no_grad only has a minimal (~20µs) impact on the run time.
I was trying to trigger this while benchmarking yesterday but failed to do so. It seems like the custom broadcast definition is being ignored because of a fusion between + and σ?
julia> Base.broadcasted(::typeof(tanh), x) = typeof(x)
julia> Dense(8, 64, tanh)(xx)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}
SMT == ht or “hyper-threading”
I didn’t know that, thanks.
I installed MKL.jl; it caused some problems with torch through PyCall, so only the Flux numbers this time.
But the results are about the same:
Batch size: 1
flux : 1.822 μs (6 allocations: 1.25 KiB)
Batch size: 10
flux : 15.199 μs (6 allocations: 8.22 KiB)
Batch size: 100
flux : 151.172 μs (8 allocations: 77.16 KiB)
Batch size: 1000
flux : 1.540 ms (10 allocations: 766.19 KiB)
Still, PyTorch is about 4x faster at the largest batch size. I guess that LoopVectorization will solve that.
I will set a meeting with the wizard
I have stumbled upon IntelVectorMath.jl, and it looks like this may be the trick that PyTorch uses (I guess):
IVM_nn = Chain(Dense(8, 64),
IVM.tanh,
Dense(64,32),
IVM.tanh,
Dense(32, 2),
IVM.tanh)
results:
Batch size: 1
pytorch : 80.571 μs (6 allocations: 192 bytes)
flux : 3.076 μs (6 allocations: 1.25 KiB)
IVM : 2.108 μs (9 allocations: 1.88 KiB)
Batch size: 10
pytorch : 91.051 μs (6 allocations: 192 bytes)
flux : 17.028 μs (6 allocations: 8.22 KiB)
IVM : 5.418 μs (9 allocations: 12.33 KiB)
Batch size: 100
pytorch : 117.251 μs (6 allocations: 192 bytes)
flux : 164.212 μs (8 allocations: 77.16 KiB)
IVM : 32.514 μs (12 allocations: 115.73 KiB)
Batch size: 1000
pytorch : 343.641 μs (6 allocations: 192 bytes)
flux : 2.755 ms (10 allocations: 766.19 KiB)
IVM : 402.848 μs (15 allocations: 1.12 MiB)
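The reason a VML-style library helps: it applies tanh across the whole array in one vectorized call instead of one scalar libm call per element. A rough Python illustration of the two styles (NumPy standing in for the vectorized library):

```python
import math
import numpy as np

x = np.linspace(-3.0, 3.0, 10_000)

# Scalar style: one libm call per element, no SIMD across elements.
scalar = np.array([math.tanh(v) for v in x])

# Vectorized style: a single call the library can SIMD across elements,
# which is the kind of batching MKL's VML (wrapped by IntelVectorMath.jl) does.
vectorized = np.tanh(x)

max_abs_diff = float(np.max(np.abs(scalar - vectorized)))  # the two paths agree closely
```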
LoopVectorization’s tanh will get faster in Julia 1.5+.
julia> using IntelVectorMath
julia> using LoopVectorization
julia> x = randn(Float32, 200);
julia> y = similar(x);
julia> @benchmark IntelVectorMath.tanh!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 120.267 ns (0.00% GC)
median time: 121.149 ns (0.00% GC)
mean time: 121.448 ns (0.00% GC)
maximum time: 167.460 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 907
julia> @benchmark @avx $y .= tanh.($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 107.539 ns (0.00% GC)
median time: 107.754 ns (0.00% GC)
mean time: 107.898 ns (0.00% GC)
maximum time: 158.927 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 932
It will use xsimd’s implementation, via llvmcall. However, as LLVM IR isn’t particularly stable, this gives syntax errors with LLVM 8 or below. Julia 1.5 (which ships with LLVM 9) will hopefully be released soon.
I should also add a caveat that relative performance is system dependent. Performance may vary between AVX512, AVX2, and AVX implementations, and for some functions (like tanh_fast) by OS.
You can also try tanh_fast, which should do well on Linux:
julia> @benchmark @avx $y .= LoopVectorization.SLEEFPirates.tanh_fast.($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 91.065 ns (0.00% GC)
median time: 91.241 ns (0.00% GC)
mean time: 91.347 ns (0.00% GC)
maximum time: 136.881 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 955
Wow, mind blown.
Are there downsides to tanh_fast? It’s 3x faster for me.
It looks like this isn’t the lower-accuracy Sleef_tanh_u35 from the SLEEF math library reference, if I’m reading this right: https://github.com/chriselrod/SLEEFPirates.jl/blob/70baec2c9435c183abfa30e2c7c1ae6a479d273e/src/SLEEFPirates.jl#L213 . But perhaps it’s still slightly less accurate?
I just ran a few tests (of the simple definition using exp), and it can be less accurate than I feared. In particular:
julia> tanh(0.0001)
9.999999966666668e-5
julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076)
9.999999966666668e-5
julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076) == tanh(0.0001)
true
2076 ulps is much greater than the 3.5 promised by the u35 variants.
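To make the failure mode concrete: a textbook exp-based tanh loses accuracy for small x through cancellation in exp(2x) − 1, and rewriting with expm1 removes it. A hedged Python sketch (exact ulp counts depend on the platform libm, and tanh_fast’s actual algorithm differs):

```python
import math

def tanh_naive(x):
    # Textbook formula; for small x the subtraction exp(2x) - 1 cancels
    # and can cost thousands of ulps of accuracy.
    e = math.exp(2.0 * x)
    return (e - 1.0) / (e + 1.0)

def tanh_stable(x):
    # Same algebra rewritten with expm1 to avoid the cancellation:
    # tanh(x) = expm1(2x) / (expm1(2x) + 2)
    e = math.expm1(2.0 * x)
    return e / (e + 2.0)

x = 0.0001
ulp = math.ulp(math.tanh(x))
err_naive = abs(tanh_naive(x) - math.tanh(x)) / ulp    # typically large
err_stable = abs(tanh_stable(x) - math.tanh(x)) / ulp  # a few ulps at most
```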