PyTorch doesn’t support running on (Intel) iGPUs or OpenCL devices anyhow (it does have ROCm support, but that’s Linux only)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (Whiskey Lake) (rev 02)
I don’t think pytorch uses that
Can it be that Torch does a sort of lazy calculation?
I guess you are right, sorry for the bad clue. Is the CPU usage the same (all cores at 100%)?
I suspect that, because tanh is much slower than the matrix multiplication here (quoting @d1cker’s benchmarks):
julia> @btime $ww * $xx;
21.679 μs (2 allocations: 125.08 KiB)
julia> @btime tanh.($xx);
1.256 ms (2 allocations: 250.08 KiB)
julia> @btime @avx tanh.($xx);
142.913 μs (2 allocations: 250.08 KiB)
Julia ends up mostly single-threaded, since @avx tanh.(x) is single-threaded. I’ll add a vmapnt to LoopVectorization soon to make it easier to run this multi-threaded.
There is currently a vmapntt, which is threaded, but it uses non-temporal stores, which are likely to hurt performance unless the batches are very large.
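For reference, the existing interface looks roughly like this (a sketch, assuming vmapntt follows the same map-style calling convention as vmap):

using LoopVectorization

x = randn(Float32, 64, 1000)
y1 = vmap(tanh, x)     # SIMD map, single-threaded
y2 = vmapntt(tanh, x)  # threaded, but uses non-temporal stores, so mainly pays off for very large arrays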
TL;DR: ignoring tanh and setting Julia’s BLAS thread count to the number of physical cores, Flux actually edges out PyTorch!
After testing on a processor of a similar vintage, I’m assuming @d1cker’s CPU may have SMT enabled:
julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
PyTorch uses 4 threads by default. I assume this is calculated using the number of physical cores:
In [3]: xx = torch.randn(1000, 8, dtype=torch.float32)
In [4]: nn = torch.nn.Sequential(
...: torch.nn.Linear(8, 64),
...: torch.nn.Linear(64, 32),
...: torch.nn.Linear(32, 2)
...: )
In [5]: %timeit nn(xx)[0,0]
115 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: torch.get_num_threads()
Out[6]: 4
In [7]: nn2 = torch.nn.Sequential(
...: torch.nn.Linear(8, 64),
...: torch.nn.Tanh(),
...: torch.nn.Linear(64, 32),
...: torch.nn.Tanh(),
...: torch.nn.Linear(32, 2),
...: torch.nn.Tanh()
...: )
In [8]: %timeit nn2(xx)[0,0]
239 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Julia uses 8 threads, which I assume comes from the logical core count (2x4):
julia> xx = randn(Float32, 8, 1000)
8×1000 Array{Float32,2}:
0.503908 1.18022 -1.06559 1.84436 … -0.46637 -0.876909 0.82367 0.706861 0.440678
1.44368 -1.49167 -0.0692981 0.174324 -1.43919 1.38345 0.667294 1.19638 -0.097154
-0.151144 -0.264827 -1.69988 1.00654 0.402308 1.49821 0.123561 1.09967 -1.00765
-1.14175 -0.231032 0.110278 0.0363431 0.101058 0.372235 -0.511466 0.526048 -1.4937
0.614555 -0.713141 -2.32049 0.608342 0.376743 1.15385 -0.235271 -0.220544 1.01734
1.6772 -0.538957 -1.28689 -2.1525 … -0.679366 0.269263 -0.902019 -1.71012 -1.95944
0.318371 -0.838747 -0.158875 0.407624 0.682344 0.293501 -0.964282 -0.795304 -0.719654
-0.831405 -0.36298 -0.341576 -1.46556 0.927488 -0.485381 0.170069 -1.17026 -0.617091
julia> const nn = Chain(Dense(8,64),
                        Dense(64,32),
                        Dense(32,2))
Chain(Dense(8, 64), Dense(64, 32), Dense(32, 2))
julia> @benchmark nn($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 124.091 μs (0.00% GC)
median time: 152.492 μs (0.00% GC)
mean time: 170.617 μs (9.51% GC)
maximum time: 1.460 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> import LinearAlgebra.BLAS
julia> BLAS.set_num_threads(4)
julia> @benchmark nn($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 90.263 μs (0.00% GC)
median time: 100.746 μs (0.00% GC)
mean time: 112.331 μs (9.46% GC)
maximum time: 836.423 μs (75.47% GC)
--------------
samples: 10000
evals/sample: 1
julia> const nn2 = Chain(Dense(8,64,tanh),
                         Dense(64,32,tanh),
                         Dense(32,2,tanh))
Chain(Dense(8, 64, tanh), Dense(64, 32, tanh), Dense(32, 2, tanh))
julia> @benchmark nn2($xx)
BenchmarkTools.Trial:
memory estimate: 766.19 KiB
allocs estimate: 10
--------------
minimum time: 1.575 ms (0.00% GC)
median time: 1.589 ms (0.00% GC)
mean time: 1.602 ms (0.68% GC)
maximum time: 2.335 ms (27.41% GC)
--------------
samples: 3119
evals/sample: 1
Not sure why SMT has such a large performance impact…
You could also try MKL.jl. MKL won’t use more threads than the number of physical cores, and performs better for small matrices in general.
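Roughly (a sketch; the exact MKL.jl installation steps have changed between releases, so check its README for your Julia version):

using LinearAlgebra
BLAS.vendor()             # :mkl if the swap worked, :openblas64 otherwise (Julia <= 1.6)
BLAS.set_num_threads(4)   # with MKL this is usually the physical core count already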
I set the number of threads that PyTorch uses to 1:
using PyCall
using Flux
using BenchmarkTools

torch = pyimport("torch")
torch.set_num_threads(1)

NN = torch.nn.Sequential(
    torch.nn.Linear(8, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),
    torch.nn.Tanh()
)
torch_nn(in) = NN(in)

Flux_nn = Chain(Dense(8,64,tanh),
                Dense(64,32,tanh),
                Dense(32,2,tanh))

for i in [1, 10, 100, 1000]
    println("Batch size: $i")
    torch_in = torch.rand(i,8)
    flux_in = rand(Float32,8,i)
    print("pytorch :")
    @btime torch_nn($torch_in)
    print("flux :")
    @btime Flux_nn($flux_in)
end
Batch size: 1
pytorch : 88.087 μs (6 allocations: 192 bytes)
flux : 3.567 μs (6 allocations: 1.25 KiB)
Batch size: 10
pytorch : 100.711 μs (6 allocations: 192 bytes)
flux : 18.236 μs (6 allocations: 8.22 KiB)
Batch size: 100
pytorch : 140.269 μs (6 allocations: 192 bytes)
flux : 162.120 μs (8 allocations: 77.16 KiB)
Batch size: 1000
pytorch : 465.119 μs (6 allocations: 192 bytes)
flux : 4.485 ms (10 allocations: 766.19 KiB)
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> torch.get_num_threads()
1
The results are about the same, so I don’t think it is only a threading issue; PyTorch probably has some other tricks. I will check MKL.jl later today, maybe that’s it.
About SMT: I think it is disabled on my laptop, but I’m not sure I checked it correctly:
dicker@dicker-X1:~$ grep -o '^flags\b.*: .*\bht\b' /proc/cpuinfo | tail -1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
I will try that, thanks!
One general comment would be to switch the PyTorch models over to inference mode (specifically via torch.no_grad; see “What does model.eval() do in pytorch?” on Stack Overflow), for example. It might not make too much of a difference in the PyTorch times, but it’s something to keep in mind.
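For example, a minimal sketch on top of the PyCall script above (assuming PyCall’s @pywith context-manager macro; torch, torch_nn, and torch_in are the names defined earlier):

using PyCall, BenchmarkTools

@pywith torch.no_grad() begin
    @btime torch_nn($torch_in)   # no autograd bookkeeping inside this block
end
# NN.eval() additionally switches layers like dropout/batchnorm to inference mode,
# which doesn't matter for a plain Linear/Tanh stack.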
SMT == ht, or “hyper-threading”, in your cpuinfo output. I agree with @Elrod that it would be worth testing out MKL.jl, as PyTorch uses it by default when installed through Conda (not sure about the pip wheel).
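Note that the ht flag only says the CPU supports hyper-threading, not that it is currently enabled. A quick cross-check from inside Julia could look like this (a sketch assuming Hwloc.jl and its num_physical_cores, an extra dependency):

using Hwloc
Hwloc.num_physical_cores(), Sys.CPU_THREADS  # equal means SMT is off; a 2x gap means it is on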
Good point; thankfully no_grad only has a minimal (~20 µs) impact on the run time.
I was trying to trigger this while benchmarking yesterday but failed to do so. It seems like the custom broadcast definition is being ignored because of the fusion between + and σ?
julia> Base.broadcasted(::typeof(tanh), x) = typeof(x)
julia> Dense(8, 64, tanh)(xx)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}
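Here is a minimal sketch of what dot-fusion builds (the approximate lowering of tanh.(W*x .+ b), not Flux’s actual source; run in a fresh session, without the broadcasted override above). The + is never materialized before broadcasted(tanh, …) is called, so a method expecting a plain array receives a lazy Broadcasted{typeof(+)} instead:

W = randn(Float32, 64, 8); b = randn(Float32, 64); x = randn(Float32, 8, 1000)

lazy = Base.broadcasted(tanh, Base.broadcasted(+, W * x, b))  # what tanh.(W*x .+ b) constructs
typeof(lazy.args[1])        # Broadcasted{…, typeof(+), …}: the object the tanh method receives
y = Base.materialize(lazy)  # + and tanh are only evaluated here, in one fused loop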
Ahh, that’s a shame. I think it’s because Torch_jll only builds the CUDA version; you’d have to add the CPU version to it. If you feel up for it, this video might be useful: it explains how BinaryBuilder works.
SMT == ht, or “hyper-threading”

I didn’t know that, thanks.
I installed MKL.jl; it caused some problems with using torch through PyCall, but the Flux results are about the same:
Batch size: 1
flux : 1.822 μs (6 allocations: 1.25 KiB)
Batch size: 10
flux : 15.199 μs (6 allocations: 8.22 KiB)
Batch size: 100
flux : 151.172 μs (8 allocations: 77.16 KiB)
Batch size: 1000
flux : 1.540 ms (10 allocations: 766.19 KiB)
Still, PyTorch is about 4x faster at the largest batch size. I guess LoopVectorization will solve that; I will set up a meeting with the wizard.
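In the meantime, something along these lines might be worth trying (a sketch I haven’t benchmarked against the script above; avx_tanh is just a name for a whole-array @avx broadcast used as its own layer):

using Flux, LoopVectorization

avx_tanh(x) = @avx tanh.(x)   # vectorize tanh over the full layer output

LV_nn = Chain(Dense(8, 64),  avx_tanh,
              Dense(64, 32), avx_tanh,
              Dense(32, 2),  avx_tanh)

LV_nn(rand(Float32, 8, 1000))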
I have stumbled upon IntelVectorMath.jl, and it looks like this may be the trick that PyTorch uses (I guess).
using IntelVectorMath  # exports the IVM shorthand

IVM_nn = Chain(Dense(8, 64),
               IVM.tanh,
               Dense(64, 32),
               IVM.tanh,
               Dense(32, 2),
               IVM.tanh)
results:
Batch size: 1
pytorch : 80.571 μs (6 allocations: 192 bytes)
flux : 3.076 μs (6 allocations: 1.25 KiB)
IVM : 2.108 μs (9 allocations: 1.88 KiB)
Batch size: 10
pytorch : 91.051 μs (6 allocations: 192 bytes)
flux : 17.028 μs (6 allocations: 8.22 KiB)
IVM : 5.418 μs (9 allocations: 12.33 KiB)
Batch size: 100
pytorch : 117.251 μs (6 allocations: 192 bytes)
flux : 164.212 μs (8 allocations: 77.16 KiB)
IVM : 32.514 μs (12 allocations: 115.73 KiB)
Batch size: 1000
pytorch : 343.641 μs (6 allocations: 192 bytes)
flux : 2.755 ms (10 allocations: 766.19 KiB)
IVM : 402.848 μs (15 allocations: 1.12 MiB)
LoopVectorization’s tanh will get faster in Julia 1.5+.
julia> using IntelVectorMath
julia> using LoopVectorization
julia> x = randn(Float32, 200);
julia> y = similar(x);
julia> @benchmark IntelVectorMath.tanh!($y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 120.267 ns (0.00% GC)
median time: 121.149 ns (0.00% GC)
mean time: 121.448 ns (0.00% GC)
maximum time: 167.460 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 907
julia> @benchmark @avx $y .= tanh.($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 107.539 ns (0.00% GC)
median time: 107.754 ns (0.00% GC)
mean time: 107.898 ns (0.00% GC)
maximum time: 158.927 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 932
It will use xsimd’s implementation, via llvmcall. However, as LLVM IR isn’t particularly stable, this gives syntax errors with LLVM 8 or below. Julia 1.5 (which ships with LLVM 9) will hopefully be released soon.
I should also add a caveat that relative performance is system dependent. Performance may vary between AVX512, AVX2, and AVX implementations, and for some functions (like tanh_fast) by OS.
You can also try tanh_fast, which should do well on Linux:
julia> @benchmark @avx $y .= LoopVectorization.SLEEFPirates.tanh_fast.($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 91.065 ns (0.00% GC)
median time: 91.241 ns (0.00% GC)
mean time: 91.347 ns (0.00% GC)
maximum time: 136.881 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 955
Wow, mind blown.
Are there downsides to tanh_fast? It’s 3x faster for me.
It looks like this isn’t the lower-accuracy Sleef_tanh_u35 from SLEEF - Math library reference (scalar), if I’m reading this right: https://github.com/chriselrod/SLEEFPirates.jl/blob/70baec2c9435c183abfa30e2c7c1ae6a479d273e/src/SLEEFPirates.jl#L213 . But perhaps it is still slightly less accurate?
I just ran a few tests (of the simple definition using exp), and it can be less accurate than I feared. In particular:
julia> tanh(0.0001)
9.999999966666668e-5
julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076)
9.999999966666668e-5
julia> prevfloat(SLEEFPirates.tanh_fast(0.0001), 2076) == tanh(0.0001)
true
2076 ulps is much greater than 3.5.
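To put that in the units the SLEEF docs use, here is a rough ulp count (a sketch; the simple eps-based estimate assumes both results land in the same binade):

using LoopVectorization

x = 0.0001
approx = LoopVectorization.SLEEFPirates.tanh_fast(x)
exact  = tanh(x)
abs(approx - exact) / eps(exact)   # ≈ 2076, versus the 3.5 ulp bound the SLEEF u35 kernels target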