`Conv` is 2x slower than PyTorch `Conv` on CPU


using BenchmarkTools

using Flux
conv = Conv((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
conv(dummy)  # ignore compile time
@benchmark conv(dummy) 
  memory estimate:  19.29 MiB
  allocs estimate:  60
  minimum time:     22.996 ms (0.00% GC)
  median time:      25.009 ms (0.00% GC)
  mean time:        30.242 ms (17.79% GC)
  maximum time:     66.116 ms (62.37% GC)
  samples:          165
  evals/sample:     1

using PyCall
const torch = pyimport_conda("torch", "torch")
const nn = pyimport("torch.nn")
dummy_py = torch.randn(2, 3, 224, 224);
conv1_py = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=true)
@benchmark conv1_py(dummy_py)  

  memory estimate:  160 bytes
  allocs estimate:  4
  minimum time:     8.824 ms (0.00% GC)
  median time:      9.436 ms (0.00% GC)
  mean time:        10.463 ms (0.00% GC)
  maximum time:     19.719 ms (0.00% GC)
  samples:          478
  evals/sample:     1

Is this result expected, or is it a benchmark/implementation mistake? This is the first layer of ResNet18 (with an extra bias); the full network implementation gives similar results.

Something is fishy with that benchmark: how can PyTorch allocate only 160 bytes when the output array is much larger than that?
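
For scale, the size of the output array can be worked out from the layer parameters in the benchmark. This is a minimal sketch, assuming the usual convolution output-size formula and a Float32 output; `conv_out` is a hypothetical helper, not a Flux or PyTorch function. A likely explanation for the small number, stated here as an assumption, is that BenchmarkTools only counts memory allocated by Julia itself, so buffers allocated inside PyTorch's native code would not show up in the estimate.

```julia
# Hypothetical helper: output length of a convolution along one spatial dimension.
conv_out(n, k; stride=1, pad=0) = (n + 2pad - k) ÷ stride + 1

h = conv_out(224, 7; stride=2, pad=3)   # output height
w = conv_out(224, 7; stride=2, pad=3)   # output width
channels, batch = 64, 2                 # values taken from the benchmark above

nbytes = h * w * channels * batch * sizeof(Float32)
println("output size:  ", (h, w, channels, batch))
println("output bytes: ", nbytes, " (", nbytes / 2^20, " MiB)")
```

With these parameters the output is 112 x 112 x 64 x 2, or 6422528 bytes (6.125 MiB), which is far more than the 160 bytes reported.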

:sweat_smile: Sadly, I didn't notice that before posting. But plain `@time` gives similar results, so I guess the 2x gap itself is real.

This is probably not responsible for the entire 2x speed difference, but I have noticed that Flux.Conv layers are not type-stable, which I guess would reduce performance. I opened a GitHub issue here and a thread here.

Given the lack of response, though, I am unsure whether this type instability is actually a big deal; I am quite new to Julia.
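
To illustrate what "not type-stable" means here, the sketch below uses two made-up functions, `unstable` and `stable`. They are hypothetical and have nothing to do with Flux internals. It asks the compiler which return type it infers for each and checks whether that type is concrete.

```julia
# Hypothetical functions for illustration only; these are not Flux code.
unstable(x) = x > 0 ? 1 : 1.0      # returns an Int or a Float64 depending on the value of x
stable(x)   = x > 0 ? 1.0 : 2.0    # always returns a Float64

# Base.return_types reports what the compiler infers for the given argument types.
inferred_unstable = first(Base.return_types(unstable, (Int,)))
inferred_stable   = first(Base.return_types(stable, (Int,)))

println(inferred_unstable, " concrete? ", isconcretetype(inferred_unstable))
println(inferred_stable, " concrete? ", isconcretetype(inferred_stable))
```

In an interactive session, `@code_warntype conv(dummy)` gives the same kind of information for the layer itself. How much an instability costs probably depends on where it sits: if it only causes one dynamic dispatch per layer call and the heavy loop runs inside a type-stable kernel, the overhead may be small compared to a 20 ms convolution.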

Could one reason be that PyTorch uses the cuDNN function that also adds the bias and applies the activation, while Flux only does the convolution with cuDNN and relies on generic implementations for adding the bias and performing the activation?

The bias calculation could make a difference, but the benchmark uses the identity function as its activation, so I think that would be compiled away and have zero overhead, no?
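
A small sketch of that point, using made-up array sizes and plain broadcasting rather than Flux's actual code path (how Flux applies bias and activation internally is an assumption here): applying `identity` on top of the bias addition gives the same result as the bias addition alone.

```julia
# Made-up sizes; y stands in for a convolution output in
# width x height x channels x batch layout, b for a per-channel bias.
y  = randn(Float32, 4, 4, 3, 2)
b  = randn(Float32, 3)
b4 = reshape(b, 1, 1, :, 1)          # align the bias with the channel dimension

with_identity = identity.(y .+ b4)   # bias plus identity activation in one fused broadcast
bias_only     = y .+ b4              # bias alone

println(with_identity == bias_only)
```

So the activation itself should be free, but the bias addition still has to read and write the whole output once, which is not zero work on a 6 MiB array.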

Hmm, I just saw that the benchmark does not even use the GPU, so there is no cuDNN involved. Sorry for the noise.

Could it be an MKL vs OpenBLAS thing then?
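
One way to start checking this on the Julia side is to ask which BLAS library is loaded and how many threads it uses. The sketch below assumes Julia 1.7 or newer, where `BLAS.get_config` is available. Whether Flux's CPU convolution actually goes through BLAS is not established in this thread, so treat this only as a first diagnostic.

```julia
using LinearAlgebra

# List the BLAS/LAPACK libraries currently loaded (requires Julia 1.7+).
println(BLAS.get_config())

# Number of threads the BLAS library is configured to use.
println("BLAS threads: ", BLAS.get_num_threads())
```

On a default Julia install this typically reports OpenBLAS. Comparing it with the BLAS backend and thread count of the PyTorch build would show whether the two benchmarks run under comparable conditions.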