`Conv` is 2x slower than PyTorch `Conv` on CPU

Do you mind trying again with the functional conv functions in NNlib/PyTorch? I think it would also be good to open an issue in Flux or NNlib — either would help narrow things down.
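Something like the following (an untested sketch, assuming NNlib's `conv` with its WHCN input layout and `(kw, kh, cin, cout)` weight layout) would exercise the functional path directly, bypassing the Flux layer:

```julia
using BenchmarkTools, NNlib

# Same shapes as the Conv((7,7), 3 => 64; stride=2, pad=3) layer below,
# but calling NNlib.conv directly so any Flux-layer overhead is excluded.
x = randn(Float32, 224, 224, 3, 2)   # WHCN input
w = randn(Float32, 7, 7, 3, 64)      # (kw, kh, cin, cout) weights

@btime conv($x, $w; stride=2, pad=3)
```

If that matches the layer's timing, the gap is in NNlib's im2col/GEMM path rather than in Flux.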

Edit: using MKL.jl helps narrow the gap on my machine:

In [1]: import torch.nn.functional as F

In [2]: import torch

In [3]: x = torch.randn(2, 3, 224, 224)

In [4]: conv1 = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=True)

In [5]: %timeit conv1(x)
2.78 ms ± 70.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

julia> using LinearAlgebra: BLAS

julia> BLAS.vendor()
:mkl

julia> using BenchmarkTools, Flux

julia> const x = randn(Float32, 224, 224, 3, 2);

julia> const conv1 = Conv((7,7), 3 => 64; stride=2, pad=3);

julia> @benchmark conv1($x)
BenchmarkTools.Trial: 
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     4.588 ms (0.00% GC)
  median time:      5.330 ms (0.00% GC)
  mean time:        5.424 ms (4.99% GC)
  maximum time:     10.235 ms (7.86% GC)
  --------------
  samples:          922
  evals/sample:     1