Do you mind trying again with the functional conv calls in NNlib/PyTorch? I think it would be good to open an issue in Flux or NNlib, and that comparison would help narrow things down.
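Something along these lines is what I had in mind on the Julia side (a minimal sketch calling NNlib.conv directly; the sizes mirror your example, and the weight layout/keyword names are the standard NNlib ones, so adjust as needed):

```julia
using BenchmarkTools, NNlib

# Same input as the Flux example: WHCN layout (224×224, 3 channels, batch of 2).
x = randn(Float32, 224, 224, 3, 2)

# 7×7 kernel, 3 input channels, 64 output channels.
w = randn(Float32, 7, 7, 3, 64)

# Benchmark the functional convolution, bypassing the Flux layer (and its bias broadcast).
@benchmark NNlib.conv($x, $w; stride=2, pad=3)
```

That takes the layer wrapper out of the picture, so any remaining gap should mostly be NNlib's im2col + GEMM path versus whatever backend PyTorch dispatches to on CPU (oneDNN/MKL-DNN, I believe).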
Edit: using MKL.jl helps narrow the gap on my machine:
```python
In [1]: import torch.nn.functional as F

In [2]: import torch

In [3]: x = torch.randn(2, 3, 224, 224)

In [4]: conv1 = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=True)

In [5]: %timeit conv1(x)
2.78 ms ± 70.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
```julia
julia> using LinearAlgebra: BLAS

julia> BLAS.vendor()
:mkl

julia> using BenchmarkTools, Flux

julia> const x = randn(Float32, 224, 224, 3, 2);

julia> const conv1 = Conv((7,7), 3 => 64; stride=2, pad=3);

julia> @benchmark conv1($x)
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     4.588 ms (0.00% GC)
  median time:      5.330 ms (0.00% GC)
  mean time:        5.424 ms (4.99% GC)
  maximum time:     10.235 ms (7.86% GC)
  --------------
  samples:          922
  evals/sample:     1
```
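In case anyone wants to reproduce the MKL comparison, this is roughly how the BLAS backend gets switched (a sketch assuming Julia 1.7+ with a recent MKL.jl, which swaps BLAS at load time via libblastrampoline; on older setups MKL.jl instead rebuilds the system image, which is what produces the `BLAS.vendor() == :mkl` above):

```julia
# Load MKL before anything that hits BLAS so the forwarding takes effect.
using MKL
using LinearAlgebra: BLAS

# On recent Julia, inspect the libblastrampoline config to confirm MKL is active.
BLAS.get_config()   # should list an MKL library among the loaded backends
```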