Flux convolutional layer not type-stable

In my spare time I am trying to implement YOLOv3 in Flux. I believe I have successfully defined the model, which leaves translating the pretrained YOLOv3 weights into Flux. However, I have noticed, using random weights and input, that the network is extremely slow on the CPU, even with small inputs. I attribute this to the fact that Flux's convolutional layers are not type-stable (Julia 1.4.1, Flux 0.10.5):

using Flux
using BenchmarkTools
# small image batch
imgbatch = rand(Float32, 416, 416, 3, 2)
# 3x3 kernel, 3 input channels => 32 output channels, default stride/pad/dilation
conv_layer = Conv((3,3), 3=>32)
@btime conv_layer($imgbatch);
# 33.423 ms (75 allocations: 154.31 MiB) on my machine
@code_warntype conv_layer(imgbatch)
# results in:
Variables
  c::Conv{2,4,typeof(identity),Array{Float32,4},Array{Float32,1}}
  x::Array{Float32,4}
  #102::Flux.var"#102#103"
  σ::typeof(identity)
  b::Array{Float32,4}
  cdims::DenseConvDims{2,_A,_B,_C,_D,_E,_F,_G} where _G where _F where _E where _D where _C where _B where _A

Body::Any
1 ─ %1  = Base.getproperty(c, :σ)::Core.Compiler.Const(identity, false)
│   %2  = Base.getproperty(c, :bias)::Array{Float32,1}
│   %3  = Core.tuple(%2)::Tuple{Array{Float32,1}}
│         (#102 = %new(Flux.:(var"#102#103")))
│   %5  = #102::Core.Compiler.Const(Flux.var"#102#103"(), false)
│   %6  = Base.getproperty(c, :stride)::Tuple{Int64,Int64}
│   %7  = Flux.map(%5, %6)::Core.Compiler.Const((1, 1), false)
│   %8  = Core.tuple(Flux.:(:), 1)::Core.Compiler.Const((Colon(), 1), false)
│   %9  = Core._apply_iterate(Base.iterate, Flux.reshape, %3, %7, %8)::Array{Float32,4}
│         (σ = %1)
│         (b = %9)
│   %12 = (:stride, :padding, :dilation)::Core.Compiler.Const((:stride, :padding, :dilation), false)
│   %13 = Core.apply_type(Core.NamedTuple, %12)::Core.Compiler.Const(NamedTuple{(:stride, :padding, :dilation),T} where T<:Tuple, false)
│   %14 = Base.getproperty(c, :stride)::Tuple{Int64,Int64}
│   %15 = Base.getproperty(c, :pad)::NTuple{4,Int64}
│   %16 = Base.getproperty(c, :dilation)::Tuple{Int64,Int64}
│   %17 = Core.tuple(%14, %15, %16)::Tuple{Tuple{Int64,Int64},NTuple{4,Int64},Tuple{Int64,Int64}}
│   %18 = (%13)(%17)::NamedTuple{(:stride, :padding, :dilation),Tuple{Tuple{Int64,Int64},NTuple{4,Int64},Tuple{Int64,Int64}}}
│   %19 = Core.kwfunc(Flux.DenseConvDims)::Core.Compiler.Const(Core.var"#Type##kw"(), false)
│   %20 = Base.getproperty(c, :weight)::Array{Float32,4}
│         (cdims = (%19)(%18, Flux.DenseConvDims, x, %20))
│   %22 = σ::Core.Compiler.Const(identity, false)
│   %23 = Base.getproperty(c, :weight)::Array{Float32,4}
│   %24 = Flux.conv(x, %23, cdims)::AbstractArray{yT,4} where yT
│   %25 = Base.broadcasted(Flux.:+, %24, b)::Any
│   %26 = Base.broadcasted(%22, %25)::Any
│   %27 = Base.materialize(%26)::Any
└──       return %27
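
If I am reading the output correctly, the instability enters at cdims: the type parameters of DenseConvDims depend on runtime field values of the layer, so the return type of Flux.conv (and everything downstream of it) is inferred as Any. As a quick experiment (just a sketch, stable_conv is my own name and not a Flux API), precomputing the DenseConvDims once and calling conv behind a function barrier seems to recover inference for the actual convolution:

using Flux
using BenchmarkTools

w = conv_layer.weight
b = reshape(conv_layer.bias, 1, 1, :, 1)
# built once; assumes the layer's default stride/pad/dilation,
# and its concrete type is only known at runtime
cdims = Flux.DenseConvDims(imgbatch, w)
# function barrier: inside the call, cdims has a concrete type,
# so Flux.conv specializes and infers
stable_conv(x, w, b, cdims) = Flux.conv(x, w, cdims) .+ b

@btime stable_conv($imgbatch, $w, $b, $cdims);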

I have already opened an issue on Flux's GitHub here, because right now it takes almost 4 seconds to run my YOLOv3 implementation on the CPU with rand(Float32, 416, 416, 3, 2) as input. Since convolutional layers are such a staple of deep learning, I think fixing this is quite important, which is why I am raising it here as well. Or maybe I am missing something? I am quite new to Julia.

Additionally, the memory requirements on the GPU are massive. Using Chain where possible and only saving outputs when they are needed by the shortcut and route layers, my 2 GB GPU (not a lot, I am aware) is already full during inference with CuArray(rand(Float32, 416, 416, 3, 2)) as input.
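
Roughly, my forward pass looks like this (a sketch with my own names; blocks and route_sources are not Flux API):

# blocks: vector of Flux layers/Chains in network order;
# route_sources: indices whose outputs a later route/shortcut layer consumes
function yolo_forward(blocks, route_sources, x)
    saved = Dict{Int,Any}()
    for (i, block) in enumerate(blocks)
        x = block(x)
        # keep an activation alive only if a later layer references it
        i in route_sources && (saved[i] = x)
    end
    return x, saved
end

In PyTorch I have an implementation that lets me actually train YOLOv3 on this same GPU, albeit with a rather small batch size. What are the ways of reducing the memory requirements in Flux?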