Hi,

I’m currently running an AlphaZero replica with Flux. As you may know, most of the time is spent during self-play, essentially doing neural network inference. I’m currently using a batch size of 512, and when measuring the time spent on inference I get close to the “theoretical timing”, i.e. the mean time of:

```
for k in 1:n_sim
    model(batch)
end
```

where `batch` is `rand(Float32, 7, 7, 3, 512) |> gpu`.

So far so good. The problem is that the same model running on another implementation in Rust is 10 times faster. If I count only the inference time, it is already at least 5 times slower than the whole loop in Rust.
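For reference, here is a minimal sketch of how the Flux timing could be taken with explicit GPU synchronization, since CUDA kernels launch asynchronously and an unsynchronized loop may under-report the time. The helper name `mean_inference_time`, the default `n_sim = 100`, and the small stand-in model are assumptions for illustration only, not the actual network. Depending on the Flux version, `cuDNN` may also need to be loaded next to `CUDA`.

```julia
using Flux, CUDA

# Hypothetical helper (not part of Flux or CUDA.jl): mean forward-pass time in seconds.
function mean_inference_time(model, batch; n_sim = 100)
    model(batch)                                  # warm-up call so compilation is not timed
    CUDA.functional() && CUDA.synchronize()       # make sure the warm-up has finished
    t = @elapsed begin
        for _ in 1:n_sim
            model(batch)
        end
        # Wait for all queued GPU work before stopping the clock.
        CUDA.functional() && CUDA.synchronize()
    end
    return t / n_sim
end

# Small stand-in model, only to keep the sketch self-contained; substitute the real network.
model = Chain(Conv((3, 3), 3 => 8, pad = 1, bias = false), BatchNorm(8, relu)) |> gpu
batch = rand(Float32, 7, 7, 3, 512) |> gpu

println("mean time per batch: ", mean_inference_time(model, batch), " s")
```

If the synchronized number differs a lot from the unsynchronized one, the comparison with the Rust timings would need to be redone on that basis.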

So I wonder: is there any penalty for calling cuDNN from Flux? Would it possibly be faster to use CUDA.jl/cuDNN directly, or is there something dumb I am doing that makes the forward pass slow?

For the record, here is the model struct:

```
mutable struct ResnetB{T}
    layers::T
end
Flux.@functor ResnetB

function ResNetBlock(n::Int)
    layers = Chain(
        Conv((3, 3), n => n, pad=1, stride=1, bias=false),
        BatchNorm(n, relu),
        Conv((3, 3), n => n, pad=1, stride=1, bias=false),
        BatchNorm(n))
    return ResnetB(layers)
end

function (m::ResnetB)(x)
    return relu.(m.layers(x) .+ x)
end

mutable struct resnetwork_2H <: Network
    base
    res
    value
    policy
end
Flux.@functor resnetwork_2H

function (m::resnetwork_2H)(x)
    b = m.base(x)
    b = m.res(b)
    return m.policy(b), m.value(b)
end

function resnetb_2H(n_filter, n_tower, dense)
    return resnetwork_2H(
        Chain(Conv((3, 3), 3 => n_filter, stride=1, pad=1, bias=false), BatchNorm(n_filter, relu)),
        Chain([ResnetB(Chain(ResNetBlock(n_filter), ResNetBlock(n_filter))) for k in 1:n_tower]...),
        Chain(Conv((1, 1), n_filter => 32, stride=1, pad=0, bias=false), BatchNorm(32, relu), Flux.flatten, Dense(32 * 49, dense, relu), Dense(dense, 1, tanh)),
        Chain(Conv((1, 1), n_filter => 32, stride=1, pad=0, bias=false), BatchNorm(32, relu), Flux.flatten, Dense(32 * 49, 539), softmax)) |> gpu
end
```

Thanks in advance