Hi,

I’m currently using Knet and Flux for AlphaZero-like calculations. Since the bottleneck is the large number of inferences needed, I was wondering whether it would be possible to wrap TensorRT, for example via https://github.com/zerollzeng/tiny-tensorrt (the C++ part), and how hard that would be. If not, is it plausible that dedicated CUDA kernels for inference could bring some acceleration, perhaps using Float16?

Thanks.

To give an idea, 90% of the time is spent in the function below, and within it 90% of the time is spent on `π,v=m(KnetArray(batch))`, where `m` is a residual network. The typical input size is 8x8x8x1600 (Reversi, 200 games in parallel, batch size 8 for parallel MCTS), and inference takes around 150 ms for 96 layers, 10 blocks, on a GTX 1070 :(

```
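using Base.Threads: @threads

function (m::resnetwork)(x::Vector{GameEnv}, squashing=1f0)
    l = length(x)
    batch = zeros(Float32, (sizeInput..., l))
    # Encode each game state into its slice of the batch, in parallel on the CPU
    @threads for k in 1:l
        @views decoder(x[k], batch[:, :, :, k])
    end
    # Copy the batch to the GPU and run the network (the dominant cost)
    π, v = m(KnetArray(batch))
    π = softmax(squashing .* π)
    # Copy policy and value back to the host
    Array(π), Array(v)
end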
```
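
For scale, here is a rough back-of-the-envelope sketch of what those figures mean per position. It uses only the numbers quoted above (200 games, MCTS batch of 8, about 150 ms per call, 8x8x8 Float32 input planes); the variable names are just for illustration, and the 150 ms is an approximate measurement, so treat the results as order-of-magnitude only.

```julia
# Figures quoted in the post (the timing is approximate)
games        = 200      # games played in parallel
mcts_batch   = 8        # positions per game per parallel MCTS step
inference_ms = 150.0    # time for one network call on the whole batch

positions = games * mcts_batch                        # positions per network call

# Average cost of one position, and overall throughput
per_position_us = inference_ms * 1000 / positions
positions_per_s = positions / (inference_ms / 1000)

# Size of the Float32 input batch copied to the GPU on each call
input_bytes = 8 * 8 * 8 * positions * sizeof(Float32)

println("positions per call:      ", positions)
println("time per position (µs):  ", round(per_position_us, digits=2))
println("positions per second:    ", round(Int, positions_per_s))
println("input batch size (MiB):  ", round(input_bytes / 2^20, digits=2))
```

So each call handles 1600 positions at a bit under 100 µs per position, and the input batch is only a few MiB, which suggests that the network evaluation itself, rather than the host-to-device copy, is the part worth speeding up.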