Fast CNN inference

I’m currently using Knet and Flux for AlphaZero-like calculations. As the bottleneck is the large number of inferences needed, I was wondering whether it would be possible to wrap something like TensorRT (the C++ part), and how hard that would be. If not, is it plausible that specific CUDA kernels for the inference could bring some acceleration, perhaps using Float16?
To give an idea, 90% of the time is spent in this function, and within it 90% of the time is spent on π,v=m(KnetArray(batch)), where m is a residual network. The typical input size is 8x8x8x1600 (reversi, 200 games in parallel, batch size 8 for parallel MCTS), and inference time is around 150 ms for 96 layers, 10 blocks on a GTX 1070 :(

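function (m::resnetwork)(x::Vector{GameEnv}, squashing=1f0)
    # ...
    @threads for k in 1:l
        @views decoder(x[k], batch[:, :, :, k])  # encode game state k into slice k of the batch
    end
    π, v = m(KnetArray(batch))  # the inference call that dominates the runtime
    π = softmax(squashing .* π)
    # ...
end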
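
For scale, here is a rough back-of-the-envelope sketch of how big one such batch is. It assumes the 8x8x8x1600 shape quoted above, with 1600 = 200 games x 8 MCTS positions; the variable names are only illustrative and this is arithmetic, not a benchmark.

```julia
# Size of one inference batch, assuming the 8x8x8x1600 input shape from the post.
games = 200        # games played in parallel
mcts_batch = 8     # positions evaluated per game in one parallel MCTS step
dims = (8, 8, 8, games * mcts_batch)

nelements = prod(dims)                    # total number of input values
bytes_f32 = nelements * sizeof(Float32)   # bytes if stored as Float32
bytes_f16 = nelements * sizeof(Float16)   # bytes if stored as Float16

println("elements: ", nelements)
println("Float32:  ", bytes_f32 / 2^20, " MiB")
println("Float16:  ", bytes_f16 / 2^20, " MiB")
```

At roughly 3 MiB per batch in Float32, copying the input to the GPU is probably a small part of the 150 ms, so most of the time is likely spent in the network itself.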

Have you tried just keeping everything on CPU? That input size isn’t terribly massive, so I wonder if the back-and-forth transfer is worth the latency.

I tried; it is at least a hundred times slower. I don’t know of any reasonable implementation that doesn’t use a GPU, at least not for anything bigger than tic-tac-toe.

Wrapping C++ is tricky, and TensorRT doesn’t seem to have a C API either, so that won’t be easy. Maybe you can use PyCall to call the TensorRT Python bindings.