LSTM on a GPU

Hello,

I am trying to run the Flux model zoo LSTM example on a GPU. It turns out there is no performance improvement; on the contrary, it gets worse. The example uses ~500 MB of training data, and each epoch runs for ~25 sec on the CPU and ~55 sec on the GPU. I am using the CUDAnative workflow from Flux (not CuLSTM); a sketch of the kind of setup I mean is below. A few questions:
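
For reference, here is a minimal sketch of the kind of setup I'm describing. The layer sizes, the data names `Xs`/`Ys`, and the loss are placeholders, not the exact model zoo code:

```julia
using Flux, CuArrays               # CuArrays provides the GPU array backend
using Flux: crossentropy, reset!

# Placeholder model, roughly the shape of the model zoo char-RNN
# (the sizes here are illustrative, not my exact setup)
m = Chain(
    LSTM(128, 256),
    Dense(256, 128),
    softmax) |> gpu                # move the parameters to the GPU

# Xs and Ys stand in for my sequences of one-hot batches;
# each batch is moved to the GPU inside the loss
function loss(xs, ys)
    l = sum(crossentropy.(m.(gpu.(xs)), gpu.(ys)))
    reset!(m)                      # reset the hidden state between sequences
    return l
end

opt = ADAM()
# each epoch then looks like:
# Flux.train!(loss, params(m), zip(Xs, Ys), opt)
```

Note that in this pattern every batch is copied to the GPU inside the loss, which is part of why I'm asking question 2 below.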

  1. Should I expect better GPU performance on a problem of this size?
  2. I suspect one of the reasons for the poor performance could be unnecessary copying of data between the CPU and the GPU. Are there tools to debug this?
  3. I noticed there is another set of models in Flux that use NVIDIA's cuDNN library, such as the CuLSTM model. Is it better to use those? Are there any examples?
  4. Should I be using an entirely different ML package?

Thanks in advance.