I am trying to run the Flux model zoo LSTM example on a GPU. It turns out there is no performance improvement; on the contrary, it gets worse. The example uses ~500 MB of training data, and each epoch takes ~25 s on the CPU and ~55 s on the GPU. I am using the CUDAnative workflow from Flux (not CuLSTM). A few questions:
- Should I expect better GPU performance on a problem of this size?
- I suspect one of the reasons for the poor performance could be unnecessary copying of data between the CPU and GPU. Are there tools to debug this?
- I noticed there is another set of models in Flux that use NVIDIA's cuDNN library, e.g. the CuLSTM model. Is it better to use those? Are there any examples?
- Should I be using an entirely different ML package?
Thanks in advance.