Gradient on GPU 70X slower than on CPU

Let me try to remove PermutedDimsArray and see if it solves the problem. Thank you!!!

UPDATE. I managed to avoid creating PermutedDimsArray inside the model by pre-permuting the data outside the model. Now the code works on GPU and about 10X faster than on CPU. Maybe I can further optimize the code for GPU, but for now this is good enough. Thank you all for the help!