How to parallelize dual coordinate descent methods on a GPU using CUDA.jl?

Thank you so much, it works!
I will try it on the real data X and compare the running time with the CPU version!