You will need to optimize a little for the GPU's architecture. For example, your kernel seems to load a lot of identical values in every thread; it is better to load them in a single thread and cache them in shared memory. Also be sure to use @inbounds
where possible, as bounds-checking branches are much more expensive on the GPU.
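As a rough illustration of both points, here is a minimal sketch, assuming CUDA.jl and a reasonably recent version that provides `CuStaticSharedArray` (older releases used `@cuStaticSharedMem` instead). The kernel `apply_coeffs!`, the table size `NCOEF`, and the polynomial workload are all hypothetical stand-ins, since I don't know what your kernel actually computes. It stages a small table that every thread needs into shared memory once per block, and marks the indexing with @inbounds:

```julia
using CUDA

# Hypothetical setup: every thread needs the same NCOEF coefficients.
const NCOEF = 16

function apply_coeffs!(out, x, coeffs)
    # One shared-memory copy of the table per block (size must be a compile-time constant).
    shared = CuStaticSharedArray(Float32, NCOEF)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid

    # The first NCOEF threads of the block each copy one value from global memory.
    # Assumes the block has at least NCOEF threads.
    if tid <= NCOEF
        @inbounds shared[tid] = coeffs[tid]
    end
    sync_threads()  # wait until the whole table is staged

    if i <= length(out)
        @inbounds xi = x[i]
        acc = 0f0
        k = NCOEF
        # Horner evaluation of sum(coeffs[k] * xi^(k-1)), reading only shared memory.
        while k >= 1
            @inbounds acc = acc * xi + shared[k]
            k -= 1
        end
        @inbounds out[i] = acc
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
out = similar(x)
coeffs = CuArray(Float32.(1:NCOEF))

threads = 256
blocks = cld(length(x), threads)
@cuda threads=threads blocks=blocks apply_coeffs!(out, x, coeffs)
```

Here several threads cooperate on the load rather than a single one, which is usually the better choice when the table has more than a handful of entries; for a single scalar, guarding the load with `tid == 1` works just as well. Only add @inbounds once you are sure the indices are valid, since an out-of-bounds access then goes unchecked.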
Generally, only a few embarrassingly parallel algorithms get easy speed-ups when parallelized like that, and even then the gain often depends on memory pressure and arithmetic intensity. Beyond that, you will need to optimize for the architecture.