How can I overlap GPU operations with a custom kernel function?

Your program is written correctly; whether execution actually overlaps is up to the CUDA driver. To verify whether it does, use a proper profiler such as Nsight Systems rather than guessing from timings.
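For reference, here is a minimal sketch of the usual pattern that makes overlap possible: pinned host memory, non-default streams, and asynchronous copies around the kernel launch. It is not your code; the kernel `scale`, the helper `run_two_streams`, and the buffer sizes are hypothetical names picked for illustration. Even with this structure, the driver may still serialize the work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the custom kernel: scales a chunk in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Illustrative helper: issues copy -> kernel -> copy on two streams.
// Returns 0 if everything ran and the results are as expected, 1 otherwise.
int run_two_streams(int n_per_stream) {
    const int kStreams = 2;
    const int total = kStreams * n_per_stream;
    const size_t bytes = static_cast<size_t>(n_per_stream) * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned (page-locked) host memory lets cudaMemcpyAsync run asynchronously.
    if (cudaMallocHost(&h, kStreams * bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d, kStreams * bytes) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < total; ++i) h[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks = (n_per_stream + threads - 1) / threads;
    for (int s = 0; s < kStreams; ++s) {
        float *hs = h + s * n_per_stream;
        float *ds = d + s * n_per_stream;
        // Work in different streams has no implicit ordering, so the copy of
        // one chunk is allowed (not guaranteed) to overlap the other's kernel.
        cudaMemcpyAsync(ds, hs, bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<blocks, threads, 0, streams[s]>>>(ds, n_per_stream, 2.0f);
        cudaMemcpyAsync(hs, ds, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams, then check that every element was doubled.
    int bad = (cudaDeviceSynchronize() != cudaSuccess) ? 1 : 0;
    for (int i = 0; i < total && !bad; ++i) {
        if (h[i] != 2.0f) bad = 1;
    }

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}

int main() {
    int rc = run_two_streams(1 << 20);
    printf("%s\n", rc == 0 ? "ok" : "failed");
    return rc;
}
```

Assuming the file is saved as `overlap.cu`, build it with `nvcc -o overlap overlap.cu` and run it under the profiler with `nsys profile -o report ./overlap`. The timeline in the resulting report shows whether the copies and kernels of the two streams actually ran concurrently.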