That’s because thread 8 doesn’t participate in the shfl: for thread 8, higher_cg_lane is 9, so it never calls CG.shfl.
You may be better off keeping all threads participating in the shuffle and doing something like what is shown in the NVIDIA Technical Blog post “Cooperative Groups: Flexible CUDA Thread Programming”.
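Here is a minimal sketch of what "keeping all threads participating" could look like. I don't have your full kernel, so the kernel name `read_higher_lane`, the 32-wide tile, the arrays `in`/`out`, and the `-1` sentinel are all assumptions for illustration, not your code. The idea is that every lane calls `shfl` unconditionally with a clamped source lane, and only afterwards decides whether the value it received is meaningful.

```cuda
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical kernel: each thread reads the value held by the next-higher lane.
// All lanes of the tile execute the shuffle; validity is decided afterwards.
__global__ void read_higher_lane(const int* in, int* out, int n)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    const unsigned int lane = static_cast<unsigned int>(tile.thread_rank());
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the data still take part, carrying a dummy value.
    const int value = (idx < n) ? in[idx] : 0;

    // Clamp the source lane instead of skipping the shuffle.
    const unsigned int higher_lane = lane + 1;
    const bool has_higher = (higher_lane < tile.size()) && (idx + 1 < n);
    const unsigned int src_lane = has_higher ? higher_lane : lane;

    // Executed by every lane in the tile, with no divergence around it.
    const int neighbor = tile.shfl(value, src_lane);

    if (idx < n) {
        out[idx] = has_higher ? neighbor : -1;  // -1 is an assumed "no neighbor" marker
    }
}

int main()
{
    const int n = 9;  // nine active elements, as in the thread 0..8 case
    int* in = nullptr;
    int* out = nullptr;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));

    for (int i = 0; i < n; ++i) {
        in[i] = 10 * i;
        out[i] = 0;
    }

    // One full warp is launched so that the whole 32-lane tile exists.
    read_higher_lane<<<1, 32>>>(in, out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        printf("out[%d] = %d\n", i, out[i]);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With this layout, threads 0 to 7 should each receive the value held by the next thread, and thread 8 gets the sentinel because it has no valid higher lane. The important part is that thread 8 still executes the `shfl`, so the lanes reading from it or alongside it are not shuffling with a non-participating thread.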