Could this kernel go faster?

Thanks to all of you, this code keeps getting better. However, this thread has become messy, and I have managed to pin down my bottleneck more precisely, so I have reposted the question in a new thread: Is this the maximum perf i can obtain?
