In your second version, you’re launching only a single thread and having it calculate every value on its own. That leaves almost all of the GPU idle. Launch a grid of multiple blocks with multiple threads each instead, and have every thread calculate its own i and j from its thread and block indices.
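Here is a rough sketch of what that could look like. I don't know what your loop body actually computes, so `compute`, `fill`, `run_fill`, the row-major `out` layout, and the 16x16 block size are all placeholders of mine, not anything from your code. The part that matters is how i and j are derived and the bounds check.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-element work; replace with whatever your loop body computes.
__device__ float compute(int i, int j)
{
    return static_cast<float>(i + j);
}

// Each thread handles exactly one (i, j) element.
__global__ void fill(float* out, int rows, int cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    // The grid is rounded up, so threads past the edge must do nothing.
    if (i < rows && j < cols)
        out[i * cols + j] = compute(i, j);
}

// Host helper (assumed name): launches the kernel and copies the result back.
// Returns true if every CUDA call succeeded.
bool run_fill(std::vector<float>& host, int rows, int cols)
{
    host.resize(static_cast<size_t>(rows) * cols);
    size_t bytes = host.size() * sizeof(float);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return false;

    dim3 block(16, 16);  // 256 threads per block; an assumed, tunable choice
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);  // round up to cover all elements

    fill<<<grid, block>>>(dev, rows, cols);

    cudaError_t err = cudaGetLastError();  // catches launch configuration errors
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err == cudaSuccess;
}

int main()
{
    std::vector<float> result;
    if (!run_fill(result, 100, 70)) {
        std::printf("CUDA error\n");
        return 1;
    }
    std::printf("last element = %f\n", result[99 * 70 + 69]);
    return 0;
}
```

Compared with `<<<1, 1>>>`, the i and j loops disappear from the kernel entirely: the launch configuration supplies the iteration space, and the `if` guards the threads that fall outside it when the sizes are not multiples of the block size.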