Oh, it’s because I forgot the threads argument, correct? A silly mistake, haha. I just reran with threads=1024 and it finished in about 13 seconds. Thank you.
Another question: I’ve seen the post on @cuda threads and the new occupancy API, but I’m not sure I understand how to use it. Could you expand on that?