That’s not a generic for loop. Most of the iteration is performed implicitly, because the function is called by many threads at once; the loop only serves to extend that to inputs that have more elements than the launch has threads.
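For reference, here is a minimal sketch of that pattern (usually called a grid-stride loop) in CUDA.jl. The kernel name `add_strided!`, the wrapper `add_on_gpu!`, and the launch sizes are made up for illustration; this is not necessarily the code from your post.

```julia
using CUDA

# Hypothetical kernel: every thread starts at its own global index and then
# jumps ahead by the total number of threads in the grid.
function add_strided!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for i in index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Hypothetical wrapper; the default launch sizes are arbitrary assumptions.
function add_on_gpu!(y, x; threads=256, blocks=4)
    @cuda threads=threads blocks=blocks add_strided!(y, x)
    return y
end
```

With 256 × 4 = 1024 threads and, say, 10_000 elements, each thread handles roughly ten elements; with at least as many threads as elements, the loop body runs at most once per thread.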
Iteration with Iterators.product won’t get mapped to efficient GPU iteration automatically.
If ProductIterator supported getindex, it would be possible to create the iterator on the CPU and ‘index’ it from a GPU thread to get an appropriate index, but it doesn’t look like that’s supported.
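One possible workaround, assuming the product is over ranges of the form `1:n`, is `CartesianIndices`, which is an `AbstractArray` and so does support `getindex`. The sketch below runs on the CPU only and just emulates which indices a single thread would visit; `linear_to_tuple` and `visited_indices` are hypothetical helper names, and I have not checked here how well this performs inside an actual kernel.

```julia
# Hypothetical helper: map a linear (thread-style) index to a tuple of
# per-dimension indices, in the same order Iterators.product would yield them.
function linear_to_tuple(dims::NTuple{N,Int}, i::Int) where {N}
    return Tuple(CartesianIndices(dims)[i])
end

# Emulate on the CPU what one thread of a strided loop would visit:
# thread, thread + nthreads, thread + 2nthreads, ... up to prod(dims).
function visited_indices(dims, thread, nthreads)
    return [linear_to_tuple(dims, i) for i in thread:nthreads:prod(dims)]
end

dims = (3, 4)
println(linear_to_tuple(dims, 5))      # index 5 of a 3×4 product
println(visited_indices(dims, 2, 5))   # what "thread 2 of 5" would cover
```

For `dims = (3, 4)`, linear index 5 maps to `(2, 2)`, since the first dimension varies fastest, matching the order of `Iterators.product(1:3, 1:4)`.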