Parallelizing a Nested Loop Concurrently

Is there a way to use pmap in the inner loop and @floop in the outer loop?

Or, alternatively, can the whole thing be done with Distributed instead of multithreading?

For context, the current solution based on FLoops spends a lot of time in garbage collection and uses a lot of memory. In this post, @odow provides a version based on Distributed that has comparable speed but uses much less memory.
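
To make the Distributed-only option concrete, here is a minimal sketch of the layout I have in mind: pmap over the outer loop, with the inner loop kept serial on each worker. This is not the code from the linked post. `inner_work`, `inner_loop`, the worker count, and the loop sizes are all placeholders I made up for illustration.

```julia
using Distributed

# Start two local worker processes (the count is arbitrary for this sketch).
addprocs(2)

# Hypothetical stand-in for the real per-element work; defined on every process.
@everywhere inner_work(i, j) = i * j

# The inner loop runs serially on whichever worker receives outer index i.
@everywhere function inner_loop(i, n_inner)
    total = 0
    for j in 1:n_inner
        total += inner_work(i, j)
    end
    return total
end

# pmap hands the outer indices to the workers and collects the results in order.
results = pmap(i -> inner_loop(i, 4), 1:3)

# Shut the workers down again once the results are back.
rmprocs(workers())
```

Each worker here is a separate Julia process with its own heap and garbage collector, so only the per-index return values travel back to the main process.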