You should go with Distributed.jl instead, since I imagine you want to use multiple machines with GPUs; see Multiple GPUs · CUDA.jl.
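The usual pattern (following the Multiple GPUs section of the CUDA.jl docs) is one worker process per device, with each worker pinned to its own GPU. A minimal sketch:

```julia
using Distributed, CUDA

# one worker per visible GPU
addprocs(length(devices()))
@everywhere using CUDA

# pin each worker to a distinct device
asyncmap(collect(zip(workers(), devices()))) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"
        device!(d)
    end
end
```

From there you can `pmap`/`remotecall` your per-block work across workers, and each one only ever talks to its own GPU.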
If you do want to use multithreading on a single GPU (which I think will be far slower than just building the big block sparse array and using the full power of your GPU on it), I think you will run out of memory really quickly (see Multithreading & GPU memory management).
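To make the "one big array" point concrete: launching many small kernels from CPU threads mostly serializes on the device (and each task holds its own device allocations), whereas a single batched operation over one big array saturates the GPU in one launch. An illustrative sketch with placeholder sizes (`n`, `nblocks` are made up here):

```julia
using CUDA

n, nblocks = 256, 64

# per-block version: many small arrays, many small kernel launches,
# and each one pins its own chunk of device memory
blocks = [CUDA.rand(Float32, n, n) for _ in 1:nblocks]
per_block = [b .* 2f0 .+ 1f0 for b in blocks]

# batched version: one big 3-D array, one fused broadcast over all blocks
big = CUDA.rand(Float32, n, n, nblocks)
batched = big .* 2f0 .+ 1f0
```

The actual operation in your code may differ; the point is just that one large launch beats many tiny ones on a single GPU.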
PS: One last thing you may already know, but the code you showed isn't thread safe.
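I haven't reproduced your exact code here, but the usual fix is either to have each iteration write to its own preallocated slot, or to guard any shared mutable state with a lock. A generic sketch (`work` and `nblocks` are placeholders):

```julia
using Base.Threads

work(i) = Float64(i)^2   # stand-in for your per-block computation
nblocks = 100

# safe: each iteration writes to a distinct, preallocated index
results = Vector{Float64}(undef, nblocks)
@threads for i in 1:nblocks
    results[i] = work(i)
end

# whereas push!-ing into a shared vector from threads is NOT safe without a lock
acc = Float64[]
lk = ReentrantLock()
@threads for i in 1:nblocks
    x = work(i)
    lock(lk) do
        push!(acc, x)
    end
end
```

If the shared state is a `CuArray`, the same rules apply on top of the per-task GPU memory concerns mentioned above.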