Parallelizing for loop in the computation of a gradient

Yeah, but then I can add an additional function barrier with an explicit kernel function, and even if some types are messed up due to the closure bug, I only have to pay for the dynamic dispatch nthreads() times.
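
Something like this is what I mean (a minimal sketch with made-up names; grad_sample is just a placeholder for whatever the per-sample gradient actually is):

```julia
using Base.Threads

# Placeholder per-sample gradient; stands in for the real computation.
grad_sample(x, θ) = 2 * (θ - x)

# Explicit kernel acting as a function barrier: inside this call the
# argument types are concrete, so the hot loop compiles to fast code
# even if the closure in the threaded loop boxed some variables.
function grad_kernel!(out, xs, θ, idxs)
    @inbounds for i in idxs
        out[i] = grad_sample(xs[i], θ)
    end
    return out
end

function gradient_threaded(xs, θ)
    out = similar(xs, float(eltype(xs)))
    # Split the index range into roughly one chunk per thread.
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nthreads())))
    @threads for idxs in chunks
        # Any dynamic dispatch caused by boxed captures happens here,
        # once per chunk (at most nthreads() times), not once per sample.
        grad_kernel!(out, xs, θ, idxs)
    end
    return out
end

# e.g. gradient_threaded(randn(10_000), 0.5)
```

So even if `out`, `xs`, or `θ` end up with non-concrete types in the closure created by the threaded loop, the only dynamically dispatched call is grad_kernel! itself, and inside the barrier everything is type-stable again.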