Those are CPU allocations, likely caused by the extra boxing that happens when invoking the kernel (passing a CuDeviceArray directly vs. additionally wrapping it in a closure object). You don’t need to worry about them: the generated GPU code is going to be nearly identical, and it can’t ever allocate anyway. Meanwhile, allocating such tiny objects on the CPU is really fast, and you can’t avoid allocating some of them, because the kernel’s arguments, be they regular arguments or captured ones, have to be boxed when they are sent to the CUDA driver.
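If you want to see this for yourself, here is a rough sketch that compares the two launch styles. It assumes a working GPU and a reasonably recent CUDA.jl that converts arrays captured by a closure. The names `add_one!`, `launch_direct` and `launch_closure` are made up for illustration, and the exact byte counts will differ between versions, so none are shown.

```julia
using CUDA

# Hypothetical kernel that takes the array as a regular argument.
function add_one!(a)
    i = threadIdx().x
    @inbounds a[i] += 1f0
    return nothing
end

# Launch with the array passed directly as a kernel argument.
function launch_direct(a)
    @cuda threads=length(a) add_one!(a)
    return nothing
end

# Launch with the array captured, so the kernel function itself
# is a closure object wrapping the array.
function launch_closure(a)
    kernel = () -> add_one!(a)
    @cuda threads=length(a) kernel()
    return nothing
end

a = CUDA.zeros(Float32, 32)

# Warm up first so compilation is not counted.
launch_direct(a)
launch_closure(a)

# @allocated only reports host-side (CPU) allocations made during the launch.
println("direct:  ", @allocated(launch_direct(a)), " bytes on the CPU")
println("closure: ", @allocated(launch_closure(a)), " bytes on the CPU")
```

Both variants run the same kernel body on the device; any difference in the numbers comes from host-side launch bookkeeping, not from the GPU code.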