@Prince_Mahajan I cannot reproduce this, and I am pretty sure you're just measuring compile time. This method has a heavy compile-time cost because of how it generates the kernels for GPUs, so that is expected (and there's a fix we can make for this). When I run your script twice, I get:
result.total_time = 25.036839354
result.total_time = 1.347138324
The first run includes compile time. The second run, without compile time, takes less than 1.5 seconds.
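If it helps, here is a minimal sketch of how to separate a one-time warm-up cost from the steady-state run time when benchmarking. It is written in Python purely for illustration and is not the script from this thread: `time_calls` and `solve` are hypothetical names, and the `sleep` calls only stand in for compilation and for the actual solve.

```python
import time

def time_calls(fn, repeats=2):
    """Call fn() `repeats` times and return each call's wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Hypothetical stand-in for a solver that pays a one-time setup cost
# (e.g. kernel compilation) on its first call only.
_compiled = {}

def solve():
    if "kernel" not in _compiled:
        time.sleep(0.05)   # assumption: stands in for one-time compilation
        _compiled["kernel"] = True
    time.sleep(0.005)      # assumption: stands in for the actual solve

first, second = time_calls(solve)
print(f"first call (includes one-time cost): {first:.3f} s")
print(f"second call (steady state):          {second:.3f} s")
```

The number to compare against other methods is the second one; the first mostly tells you how expensive the warm-up is.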