Dynamic parallelism slow in CUDA.jl

By default, space is reserved for 2048 pending child grids; this can be extended by raising the cudaLimitDevRuntimePendingLaunchCount device limit before the parent kernel is launched.
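The code from the quoted passage did not survive here, so below is a minimal sketch of what raising that limit might look like with the CUDA C++ runtime API. It uses `cudaDeviceGetLimit` and `cudaDeviceSetLimit` with `cudaLimitDevRuntimePendingLaunchCount`. The value 32768 is only an illustrative choice, not a recommendation, and the sketch does not show the CUDA.jl equivalent.

```cuda
// Sketch: query and raise the pending-launch limit for dynamic parallelism.
// Build with nvcc; the limit should be set before launching any parent kernel.
#include <cstdio>
#include <cuda_runtime.h>

// Print a message and bail out if a runtime call fails.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    size_t pending = 0;

    // Read the current size of the fixed-size pending-launch pool.
    if (!check(cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount),
               "cudaDeviceGetLimit"))
        return 1;
    std::printf("pending launch count before: %zu\n", pending);

    // Illustrative value (an assumption): pick something that covers the
    // number of child grids you expect to be outstanding at once.
    const size_t requested = 32768;
    if (!check(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, requested),
               "cudaDeviceSetLimit"))
        return 1;

    // Read the limit back; the runtime reports the value actually in effect.
    if (!check(cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount),
               "cudaDeviceGetLimit"))
        return 1;
    std::printf("pending launch count after:  %zu\n", pending);

    return 0;
}
```

A larger pool reserves more device memory up front, so it is probably worth sizing it to the expected number of outstanding child grids rather than setting it arbitrarily high.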

The runtime first tries to add the newly launched grid to the fixed-size pool, and if it is full, uses the virtualized pool. While this means that grids are queued successfully, the costs of using the virtualized pool are higher than those of the fixed-size pool.
