The issue with ParallelStencil.jl was the default number of blocks. I only know a little about GPU computing, so I didn’t realize it could matter quite so much.
By setting threads=(16,16) and blocks=(Nx,Ny) .÷ threads, I get:
22.434988 seconds (13.81 M CPU allocations: 587.483 MiB, 2.22% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.03% memmgmt time)
It is even faster with threads=(32,8).
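For anyone hitting the same problem, here is a minimal sketch of how the launch configuration can be computed in plain Julia. The grid size Nx = Ny = 1024 is an assumption for illustration only (the actual sizes are not given above), and `blocks_safe` is a hypothetical name introduced here. Note that `.÷` truncates, so it only covers the whole grid when Nx and Ny are multiples of the thread counts; `cld` rounds up instead.

```julia
# Assumed grid size, for illustration only; not the values from the actual run.
Nx, Ny = 1024, 1024

# Threads per block and the resulting number of blocks, as described above.
threads = (16, 16)
blocks  = (Nx, Ny) .÷ threads        # integer division, truncates toward zero

# Ceiling division covers the whole grid even when the sizes
# are not exact multiples of the thread counts.
blocks_safe = cld.((Nx, Ny), threads)

# The alternative configuration: same 256 threads per block, different shape.
threads_alt = (32, 8)
blocks_alt  = cld.((Nx, Ny), threads_alt)

println("threads = ", threads, "  blocks = ", blocks, "  blocks_safe = ", blocks_safe)
println("threads = ", threads_alt, "  blocks = ", blocks_alt)
```

If I read the ParallelStencil.jl documentation correctly, these tuples can then be passed to the launch as `@parallel blocks threads kernel!(...)`, where `kernel!` stands for your own kernel; please check that against the version you are using.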
Thank you for your help,
Best regards