Simple CuArray conversion, reverse, and transpose taking too long?

I am ingesting a NetCDF file using the NCDatasets package; the resulting array of type Matrix{Float32} has size (32768,1000). Afterwards, I’m converting the array to a CuArray:

```
using CUDA
using Dates
using NCDatasets

CUDA.allowscalar(false)

# Read in data
t0 = Dates.now()
ds = Dataset(data_path)["power"][1,:,:]
t1 = Dates.now()

# Convert to CUDA array
cu_ds = CuArray(ds)
t2 = Dates.now()

# Transpose CUDA array
cu_ts = transpose(cu_ds)
t3 = Dates.now()

# Reverse CUDA array
cu_rs = reverse!(cu_ts, dims=1)
t4 = Dates.now()
```

The times for each of these steps are as follows:

```
[ Info: Ingest:     1235 milliseconds
[ Info: CUDA array: 2695 milliseconds
[ Info: Transpose:  2718 milliseconds
[ Info: Reverse!:   16385 milliseconds
```

I know I’m doing something wrong, but I cannot figure out what it is. I’ve checked the types to ensure all arrays are CuArrays. I’ve also looked through the CUDA.jl source on GitHub to make sure I’m calling the functions correctly.

In Julia, the first execution of everything takes longer. Here, the GPU compiler is compiling the kernel behind reverse!. Call it a second time and it will be nearly instantaneous.
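You can see this by timing the same call twice. A minimal sketch (using CUDA.@time, which also reports compilation time, on a random array of the same shape as yours):

```
using CUDA

A = CUDA.rand(Float32, 32768, 1000)

# First call triggers kernel compilation for this type combination.
CUDA.@time reverse!(A; dims=1)

# Second call reuses the already-compiled kernel and is much faster.
CUDA.@time reverse!(A; dims=1)
```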

Thanks! I’m aware that the compiler takes a bit longer the first iteration, but I only need to call this function once at the beginning of my script.

Is there any way I can initialize reverse! by calling it on a very small or empty array, pay this “time tax” early on, and have the reverse! function work instantaneously when I actually need to use it? Or will it have a large time penalty regardless?

That would work, as long as the types of objects that will be passed to the kernel are identical. It’s not yet possible to precompile those invocations though, due to a number of missing features and bugs in Julia, so for now it needs to happen at run time.
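A minimal sketch of such a warm-up, using a tiny throwaway CuArray; per the caveat above, the element type and array type must match what you pass later (a plain CuMatrix{Float32} here, so a wrapped array such as a transpose would compile a separate kernel):

```
using CUDA

# Warm-up: compile the reverse! kernel early on a tiny array
# so the first real call doesn't pay the compilation cost.
warmup = CUDA.zeros(Float32, 2, 2)
reverse!(warmup; dims=1)
```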