Caveats to reusing CuArray memory by changing .dims?

Hi,

I have a situation where I’m creating and processing some data on the GPU in a loop. This data is a 3D tensor whose size (nx, ny, nz) can change every iteration, but there are clear upper bounds for the size of each axis: nx <= max_nx, etc. These upper bounds can all be reached simultaneously, but as the values of max_... are relatively small, it is no problem to allocate a CuArray with this maximum size.

Instead of allocating a new data CuArray every iteration, and relying on the garbage collector (and unsafe_free!), I would prefer to just allocate a big CuArray of size (max_nx, max_ny, max_nz) once, and reuse this memory in every iteration, resizing the CuArray (implicitly) to the correct size. This seems to be possible by changing a CuArray’s dims field. For example,

using CUDA

max_nx = max_ny = max_nz = 100
data = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
for _ = 1:5
    nx = rand(1:max_nx); ny = rand(1:max_ny); nz = rand(1:max_nz)
    data.dims = (nx, ny, nz)  # reuse the same memory; 'resize' by mutating the (internal) dims field
    CUDA.rand!(data)
    println("size: ", size(data), "; data sum: ", sum(data), " (expected: ", 0.5f0 * nx * ny * nz, ")")
end

with output like

size: (39, 76, 76); data sum: 112723.05 (expected: 112632.0)
size: (86, 83, 54); data sum: 192607.12 (expected: 192726.0)
size: (16, 43, 91); data sum: 31348.145 (expected: 31304.0)
size: (68, 15, 98); data sum: 50067.98 (expected: 49980.0)
size: (100, 20, 95); data sum: 94950.41 (expected: 95000.0)

(to illustrate that we are indeed only using the appropriate (number of) entries).

So this seems to work fine, and I haven’t encountered any issues so far. But I have this niggling feeling that eventually somewhere this approach will come back to bite me. So I’d like to double-check: Is this a valid way to reuse memory? Could this cause any issues further down the line?


(I’m aware I could alternatively use something like full_data = CuArray{Float32}(undef, max_nx, max_ny, max_nz) and data = @view full_data[1:nx, 1:ny, 1:nz], but the processing is faster using contiguous data. A better view-based approach resulting in contiguous data would be to (slightly) adapt @PeterSimon’s CPU approach here, which uses reshape(view(buffer, 1 : nx * ny * nz), nx, ny, nz) of a (Cu)Vector buffer of length max_nx * max_ny * max_nz. Would this in theory be preferred over altering dims?)


Can you use reshape instead? The memory will still be shared, but you’ll get a different CuArray object (so it does put some additional pressure on the GC, but shouldn’t matter here as you’re using a single monolithic allocation). Mutating dims right now isn’t problematic, but we probably ought to make that field const, which would break your code.
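
Something like this minimal sketch should do, adapting the loop from your first post (untested; the buffer length is just the product of your upper bounds):

using CUDA

max_nx = max_ny = max_nz = 100
buffer = CuVector{Float32}(undef, max_nx * max_ny * max_nz)  # one monolithic allocation
for _ = 1:5
    nx = rand(1:max_nx); ny = rand(1:max_ny); nz = rand(1:max_nz)
    # A fresh CuArray wrapper each iteration, but sharing buffer's memory
    data = reshape(view(buffer, 1:nx*ny*nz), nx, ny, nz)
    CUDA.rand!(data)
    println("size: ", size(data), "; data sum: ", sum(data), " (expected: ", 0.5f0 * nx * ny * nz, ")")
end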


Okay thanks! For future-proofing I’ll then switch to the reshape(view(buffer, ...), ...) approach, which indeed also works. In my full code the performance and memory usage (on the macro scale) seem identical between the .dims and reshape approaches (which are faster than when using @view full_data[1:nx, 1:ny, 1:nz]).

That’s kind of curious though, as contiguous views should result in a CuArray object, just like reshape does.

Perhaps I misunderstand how views work, or what the precise definition of contiguous memory is. But to my understanding, data = @view full_data[1:nx, 1:ny, 1:nz] would be non-contiguous if nx < max_nx (or ny < max_ny), as there’s a memory gap between ‘subsequent’ entries data[nx, 1, 1] and data[1, 2, 1], in order to accommodate full_data[nx + 1, 1, 1], ..., full_data[max_nx, 1, 1]?
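
As a quick CPU-only sanity check of that picture (I’m assuming the layout reasoning carries over to CuArrays unchanged):

full_data = zeros(Float32, 4, 4, 4)   # stands in for the (max_nx, max_ny, max_nz) array
v = @view full_data[1:2, 1:2, 1:2]
copyto!(v, 1:8)
findall(!iszero, vec(full_data))'     # [1 2 5 6 17 18 21 22]: gaps between the used entries

buffer = zeros(Float32, 4 * 4 * 4)
w = reshape(view(buffer, 1:2*2*2), 2, 2, 2)
copyto!(w, 1:8)
findall(!iszero, buffer)'             # [1 2 3 4 5 6 7 8]: densely packed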

Then again, in the example below, I actively try to get worse results for the view approach compared to the dims or buffer approach, by destroying memory coalescing in a memory-bandwidth-limited scenario. But it doesn’t seem to matter.

using CUDA
using BenchmarkTools

function wrap_cubuffer(buffer, dims)
    # similar to PeterSimon's CPU code
    len = prod(dims)
    len <= length(buffer) || error("The CuVector buffer is too small (length: $(length(buffer))) to wrap a CuArray of size $dims")
    return reshape(view(buffer, 1:len), dims)
end

max_nx = 32; max_ny = 128; max_nz = 65536
nx = 1; ny = 1; nz = max_nz

vals = CUDA.rand(Float32, nx, ny, nz)  # The data values we'll use for all three methods

data_via_dims = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_dims.dims = (nx, ny, nz)
copyto!(data_via_dims, vals)

buffer = CuVector{Float32}(undef, max_nx * max_ny * max_nz)
data_via_buffer = wrap_cubuffer(buffer, (nx, ny, nz))
copyto!(data_via_buffer, vals)

base_data_for_view = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_view = @view base_data_for_view[1:nx, 1:ny, 1:nz]
copyto!(data_via_view, vals)

out = similar(vals)  # i.e. CuArray{Float32}(undef, nx, ny, nz)

@btime CUDA.@sync $out .= $data_via_dims .+ 1f0
    #   43.500 μs (65 allocations: 1.50 KiB)
@btime CUDA.@sync $out .= $data_via_buffer .+ 1f0
    #   47.400 μs (65 allocations: 1.50 KiB)
@btime CUDA.@sync $out .= $data_via_view .+ 1f0
    #   41.100 μs (55 allocations: 1.34 KiB)

So here the view approach is even the fastest! Could you then explain the mistakes in my reasoning at the beginning of this post?


EDIT: Perhaps the CPU timing portion in @btime CUDA.@sync ... is (somehow) messing up the results. Using CUDA.@elapsed I get timings closer to what I would have expected:

using Statistics

# mean, median and minimum were not consistent between runs, 
# so I'm using 1%-quantile, which is very consistent 
# (but seemingly quite quantised, I guess due to CUDA.@elapsed returning 32-bit floats)

quantile([CUDA.@elapsed out .= data_via_dims  for _ = 1:10^5], 0.01) * 1e6
    #  14.336000276671257 µs
quantile([CUDA.@elapsed out .= data_via_buffer  for _ = 1:10^5], 0.01) * 1e6
    #  14.336000276671257 µs
quantile([CUDA.@elapsed out .= data_via_view  for _ = 1:10^5], 0.01) * 1e6
    #  18.33600072131958 µs

I would have expected the difference between data_via_view and the other two to have been bigger, though.


Here’s another example more similar to the ‘full code’ I mentioned. In short, it’s taking inner products of slices of data with (a subvector of) coeffs.

using CUDA
using CUDA: i32
using BenchmarkTools

function wrap_cubuffer(buffer, dims)
    len = prod(dims)
    len <= length(buffer) || error("The CuVector buffer is too small (length: $(length(buffer))) to wrap a CuArray of size $dims")
    return reshape(view(buffer, 1:len), dims)
end

max_nx = 16; max_ny = 3; max_nz = 2^20
nx = 1; ny = 3; nz = max_nz

# Same setup as in previous example, except we also get coeffs
vals = CUDA.rand(Float32, nx, ny, nz)
coeffs = CUDA.rand(Float32, max_nx)

data_via_dims = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_dims.dims = (nx, ny, nz)
copyto!(data_via_dims, vals)

buffer = CuVector{Float32}(undef, max_nx * max_ny * max_nz)
data_via_buffer = wrap_cubuffer(buffer, (nx, ny, nz))
copyto!(data_via_buffer, vals)

base_data_for_view = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_view = @view base_data_for_view[1:nx, 1:ny, 1:nz]
copyto!(data_via_view, vals)

result = CuArray{Float32}(undef, ny, nz)

# Kernel for these inner products
function kernel!(result, data, coeffs)
    # Grid-stride loop over the z axis; each thread handles all y (and x) for its z indices
    z_idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    while z_idx <= size(data, 3)
        y_idx = 1i32
        while y_idx <= size(data, 2)
            s = 0f0  # Will contain the dot product of data[:, y_idx, z_idx] and coeffs
            x_idx = 1i32
            while x_idx <= size(data, 1)
                s += data[x_idx, y_idx, z_idx] * coeffs[x_idx]
                x_idx += 1i32
            end
            result[y_idx, z_idx] = s

            y_idx += 1i32
        end

        z_idx += blockDim().x * gridDim().x
    end
    return
end
# The memory layout of data exploits caching rather than memory coalescing,
# since in the full code I also skip about 2/3 of the entries, distributed quite uniformly along the biggest (currently the z) axis.

@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_dims, $coeffs)
    #    85.000 μs (42 allocations: 1.00 KiB)
@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_buffer, $coeffs)
    #    84.300 μs (42 allocations: 1.00 KiB)
@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_view, $coeffs)
    #    330.300 μs (43 allocations: 1.08 KiB)

For this use-case the view approach is then indeed significantly slower.

For analyzing individual kernels like that, it may also be valuable to use CUDA.@profile: CUDA.@bprofile runs the kernel multiple times à la @benchmark but reports only the actual kernel times (i.e., even more accurate than CUDA.@elapsed), while CUDA.@profile trace=true shows more details about the kernels, like register or local memory usage (both of which can significantly impact performance).
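
For the kernel example above, that would look something like this (a sketch reusing the variables from your post):

# Repeatedly run the kernel, reporting device-side kernel times only
CUDA.@bprofile @cuda threads=512 blocks=cld(nz, 512) kernel!(result, data_via_dims, coeffs)

# A single profiled run with a per-kernel trace (registers, local memory, ...)
CUDA.@profile trace=true @cuda threads=512 blocks=cld(nz, 512) kernel!(result, data_via_dims, coeffs)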


I’m not sure it’s fully relevant, but I had forgotten that launching a kernel already takes some 10 µs as a rule of thumb (or closer to 5 µs in my case, as reported by CUDA.@bprofile). But according to CUDA.@bprofile, the difference between the three approaches in the .= copy (or .+ 1f0) example above is still only minor:

CUDA.@bprofile time=5 out .= data_via_dims
    #   18.13 µs ± 0.8    ( 17.44 ‥ 39.04)
CUDA.@bprofile time=5 out .= data_via_buffer
    #   18.16 µs ± 0.84   ( 17.41 ‥ 46.3)
CUDA.@bprofile time=5 out .= data_via_view
    #   19.63 µs ± 26.35  (  16.9 ‥ 1019.6)

However, if I increase the data size (nx * ny * nz == max_nz) to

max_nx = 16; max_ny = 2; max_nz = 2^23
nx = 1; ny = 1; nz = max_nz  # Still only use 1 out of every 32 consecutive entries in memory in the via_view case

I again get a large advantage of dims and buffer over view:

CUDA.@bprofile time=5 out .= data_via_dims
    #  186.6 µs ± 205.62 ( 171.1 ‥ 5209.16)     (174.07999257557094 µs via CUDA.@elapsed)
CUDA.@bprofile time=5 out .= data_via_buffer
    #  187.58 µs ± 219.06 (171.74 ‥ 5235.43)    (174.07999257557094 µs via CUDA.@elapsed)
CUDA.@bprofile time=5 out .= data_via_view
    #  778.42 µs ± 959.34 (728.76 ‥ 23202.88)   (730.7520136237144 µs via CUDA.@elapsed)

In any case, the conclusion remains that the buffer approach is as fast as the dims approach but, in contrast to the latter, should keep working in the future, and is therefore preferred.