Perhaps I misunderstand how views work, or what the precise definition of contiguous memory is. But to my understanding, data = @view full_data[1:nx, 1:ny, 1:nz] would be non-contiguous if nx < max_nx (or ny < max_ny), as there is a memory gap between the ‘subsequent’ entries data[nx, 1, 1] and data[1, 2, 1] in order to accommodate full_data[nx + 1, 1, 1], ..., full_data[max_nx, 1, 1]?
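As a small CPU-side illustration of the layout I have in mind (my own sketch with toy sizes, not part of the benchmark code below), the strides of such a view are simply those of the parent array:

# CPU sketch: the view inherits the parent's strides, so data[nx, 1, 1] and
# data[1, 2, 1] are separated by max_nx - nx elements of full_data in memory.
max_nx, max_ny, max_nz = 4, 3, 2
nx, ny, nz = 2, 3, 2
full_data = reshape(collect(1:max_nx * max_ny * max_nz), max_nx, max_ny, max_nz)
data = @view full_data[1:nx, 1:ny, 1:nz]
strides(data)                                                          # (1, 4, 12), i.e. the parent's strides
LinearIndices(full_data)[nx, 1, 1], LinearIndices(full_data)[1, 2, 1]  # (2, 5): two elements in between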
Then again, in the example below I actively try to get worse results for this view approach compared to the dims or buffer approach, by destroying memory coalescing in a memory-bandwidth-limited scenario. But it doesn’t seem to matter.
using CUDA
using BenchmarkTools
function wrap_cubuffer(buffer, dims)
# similar to PeterSimon's CPU code
len = prod(dims)
len <= length(buffer) || error("The CuVector buffer is too small (length: $(length(buffer))) to wrap a CuArray of size $dims")
return reshape(view(buffer, 1:len), dims)
end
max_nx = 32; max_ny = 128; max_nz = 65536
nx = 1; ny = 1; nz = max_nz
vals = CUDA.rand(Float32, nx, ny, nz) # The data values we'll use for all three methods
data_via_dims = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_dims.dims = (nx, ny, nz)
copyto!(data_via_dims, vals)
buffer = CuVector{Float32}(undef, max_nx * max_ny * max_nz)
data_via_buffer = wrap_cubuffer(buffer, (nx, ny, nz))
copyto!(data_via_buffer, vals)
base_data_for_view = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_view = @view base_data_for_view[1:nx, 1:ny, 1:nz]
copyto!(data_via_view, vals)
out = similar(vals) # i.e. CuArray{Float32}(undef, nx, ny, nz)
@btime CUDA.@sync $out .= $data_via_dims .+ 1f0
# 43.500 μs (65 allocations: 1.50 KiB)
@btime CUDA.@sync $out .= $data_via_buffer .+ 1f0
# 47.400 μs (65 allocations: 1.50 KiB)
@btime CUDA.@sync $out .= $data_via_view .+ 1f0
# 41.100 μs (55 allocations: 1.34 KiB)
So here the view approach is even the fastest! Could you then explain the mistakes in my reasoning at the beginning of this post?
EDIT: Perhaps the CPU timing portion in @btime CUDA.@sync ... is (somehow) messing up the results. Using CUDA.@elapsed I get timings closer to what I would have expected:
using Statistics
# mean, median and minimum were not consistent between runs,
# so I'm using 1%-quantile, which is very consistent
# (but seemingly quite quantised, I guess due to CUDA.@elapsed returning 32-bit floats)
quantile([CUDA.@elapsed out .= data_via_dims for _ = 1:10^5], 0.01) * 1e6
# 14.336000276671257 µs
quantile([CUDA.@elapsed out .= data_via_buffer for _ = 1:10^5], 0.01) * 1e6
# 14.336000276671257 µs
quantile([CUDA.@elapsed out .= data_via_view for _ = 1:10^5], 0.01) * 1e6
# 18.33600072131958 µs
I would have expected the difference between data_via_view and the other two to have been bigger, though.
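Another way to take the host side out of the picture (a sketch assuming a recent CUDA.jl, where CUDA.@profile runs the integrated profiler and reports device-side kernel times separately from host time) would be:

# Sketch: compare the GPU-side time of the broadcast kernel directly,
# instead of timing from the host with @btime CUDA.@sync.
CUDA.@profile out .= data_via_dims .+ 1f0
CUDA.@profile out .= data_via_view .+ 1f0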
Here’s another example, more similar to the ‘full code’ I mentioned. In short, it’s taking inner products of slices of data with (a subvector of) coeffs.
using CUDA
using CUDA: i32
using BenchmarkTools
function wrap_cubuffer(buffer, dims)
len = prod(dims)
len <= length(buffer) || error("The CuVector buffer is too small (length: $(length(buffer))) to wrap a CuArray of size $dims")
return reshape(view(buffer, 1:len), dims)
end
max_nx = 16; max_ny = 3; max_nz = 2^20
nx = 1; ny = 3; nz = max_nz
# Same setup as in previous example, except we also get coeffs
vals = CUDA.rand(Float32, nx, ny, nz)
coeffs = CUDA.rand(Float32, max_nx)
data_via_dims = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_dims.dims = (nx, ny, nz)
copyto!(data_via_dims, vals)
buffer = CuVector{Float32}(undef, max_nx * max_ny * max_nz)
data_via_buffer = wrap_cubuffer(buffer, (nx, ny, nz))
copyto!(data_via_buffer, vals)
base_data_for_view = CuArray{Float32}(undef, max_nx, max_ny, max_nz)
data_via_view = @view base_data_for_view[1:nx, 1:ny, 1:nz]
copyto!(data_via_view, vals)
result = CuArray{Float32}(undef, ny, nz)
# Kernel for these inner products
function kernel!(result, data, coeffs)
z_idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
while z_idx <= size(data, 3)
y_idx = 1i32
while y_idx <= size(data, 2)
s = 0.0f0 # Will contain the dot product of data[:, y_idx, z_idx] and coeffs[1:size(data, 1)]
x_idx = 1i32
while x_idx <= size(data, 1)
s += data[x_idx, y_idx, z_idx] * coeffs[x_idx]
x_idx += 1i32
end
result[y_idx, z_idx] = s
y_idx += 1i32
end
z_idx += blockDim().x * gridDim().x
end
return
end
# The memory layout of data exploits caching rather than memory coalescing:
# in the full code I also skip about 2/3 of the entries, distributed quite uniformly along the biggest (here z) axis.
@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_dims, $coeffs)
# 85.000 μs (42 allocations: 1.00 KiB)
@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_buffer, $coeffs)
# 84.300 μs (42 allocations: 1.00 KiB)
@btime CUDA.@sync @cuda threads=512 blocks=cld(nz, 512) kernel!($result, $data_via_view, $coeffs)
# 330.300 μs (43 allocations: 1.08 KiB)
For this use case, the view approach is then indeed significantly slower.
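A hedged sketch of how one might inspect what the kernel actually receives in each case (cudaconvert is the conversion @cuda applies to its arguments; the comments describe what I would expect rather than verified output):

# Sketch: the dims and buffer variants should arrive on the device as dense CuDeviceArrays,
# while the view arrives as a SubArray wrapper that still carries the parent's
# (1, max_nx, max_nx * max_ny) strides, so every data[x, y, z] needs extra index arithmetic.
typeof(cudaconvert(data_via_dims))
typeof(cudaconvert(data_via_buffer))
typeof(cudaconvert(data_via_view))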