Thanks everyone for your answers. The reshape, view and TensorCast based solutions all work as expected.
However, I’ve realised I made a slight mistake in my initial question. The initial broadcasting operations in PyTorch/numpy won’t use additional memory, but the result of broadcast_a - broadcast_b will be materialised because it is the argument to the sum function. In my head I was looking for a solution that didn’t materialise the intermediate array, but I didn’t explain that well in the question.
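To make the distinction concrete, here is a minimal sketch of where that intermediate appears, using the same newaxis trick benchmarked below (the array sizes here are illustrative, not the benchmark sizes):

using CUDA

a = CUDA.rand(Float32, 3, 2)  # n × d
b = CUDA.rand(Float32, 4, 2)  # m × d

newaxis = [CartesianIndex{0}()]
# The broadcasted subtraction is evaluated eagerly into a full m × n × d
# array before sum reduces over the feature dimension, even when it is
# written inline as the argument to sum. That allocation is what I was
# hoping to avoid.
diff = view(a, newaxis, :, :) .- view(b, :, newaxis, :)
dists = sum(diff, dims=3)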
I’ve managed to produce a partial solution to my specific pairwise distances problem by writing a CUDA kernel. The following script benchmarks some of the solutions presented here.
using CUDA
using TensorCast
function cuda_pairwise!(a, b, c, width, height)
    # Subtract 1 from blockIdx because Julia is 1-indexed, not 0-indexed
    row = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    col = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Early exit for threads that fall outside the output matrix
    if !((1 <= row <= width) & (1 <= col <= height))
        return nothing
    end
    # Accumulate in the output element type rather than an Int literal,
    # which keeps the loop type-stable
    euclidean_distance = zero(eltype(c))
    for i in 1:size(a, 2)
        @inbounds euclidean_distance += (a[row, i] - b[col, i])^2
    end
    @inbounds c[row, col] = sqrt(euclidean_distance)
    return nothing
end
function cuda_pairwise(a, b, numblocks=nothing)
    n = size(a, 1)
    m = size(b, 1)
    c = CUDA.fill(zero(eltype(a)), (n, m))
    # One thread per output element: x covers the m columns, y covers the n rows
    if isnothing(numblocks)
        numblocks = (ceil(Int, m / 32), ceil(Int, n / 32))
    end
    threads = (32, 32)
    @cuda threads=threads blocks=numblocks cuda_pairwise!(a, b, c, n, m)
    return c
end
a = CuArray([0.0f0 0.0f0; 0.0f0 1.0f0; 0.0f0 1.0f0]);
b = CuArray([0.0f0 0.0f0; 0.0f0 1.0f0; 1.0f0 0.0f0; 1.0f0 0.0f0]);
# a = CUDA.fill(0.0f0, (4096, 128));
# b = CUDA.fill(1.0f0, (4096, 128));
a_t = transpose(a) |> CuArray;
b_t = transpose(b) |> CuArray;
println("cuda_pairwise")
CUDA.@sync cuda_pairwise(a, b);
CUDA.@time CUDA.@sync cuda_pairwise(a, b)
newaxis = [CartesianIndex{0}()];
println("newaxis_pairwise")
CUDA.@sync sum(view(a, newaxis, :, :) .- view(b, :, newaxis, :), dims=3);
CUDA.@time CUDA.@sync sum(view(a, newaxis, :, :) .- view(b, :, newaxis, :), dims=3)
println("tensorcast_pairwise")
CUDA.@sync sqrt.(@reduce tmp[i,j] := sum(x) (a_t[x,i] - b_t[x,j])^2);
CUDA.@time CUDA.@sync sqrt.(@reduce tmp[i,j] := sum(x) (a_t[x,i] - b_t[x,j])^2);
This produces the following results:
cuda_pairwise
0.000302 seconds (389 CPU allocations: 6.422 KiB) (1 GPU allocation: 48 bytes, 5.77% gc time of which 81.61% spent allocating)
newaxis_pairwise
0.000362 seconds (400 CPU allocations: 9.516 KiB) (2 GPU allocations: 144 bytes, 6.74% gc time of which 84.43% spent allocating)
tensorcast_pairwise
0.000432 seconds (459 CPU allocations: 9.891 KiB) (3 GPU allocations: 192 bytes, 7.67% gc time of which 85.50% spent allocating)
The CUDA kernel solution only allocates 48 bytes, which is just the 3×4 output array, while the other solutions allocate more because they materialise the intermediate array. This is much more apparent when you use larger matrices (in this case two 4096×128 matrices).
cuda_pairwise
0.022827 seconds (30.16 k CPU allocations: 471.547 KiB) (1 GPU allocation: 64.000 MiB, 0.11% gc time of which 82.63% spent allocating)
newaxis_pairwise
0.323142 seconds (464.30 k CPU allocations: 7.088 MiB, 4.50% gc time) (2 GPU allocations: 8.062 GiB, 4.54% gc time of which 0.13% spent allocating)
tensorcast_pairwise
0.269536 seconds (396.04 k CPU allocations: 6.046 MiB, 1.08% gc time) (3 GPU allocations: 8.125 GiB, 2.25% gc time of which 51.09% spent allocating)
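A rough accounting (my own arithmetic, assuming Float32 and the 4096×128 sizes from the commented-out lines above) matches these numbers:

n, m, d = 4096, 4096, 128
sizeof(Float32) * n * m / 2^20      # output matrix: 64 MiB
sizeof(Float32) * n * m * d / 2^30  # broadcasted n × m × d intermediate: 8 GiB
# 8 GiB + 64 MiB ≈ 8.062 GiB, the two GPU allocations of the newaxis run;
# the TensorCast run adds one more 64 MiB array for the sqrt.(...) copy,
# giving its 8.125 GiB.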
I say “partial” solution because I can’t seem to use Zygote with this function, but that’s a problem for another question.