How to perform explicit array broadcasting like np.broadcast_to/torch.expand in Julia?

Hi all. In Pytorch/numpy you can perform broadcasting with fine-grained and explicit control. To see what I mean here is a numpy example of what I mean.

import numpy as np

a = np.array([[0,0], [0,1], [0, 1]])
b = np.array([[0,0], [0,1], [1, 0], [1, 0]])
broadcast_a = np.broadcast_to(a, (len(b),) + a.shape)
broadcast_b = np.broadcast_to(b, (len(a),) + b.shape)
c = np.sqrt(np.power((broadcast_a - broadcast_b.transpose((1, 0, 2))), 2)\
    .sum(axis=-1)).T

In Pytorch.

import torch

a = torch.tensor([[0,0], [0,1], [0, 1]])
b = torch.tensor([[0,0], [0,1], [1, 0], [1, 0]])
broadcast_a = a.unsqueeze(0).expand((len(b), ) + a.shape)
broadcast_b = b.unsqueeze(0).expand((len(a), ) + b.shape)
c = torch.pow(broadcast_a - broadcast_b.permute((1, 0, 2)), 2)\
    .sum(dim=-1).sqrt().T

Consider a and b to be matrices representing points in a 2D space and the examples above calculate pairwise euclidean distances between all points in a and b. I haven’t profiled these examples but the documentation of Pytorch/numpy assures me that the broadcasted arrays do not use any additional memory over the original arrays, despite having a larger shape.

Both libraries offer an API that lets you specify the exact shape you wish to broadcast to. Is it possible to do this in Julia?

I’ve scanned the broadcasting documentation and there doesn’t seem to be any obvious way to do this. I’m aware that the Distances.jl can perform this specific task but some experimentation shows that it doesn’t work well with GPU arrays and I’d like to know how to perform these kind of operations for applications other than pairwise distance calculations.

I’m not sure if I understand your question in all generality, but this seems to do the same as your python code:

julia> a = [[0,0], [0,1], [0, 1]];

julia> b = [[0,0], [0,1], [1, 0], [1, 0]];

julia> sqrt.(sum.((x->x.^2).(a .- permutedims(b))))
3×4 Array{Float64,2}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421

Perhaps slightly better to use the sum(f, x) notation:

julia> sqrt.(sum.(x->x.^2, a .- permutedims(b)))
3×4 Array{Float64,2}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421

Faster yet, on my laptop, and maybe clearer is

julia> [sqrt(sum(abs2, x .- y)) for x in a, y in b]

And since you are particularly into 2D space, I recommend using vectors of StaticVectors, like this:

julia> using StaticArrays

julia> a = [SA[0,0], SA[0,1], SA[0, 1]];

julia> b = [SA[0,0], SA[0,1], SA[1, 0], SA[1, 0]];

which gives me a 10x speedup on my laptop.

1 Like

I’m not sure I understand this question. Broadcasting will adapt to the sizes of the arrays in question, but you cannot broadcast a 2x3 matrix to 3x5 or anything like that. Can you broadcast to arbitrary sizes with numpy or will it just expand length-1 dimensions of one array to the length of the corresponding length of the other array, just like Julia?

From my vague knowledge of python, I think its a will actually be a solid array, not an array of arrays. (Or, something more like Julia’s Matrix, than like Vector{Vector}.) This will be more efficient than making arrays of small arrays, unless (as suggested) you use StaticArrays.

julia> a = [[0,0], [0,1], [0, 1]];

julia> b = [[0,0], [0,1], [1, 0], [1, 0]];

julia> as = reduce(hcat, a); bs = reduce(hcat, b)
2×4 Matrix{Int64}:
 0  0  1  1
 0  1  0  0

julia> sqrt.(dropdims(sum((as .- reshape(bs, size(bs,1), 1, size(bs,2))).^2, dims=1), dims=1))
3×4 Matrix{Float64}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421

# other tricks for how to write this:

julia> const newaxis = [CartesianIndex{0}()];

julia> view(bs, :, newaxis, :) = reshape(bs, size(bs,1), 1, size(bs,2))
true

julia> using TensorCast

julia> sqrt.(@reduce tmp[i,j] := sum(x) (as[x,i] - bs[x,j])^2)
3×4 Matrix{Float64}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421

Is this true? (I mean as a real question, not snark!) Julia’s broadcasting will always (at present) materialise the big array, before reducing (i.e. the 2×3×4 Array{Int64, 3} which is the argument of sum.) But things like reshape or view(bs, :, newaxis, :) will not not make a copy.

3 Likes

Yes, I was actually aware, but a numpy matrix actually behaves like a vector of vectors, so that a[0] returns [0, 0]. So, semantically, it’s a bit unclear which representation to choose.

Anyway, one can still write

julia> a = [0 0;  0 1; 0 1]
3×2 Matrix{Int64}:
 0  0
 0  1
 0  1

julia> b = [0 0; 0 1; 1 0; 1 0]
4×2 Matrix{Int64}:
 0  0
 0  1
 1  0
 1  0

julia> [sqrt(sum(abs2, x .- y)) for x in eachrow(a), y in eachrow(b)]
3×4 Matrix{Float64}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421
1 Like

As a fellow Python ML person, hello!

I think it’s difficult to suggest a completely general solution because broadcasting semantics differ so much between Numpy-like libraries and Julia. For this specific case, @mcabbott’s first solution above would probably be the most “natural” coming from Python. Tweaked just slightly to match your code snippet:

broadcast_a = as
# Julia doesn't have a built-in unsqueeze, but this is exactly unsqueeze(1) in PyTorch
broadcast_b = reshape(bs, size(bs, 1), 1, size(bs, 2))
c = sqrt.(sum((broadcast_a .- broadcast_b).^2, dims=1))
# optionally, to match the row -> column-first change. N.B. I snuck this in to avoid writing dropdims
c = reshape(c, size(bs, 2), size(as, 2))

At least for PyTorch, yes: torch.Tensor.expand — PyTorch 1.9.0 documentation

Edit: I just tried the versions with comprehensions but did not manage to make them work with Zygote. Assuming you want to run some kind of AD with this, the reshape + broadcasting or einsum versions are the way to go.

You can’t broadcast to arbitrary sizes but you can broadcast an (a, b, c) matrix to (a, b, c, n) or (a, n, b, c) and so on.

In my example I’m taking a 4x2 and a 3x2 matrix, expanding dimensions (“unsqueezing”) to 1x4x2 and 3x1x2 and broadcasting both of these two matrices to 4x3x2.

So it seems broadcasting is like in Julia, and you add in new dimensions. The corresponding thing to do in Julia is reshape then broadcast.

Thanks everyone for your answers. The reshape, view and TensorCast based solutions all work as expected.

However, I’ve realised I’ve made a slight mistake in my initial question. The initial broadcasting operations of PyTorch/numpy won’t use additional memory but the result of broadcast_a - broadcast_b will be materialised as its the argument to the sum function. In my head I was looking for a solution that didn’t materialise the intermediate array but I didn’t explain that well in the question.

I’ve managed to produce a partial solution to my specific pairwise distances problem by writing a CUDA kernel. The following script benchmarks some of the solutions presented here.

using CUDA
using TensorCast

function cuda_pairwise!(a, b, c, width, height)
    # Subtract 1 from blockIdx because Julia is 1-indexed not 0-indexed
    row = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    col = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # Early exit for out of bounds
    if !((1 <= row <= width) & (1 <= col <= height))
        return nothing
    end

    euclidean_distance = 0
    for i in 1:size(a)[2]
        @inbounds euclidean_distance += (a[row, i] - b[col, i]) ^ 2
    end

    @inbounds c[row, col] = sqrt(euclidean_distance)
    
    return nothing
end

function cuda_pairwise(a, b, numblocks=nothing)
    n = size(a)[1]
    m = size(b)[1]
    c = CUDA.fill(zero(eltype(a)), (n, m))
    if isnothing(numblocks)
        numblocks = (ceil(Int, n/32), ceil(Int, n/32))
    end
    threads = (32, 32)
    @cuda threads=threads blocks=numblocks cuda_pairwise!(a, b, c, n, m)
    return c
end

a = CuArray([0.0f0 0.0f0; 0.0f0 1.0f0; 0.0f0 1.0f0]);
b = CuArray([0.0f0 0.0f0; 0.0f0 1.0f0; 1.0f0 0.0f0; 1.0f0 0.0f0]);
# a = CUDA.fill(0.0f0, (4096, 128));
# b = CUDA.fill(1.0f0, (4096, 128));
a_t = transpose(a) |> CuArray;
b_t = transpose(b) |> CuArray;

println("cuda_pairwise")
CUDA.@sync cuda_pairwise(a, b);
CUDA.@time CUDA.@sync cuda_pairwise(a, b)

newaxis = [CartesianIndex{0}()];
println("newaxis_pairwise")
CUDA.@sync sum(view(a, newaxis, :, :) .- view(b, :, newaxis, :), dims=3);
CUDA.@time CUDA.@sync sum(view(a, newaxis, :, :) .- view(b, :, newaxis, :), dims=3)

println("tensorcast_pairwise")
CUDA.@sync sqrt.(@reduce tmp[i,j] := sum(x) (a_t[x,i] - b_t[x,j])^2);
CUDA.@time CUDA.@sync sqrt.(@reduce tmp[i,j] := sum(x) (a_t[x,i] - b_t[x,j])^2);

This produces the following results

cuda_pairwise
  0.000302 seconds (389 CPU allocations: 6.422 KiB) (1 GPU allocation: 48 bytes, 5.77% gc time of which 81.61% spent allocating)
newaxis_pairwise
  0.000362 seconds (400 CPU allocations: 9.516 KiB) (2 GPU allocations: 144 bytes, 6.74% gc time of which 84.43% spent allocating)
tensorcast_pairwise
  0.000432 seconds (459 CPU allocations: 9.891 KiB) (3 GPU allocations: 192 bytes, 7.67% gc time of which 85.50% spent allocating)

The CUDA kernel solution only allocates 48 bytes which is just the output 43 array while the other solutions allocate more because they materialise the intermediate array. This is much more apparent when you use larger matrices (in this case two 4096128 matrices).

cuda_pairwise
  0.022827 seconds (30.16 k CPU allocations: 471.547 KiB) (1 GPU allocation: 64.000 MiB, 0.11% gc time of which 82.63% spent allocating)
newaxis_pairwise
  0.323142 seconds (464.30 k CPU allocations: 7.088 MiB, 4.50% gc time) (2 GPU allocations: 8.062 GiB, 4.54% gc time of which 0.13% spent allocating)
tensorcast_pairwise
  0.269536 seconds (396.04 k CPU allocations: 6.046 MiB, 1.08% gc time) (3 GPU allocations: 8.125 GiB, 2.25% gc time of which 51.09% spent allocating)

I say “partial” solution because I can’t seem to use Zygote with this function, but that’s a problem for another question.

I came up with a hack that could work for you.

Normally, it is possible to broadcast operations between two vectors e.g. to create a matrix from a vector and a transposed vector / adjoint:

u = [1,2,3]
v = [1,2,3,4]
u .- v' # note the '

So if we know what the subtraction does to each element and how to transpose the vector it’s all worked out for us. So let’s try this with static vectors instead of numbers:

using StaticVectors
# normally static arrays are only recommended for smaller sizes!
a = [@SVector(rand(Float32,128)) for _ in 1:4096];
b = [@SVector(rand(Float32,128)) for _ in 1:4096];
a .- b'

Throws a dimension mismatch error. So the SVector does not seem to behave like a number. But we can make it behave like a number by wrapping it in a tuple, which shouldn’t be harmful to the runtime:

a = tuple.(a) # note that Tuple.(a) behaves different
b = tuple.(b)
a .- b'

Another error: “No method matching adjoint(::Tuple{SVector{128, Float32})” - it doesn’t know how to transpose our stuff. But we can tell julia to just leave all the tuple elements as they are for the adjoint:

Base.adjoint(x::Tuple{SVector{N,T}}) where {N,T} = x
a' # yields something like a 1x4096 row vector - we're getting closer!
a .- b'

Next error: “No method matching -(::Tuple, …)” - it doesn’t know how to subtract our tuple-wrapped SVectors. Again, we tell it what to do:

Base.:-(x::Tuple{<:SVector}, y::Tuple{<:SVector}) = first(x) .- first(y)
a .- b' # works. yay!

Ok, so it works on CPU. The nice thing is that it also works on GPU with reasonable performance:

using CUDA # I'm using v3.3.4
agpu = cu(a)
bgpu = cu(b)
agpu .- bgpu # 4096 x 4096

And lastly, to get pairwise distances:

using LinearAlgebra
CUDA.@time norm.(agpu .- bgpu') # 64MB allocs as in your example
out = similar(norm.(agpu .- bgpu'))
CUDA.@time out .= norm.(agpu .- bgpu') # GPU allocations are completely gone
# 0.151008 seconds (286 CPU allocations: 5.406 KiB) on a GTX 1080 Ti

I wouldn’t consider this a good solution; it’s just what I came up with during procrastination… it’s not very extensible, but it should work with AD / Zygote. Results should be the same as with your cuda_pairwise. The materialization of an intermediate array is avoided and allocations are down an order of magnitude, but your version is still about 10 times faster. If you figure out what the derivative is, you can write a gradient kernel and register an rrule with ChainRules, then Zygote will magically work.

using ChainRulesCore
import ChainRulesCore: rrule

# look e.g. here https://github.com/FluxML/NNlib.jl/blob/master/src/softmax.jl
function rrule(::typeof(pairwise_dist), x, y)
    Ω = pairwise_dist(x, s)
    pairwise_dist_pullback(Δ) = # your gradient kernel call here
    return Ω, pairwise_dist_pullback
end

Maybe worth mentioning that for this particular function, the standard trick to write it in array operations uses matrix multiplication, not just broadcast and sum. This ought to be faster, and Zygote/CUDA compatible.

julia> dist_mul(x::AbstractMatrix, y::AbstractMatrix) = sqrt.(sum(abs2, x; dims=1)' .+ sum(abs2, y; dims=1) .- 2 .* (x'*y));

julia> dist_mul(as, bs)
3×4 Matrix{Float64}:
 0.0  1.0  1.0      1.0
 1.0  0.0  1.41421  1.41421
 1.0  0.0  1.41421  1.41421

julia> dist_cast(as, bs) = sqrt.(dropdims(sum((as .- reshape(bs, size(bs,1), 1, size(bs,2))).^2, dims=1), dims=1));

julia> using BenchmarkTools

julia> xs = randn(3,100); ys = randn(3,100);

julia> @btime dist_cast($ys, $xs);
  59.125 μs (11 allocations: 391.02 KiB)

julia> @btime dist_mul($ys, $xs);
  12.833 μs (6 allocations: 158.09 KiB)

julia> using Distances

julia> @btime Distances.pairwise(Euclidean(), $xs, $ys);
  19.833 μs (5 allocations: 80.05 KiB)

julia> dist_cols(as, bs) = [sqrt(sum(abs2, x .- y)) for x in eachcol(as), y in eachcol(bs)];

julia> @btime dist_cols($ys, $xs);
  251.041 μs (10002 allocations: 859.42 KiB)

And another way is to use Tullio, which should work on the GPU, but I haven’t timed it there today. (It makes a second kernel for the gradient.)

julia> using Tullio

julia> dist_tullio(as, bs) = @tullio out[i,j] := (as[x,i] - bs[x,j])^2 |> sqrt  avx=false

julia> @btime dist_tullio($ys, $xs);
  23.167 μs (2 allocations: 78.20 KiB)

julia> using LoopVectorization

julia> dist_tullio(as, bs) = @tullio out[i,j] := (as[x,i] - bs[x,j])^2 |> sqrt

julia> @btime dist_tullio($ys, $xs);
  14.750 μs (2 allocations: 78.20 KiB)

julia> using CUDA, KernelAbstractions, CUDAKernels  # before calling @tullio
3 Likes

Tullio seems good for this, thanks for brining this up. Script for timings on the GPU

using CUDA, KernelAbstractions, CUDAKernels, Tullio

# use cuda_pairwise from earlier

tullio_pairwise(x, y) = @tullio out[i,j] := (x[i,k] - y[j,k])^2 |> sqrt

a = CUDA.fill(0.0f0, (4096, 128));
b = CUDA.fill(1.0f0, (4096, 128));

# Warmup
CUDA.@sync cuda_pairwise(a, b)
CUDA.@sync tullio_pairwise(a, b)

println("cuda_pairwise")
CUDA.@time CUDA.@sync cuda_pairwise(a, b);
println("tullio_pairwise")
CUDA.@time CUDA.@sync tullio_pairwise(a, b);

Outputs

cuda_pairwise
  0.016531 seconds (19.87 k CPU allocations: 310.797 KiB) (1 GPU allocation: 64.000 MiB, 22.93% gc time of which 99.90% spent allocating)
tullio_pairwise
  0.018748 seconds (24.51 k CPU allocations: 386.000 KiB) (1 GPU allocation: 64.000 MiB, 19.20% gc time of which 99.89% spent allocating)