Migrating code from Matlab: Accessing arrays without allocating

One of the performance critical parts of my Matlab code is this:

function M = MFdiff(x1, x2)
% Computes a flattened multidimensional x1' - x2
% x1 is (d x n1)
% x2 is (d x n2)
% M  is (d x n1 * n2)

n1 = size(x1, 2);
n2 = size(x2, 2);

M = repmat(x1, 1, n2) - repelem(x2, 1, n1);
end

In Julia, I tried to write it like this:

function BMdiff!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] = x1 .- x2[:, k]
    end
end

The Julia code does one allocation per each loop iteration and is three times slower than the Matlab version.

I can post full code and benchmarks if there is no obvious performance error. See also What is the Eigen equivalent of Matlab repelem?

A bit faster (350ns vs 680ns) is if you use .=:

function BMdiff!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]
    end
end

but this still creates a view. Fastest and allocation-free is the loop:

function BMdiff2!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)
    d = size(x1,1)

    for k = 1:n2
        for i = 1:d
            for (jj,j) = enumerate(1+n1*(k-1):n1*k)
                out[i, j] = x1[i,jj] - x2[i, k]
            end
        end
    end
end

at 140ns, i.e. 5x faster.

It’ll definitely be easier to help if you can at least post the sizes of matrices you’re interested in (since that might affect the optimal approach) and what the relevant timings from Matlab were. But there’s a one-character change that helps quite a bit, which is to turn the out[...] = x1 .- x2[:, k] into out[...] .= x1 .- x2[:, k] which allows Julia to fuse the broadcasted - with the broadcasted =, avoiding allocating an entire new vector just to hold the right-hand side of that assignment. See: More Dots: Syntactic Loop Fusion in Julia

But I wonder if your loop is even necessary. There’s nothing wrong with loops in Julia, but sometimes using broadcasting can make your intentions clearer, and it should perform almost as well as a well-written loop. In particular, this seems like a case where actually using a 3D array can make your life much easier (rather than collapsing the last two dimensions). Here’s an example implementation that just uses broadcasting:

julia> function mfd!(out, x1, x2)
         d = size(x1, 1)
         out .= reshape(x1, (d, :, 1)) .- reshape(x2, (d, 1, :))
       end
mfd! (generic function with 1 method)

julia> d = 1; n1 = 2; n2 = 3; x1 = rand(d, n1); x2 = rand(d, n2);

julia> out = zeros(d, n1, n2);

julia> mfd!(out, x1, x2)
1Ă—2Ă—3 Array{Float64,3}:
[:, :, 1] =
 -0.11027  -0.207838

[:, :, 2] =
 0.0095438  -0.0880241

[:, :, 3] =
 -0.453903  -0.551471

and if you really need a 2D structure, you can always reshape the result:

julia> reshape(out, (d, :))
1Ă—6 Array{Float64,2}:
 -0.11027  -0.207838  0.0095438  -0.0880241  -0.453903  -0.551471

This performs similarly to the improved loop version (with the .=):

julia> out_flat = zeros(d, n1 * n2);

julia> @btime BMdiff!($out_flat, $x1, $x2)
  103.822 ns (3 allocations: 288 bytes)
 
julia> @btime mfd!($out, $x1, $x2)
  110.422 ns (4 allocations: 224 bytes)
3 Likes
        for i = 1:d
            for (jj,j) = enumerate(1+n1*(k-1):n1*k)
                out[i, j] = x1[i,jj] - x2[i, k]
            end
        end

Those two loops should be inverted for a performance speedup

out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]

This does not use a view (you need @views or view(x2,:,k) for that).

For small arrays, the (well-ordered) loop version is faster, while for large arrays all versions (loops, views or no views) are equally fast for me. Even for small arrays the performance of the broadcasted version is acceptable (within 3x of the fast version), and it’s unlikely that a computation would be limited by this. If it is, consider using StaticArrays.

Do you know where do the allocations come from in the out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k] variant?

Creating a view. Passing a view to any non inlined function will require it to be allocated

Ref:
https://github.com/JuliaLang/julia/pull/29867
https://github.com/JuliaLang/julia/pull/29868

1 Like

@kristoffer.carlsson: @mauro3 was asking about a code without views. x2[:, k] allocates, even if it’s contiguous: RHS slices are not views, even when participating in broadcast. LHS slices don’t allocate, I think.

1 Like

No, the expression there gets lowered to views.

julia> Meta.@lower out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]
:($(Expr(:thunk, CodeInfo(
 1 ─ %1  = (Base.getproperty)(Base.Broadcast, :materialize!)                                                                                                                                                                │
 │   %2  = k - 1                                                                                                                                                                                                            │
 │   %3  = n1 * %2                                                                                                                                                                                                          │
 │   %4  = 1 + %3                                                                                                                                                                                                           │
 │   %5  = n1 * k                                                                                                                                                                                                           │
 │   %6  = %4:%5                                                                                                                                                                                                            │
 │   %7  = (Base.dotview)(out, :, %6)  <------------------------------------ view                                                                                                                                                          │
 │   %8  = (Base.getproperty)(Base.Broadcast, :broadcasted)                                                                                                                                                                 │
 │   %9  = (Base.getindex)(x2, :, k)                                                                                                                                                                                        │
 │   %10 = (%8)(-, x1, %9)                                                                                                                                                                                                  │
 │   %11 = (%1)(%7, %10)                                                                                                                                                                                                    │
 └──       return %11                                                                                                                                                                                                       │
))))

The view on a LHS is not magical and it has to be tracked. When it gets passed to the noninlined copyto! it will cause the view to be allocated (inline copyto! for SubArrays by KristofferC · Pull Request #29868 · JuliaLang/julia · GitHub)

That’s only the LHS that gets view-ified. The indexing expression on the RHS is still just normal indexing.

Yes, I was commenting on “asking about a code without views” when the code had a view and “LHS slices don’t allocate,” when they do allocate.

Fair enough. I was missing the tree for the forest. But those allocations don’t necessarily need to occur — often broadcast manages to arrange things in a way that it doesn’t hit that penalty. You can see this if you make the RHS slice participate in the dot fusion:

julia> d = 5; n1 = 10; n2 = 30; x1 = rand(d, n1); x2 = rand(d, n2); out = zeros(d, n1* n2);

julia> function BMdiff!(out, x1, x2)
           # Broadcasted vector diff

           n1 = size(x1, 2)
           n2 = size(x2, 2)

           for k = 1:n2
               out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
           end
       end
BMdiff! (generic function with 1 method)

julia> @btime BMdiff!($out, $x1, $x2)
  2.790 ÎĽs (0 allocations: 0 bytes)
3 Likes

Ah right, sorry, I was only considering array allocations, not views. So there are two kinds of allocations going on there: arrays and views. out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k] allocates twice, a copy on the RHS and a view on the LHS that gets allocated. The copy on the RHS is semantics and unlikely to change soon, the allocation on the LHS is a performance issue that looks like it might get solved soon. By contrast, out[:, 1+n1*(k-1):n1*k] .= x1 .- view(x2,:,k) allocates two views, both being known performance issues (but one being easier than the other to get rid of). Correct?

Interestingly, commenting out the getindex.( part and we are back to copyto! and allocations:

julia> function BMdiff!(out, x1, x2)
           # Broadcasted vector diff

           n1 = size(x1, 2)
           n2 = size(x2, 2)

           for k = 1:n2
               out[:, 1+n1*(k-1):n1*k] .= x1 # .- getindex.((x2,), axes(x2,1), k)
           end
       end
BMdiff! (generic function with 1 method)

julia> @btime BMdiff!($out, $x1, $x2)
  726.189 ns (30 allocations: 1.88 KiB)

Edit: With https://github.com/JuliaLang/julia/pull/29868 we have:

julia> @btime BMdiff!($out, $x1, $x2)
  422.608 ns (0 allocations: 0 bytes)
1 Like

Thank you for an enlightening discussion! I see that the issue is quite delicate. With your help I was able to make the Julia code competitive. Below are my timings and code for Matlab, C++ and Julia.

Language Time (s)
C++ (Eigen) 12
Matlab (normal use) 14
Julia 18
Matlab (single thread) 20

C++ (Eigen)

#include <iostream>
#include <chrono>

#include <Eigen/Dense>

using namespace Eigen;

template <typename Derived1, typename Derived2>
MatrixXd MFdiff(
  DenseBase<Derived1> const & x1,
  DenseBase<Derived2> const & x2)
{
  int d = x1.rows();
  int n1 = x1.cols();
  int n2 = x2.cols();

  MatrixXd out(d, n1*n2);

  for (int c = 0, idx = 0; c != n2; c+=1, idx += n1)
    out.block(0, idx, d, n1) = x1.colwise() - x2.col(c);

  return out;
}

void test(VectorXi const & ds,
	  VectorXi const & n1s,
	  VectorXi const & n2s)
{
  auto start = std::chrono::high_resolution_clock::now();
  MatrixXd z1 = MatrixXd::Random(ds.maxCoeff(), n1s.maxCoeff());
  MatrixXd z2 = MatrixXd::Random(ds.maxCoeff(), n2s.maxCoeff());
  std::chrono::duration<double> timer1 =
    std::chrono::high_resolution_clock::now() - start;

  start = std::chrono::high_resolution_clock::now();
  
  double s = 0;
  int k = 0;
  
  for (int idx = 0; idx!=ds.rows(); idx++) {
    for (int jdx = 0; jdx!=n1s.rows(); jdx++) {
      for (int kdx = 0; kdx!=n2s.rows(); kdx++) {
	MatrixXd K = MFdiff(z1.topLeftCorner(ds(idx), n1s(jdx)),
			     z2.topLeftCorner(ds(idx), n2s(kdx)));
	s += K(0,0);
	k += 1;
      }
    }
  }
  std::chrono::duration<double> timer2 =
    std::chrono::high_resolution_clock::now() - start;

  std::cout << "Generate " << ds.maxCoeff() * n1s.maxCoeff() + ds.maxCoeff() * n2s.maxCoeff()
	    << " random numers: " << timer1.count() << " s.\n";
  std::cout << "Calculate " << k << " blocks: "
	    << timer2.count() << " s.\n";
  std::cout << "Sum: " << s << '\n';
}  

int main() {
  VectorXi ds = VectorXi::LinSpaced(5, 1, 5);
  VectorXi n1s = VectorXi::LinSpaced(100, 5, 500);
  VectorXi n2s = VectorXi::LinSpaced(100, 5, 500);
  test(ds, n1s, n2s);
}

Matlab

% Reference implementation of the MFdiff
% matlab -singleCompThread -nojvm -nodisplay -r test6

ds = 1:5;
n1s = 5:5:500;
n2s = 5:5:500;

z1 = randn(max(ds), max(n1s));
z2 = randn(max(ds), max(n2s));

%%
tic;
s = 0;
t = 0;
for idx = 1:numel(ds)
    for jdx = 1:numel(n1s)
        for kdx = 1:numel(n2s)
            K = MFdiff(z1(1:ds(idx), 1:n1s(jdx)),...
                z2(1:ds(idx), 1:n2s(kdx)));
            s = s + K(1,1);
            t = t + 1;
        end
    end
end
toc

function M = MFdiff(x1, x2)
% Computes a flattened multidimensional x1' - x2
% x1 is (d x n1)
% x2 is (d x n2)
% M  is (d x n1 * n2)

n1 = size(x1, 2);
n2 = size(x2, 2);

M = repmat(x1, 1, n2) - repelem(x2, 1, n1);
end

Julia

function toTest(ds, n1s, n2s, z1, z2)
    s = 0
    t = 0
    for idx = 1:length(ds)
        for jdx = 1:length(n1s)
            for kdx = 1:length(n2s)
                K = BMdiff(z1[1:ds[idx], 1:n1s[jdx]], z2[1:ds[idx], 1:n2s[kdx]])
                s = s + K[1,1]
                t = t + 1
            end
        end
    end
    return s, t
end


function BMdiff(x1, x2)
    # Broadcasted matrix diff

    d, n1 = size(x1)
    d, n2 = size(x2)

    out = similar(x1, d, n1 * n2)

    BMdiff!(out, x1, x2)

    return out
end

function BMdiff!(out, x1, x2)
    # Broadcasted matrix diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
    end
end

const ds = 1:5
const n1s = 5:5:500
const n2s = 5:5:500

const z1s = randn(maximum(ds), maximum(n1s))
const z2s = randn(maximum(ds), maximum(n2s))

println(toTest(ds, n1s, n2s, z1s, z2s))
1 Like

Profiling indicates that about 30% of the time is spent in similar, so it looks like julia’s allocator is slower than that of these other languages (although I’m not sure if that’s a problem in “real” code)

Related question. Is it safe to use .= with a matrix operation on a side?

For example,
a=rand(2,2)
b=rand(2,2)

c=Array{Float64}(undef,2,2)
for i=1:2
c[:,i].=a[:,i].-b[:,i] + a*b[:,i]
end

As a learner, what should I read to get into the correct mindset? I am a bit surprised that going from

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] = x1 - x2[:, k]
end

to

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]    # Added dots
end

to

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
end

could make such a big difference. I think the semantics are exactly the same, only the syntax gets more complicated.

As a learner, I’d recommend the following:

Always count allocations. If the number of allocs scales with the size N of your testset, then you’re doing something wrong. If you have a handful of tiny allocs independently of N, this is normally benign (the allocator is fast; who cares about an us when your code takes ms). Look at @code_warntype.

Try writing out everything in loops first (no views, no broadcasts, no colons), excepting BLAS (matrix products). Once you have written everything in loops, remember that x[i,j,k] and x[i+1,j,k] correspond to x[n] and x[n+1] (for some n=n(size(x), i,j,k)) and these are adjacent in memory. This is good, and they should be used together (close in time). This tells you how to structure your loops and arrays (outer loop iterates over last index, innermost loop over first index), and permits you to spot terms that you can re-use (hoist one loop-level outwards). Also, spot temporary allocations that you can re-use. In julia, explicit loops are the fastest and easiest to understand way of writing things (excepting BLAS); this has the main disadvantage of being verbose and somewhat ugly in your source files.

Once you did that, try using slick broadcast syntax and views, until performance deteriorates. If you did this a couple of times then you get a feeling for what loop corresponds to what “vectorized” construct and can start writing one-liners that use broadcasting and views.

Remember not to time too small testsets, compared to your real application (branch prediction, call overhead, cache sizes, etc). Remember to profile your entire computation: If something is super cheap anyway then it is not worth optimizing, e.g. code executed only once for initialization, compared to some loop that eats 90% of your execution time.

5 Likes

Hello!

I’m still perplexed by this example. Below are many questions, I would much appreciate help to sort out my mental model of Julia. Thank you for your help!

The manual says that “a[i,j,...] is converted by the compiler to getindex(a, i, j, ...)”. So when I wrote x2[:, k] I should get getindex, right? My guess is that what I really want is getindex. (note the dot), and the compiler can’t give me that.

OK, let’s say I need to manually write getindex. But why do I need to write getindex.((x2,), ...) (note the tuple)? The manual says that getindex wants a collection. OK, I guess I must wrap the array in a tuple (sigh). But why then does the compiler accept

out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)

must you not write

setindex!(out, x1 .- getindex.((x2,), axes(x2,1), k), axes(out,1), 1+n1*(k-1):n1*k)

or even

setindex!((out,), x1 .- getindex.((x2,), axes(x2,1), k), axes(out,1), 1+n1*(k-1):n1*k)

All three work, BTW.

Again, the manual says that setindex! wants a collection. Apparently it’s not the same type of collection that getindex wants.

First of all, I think there might be more rewarding parts of Julia to learn than the intricacies of broadcasting (metaprogramming and macros are a good rabbit-hole). I’ve been coding Julia for years and I must admit I don’t understand all of what was discussed in this thread and definitely only the surface of the broadcasting stuff. I think as a general rule you should follow: if broadcasting works as intended then use it, if not (without doing excessive gymnastics) just write the loop. (similar to @foobar_lv2’s statement above)

Anyway, the call getindex.((x2,), axes(x2,1), k) uses the . to broadcast over the (x2,) tuple, i.e. to apply getindex to each of the elements fo (x2,), i.e. it then just does a normal getindex on x2. This, together with axes, seems to be a trick to “[…] make the RHS slice participate in the dot fusion.” (@mbauman’s comment above; Do note that @mbauman is the master of broadcast!).

PS: have you watched An introduction to high performance custom arrays | Matt Bauman - YouTube?

2 Likes