Migrating code from Matlab: Accessing arrays without allocating

HenrikM · November 12, 2018, 1:13pm

One of the performance critical parts of my Matlab code is this:

function M = MFdiff(x1, x2)
% Computes a flattened multidimensional x1' - x2
% x1 is (d x n1)
% x2 is (d x n2)
% M  is (d x n1 * n2)

n1 = size(x1, 2);
n2 = size(x2, 2);

M = repmat(x1, 1, n2) - repelem(x2, 1, n1);
end

In Julia, I tried to write it like this:

function BMdiff!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] = x1 .- x2[:, k]
    end
end

The Julia code does one allocation per each loop iteration and is three times slower than the Matlab version.

I can post full code and benchmarks if there is no obvious performance error. See also What is the Eigen equivalent of Matlab repelem?

mauro3 · November 12, 2018, 1:45pm

A bit faster (350ns vs 680ns) is if you use .=:

function BMdiff!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]
    end
end

but this still creates a view. Fastest and allocation-free is the loop:

function BMdiff2!(out, x1, x2)
    # Broadcasted vector diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)
    d = size(x1,1)

    for k = 1:n2
        for i = 1:d
            for (jj,j) = enumerate(1+n1*(k-1):n1*k)
                out[i, j] = x1[i,jj] - x2[i, k]
            end
        end
    end
end

at 140ns, i.e. 5x faster.

rdeits · November 12, 2018, 1:53pm

It’ll definitely be easier to help if you can at least post the sizes of matrices you’re interested in (since that might affect the optimal approach) and what the relevant timings from Matlab were. But there’s a one-character change that helps quite a bit, which is to turn the out[...] = x1 .- x2[:, k] into out[...] .= x1 .- x2[:, k] which allows Julia to fuse the broadcasted - with the broadcasted =, avoiding allocating an entire new vector just to hold the right-hand side of that assignment. See: More Dots: Syntactic Loop Fusion in Julia

But I wonder if your loop is even necessary. There’s nothing wrong with loops in Julia, but sometimes using broadcasting can make your intentions clearer, and it should perform almost as well as a well-written loop. In particular, this seems like a case where actually using a 3D array can make your life much easier (rather than collapsing the last two dimensions). Here’s an example implementation that just uses broadcasting:

julia> function mfd!(out, x1, x2)
         d = size(x1, 1)
         out .= reshape(x1, (d, :, 1)) .- reshape(x2, (d, 1, :))
       end
mfd! (generic function with 1 method)

julia> d = 1; n1 = 2; n2 = 3; x1 = rand(d, n1); x2 = rand(d, n2);

julia> out = zeros(d, n1, n2);

julia> mfd!(out, x1, x2)
1×2×3 Array{Float64,3}:
[:, :, 1] =
 -0.11027  -0.207838

[:, :, 2] =
 0.0095438  -0.0880241

[:, :, 3] =
 -0.453903  -0.551471

and if you really need a 2D structure, you can always reshape the result:

julia> reshape(out, (d, :))
1×6 Array{Float64,2}:
 -0.11027  -0.207838  0.0095438  -0.0880241  -0.453903  -0.551471

This performs similarly to the improved loop version (with the .=):

julia> out_flat = zeros(d, n1 * n2);

julia> @btime BMdiff!($out_flat, $x1, $x2)
  103.822 ns (3 allocations: 288 bytes)
 
julia> @btime mfd!($out, $x1, $x2)
  110.422 ns (4 allocations: 224 bytes)

antoine-levitt · November 12, 2018, 2:08pm

        for i = 1:d
            for (jj,j) = enumerate(1+n1*(k-1):n1*k)
                out[i, j] = x1[i,jj] - x2[i, k]
            end
        end

Those two loops should be inverted for a performance speedup

out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]

This does not use a view (you need @views or view(x2,:,k) for that).

For small arrays, the (well-ordered) loop version is faster, while for large arrays all versions (loops, views or no views) are equally fast for me. Even for small arrays the performance of the broadcasted version is acceptable (within 3x of the fast version), and it’s unlikely that a computation would be limited by this. If it is, consider using StaticArrays.

mauro3 · November 12, 2018, 3:15pm

Do you know where do the allocations come from in the out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k] variant?

kristoffer.carlsson · November 12, 2018, 3:47pm

Creating a view. Passing a view to any non inlined function will require it to be allocated

Ref:
https://github.com/JuliaLang/julia/pull/29867
https://github.com/JuliaLang/julia/pull/29868

antoine-levitt · November 12, 2018, 3:50pm

@kristoffer.carlsson: @mauro3 was asking about a code without views. x2[:, k] allocates, even if it’s contiguous: RHS slices are not views, even when participating in broadcast. LHS slices don’t allocate, I think.

kristoffer.carlsson · November 12, 2018, 3:57pm

No, the expression there gets lowered to views.

julia> Meta.@lower out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]
:($(Expr(:thunk, CodeInfo(
 1 ─ %1  = (Base.getproperty)(Base.Broadcast, :materialize!)                                                                                                                                                                │
 │   %2  = k - 1                                                                                                                                                                                                            │
 │   %3  = n1 * %2                                                                                                                                                                                                          │
 │   %4  = 1 + %3                                                                                                                                                                                                           │
 │   %5  = n1 * k                                                                                                                                                                                                           │
 │   %6  = %4:%5                                                                                                                                                                                                            │
 │   %7  = (Base.dotview)(out, :, %6)  <------------------------------------ view                                                                                                                                                          │
 │   %8  = (Base.getproperty)(Base.Broadcast, :broadcasted)                                                                                                                                                                 │
 │   %9  = (Base.getindex)(x2, :, k)                                                                                                                                                                                        │
 │   %10 = (%8)(-, x1, %9)                                                                                                                                                                                                  │
 │   %11 = (%1)(%7, %10)                                                                                                                                                                                                    │
 └──       return %11                                                                                                                                                                                                       │
))))

The view on a LHS is not magical and it has to be tracked. When it gets passed to the noninlined copyto! it will cause the view to be allocated (inline copyto! for SubArrays by KristofferC · Pull Request #29868 · JuliaLang/julia · GitHub)

mbauman · November 12, 2018, 4:03pm

That’s only the LHS that gets view-ified. The indexing expression on the RHS is still just normal indexing.

kristoffer.carlsson:

julia> Meta.@lower out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]
:($(Expr(:thunk, CodeInfo(
 1 ─ %1  = (Base.getproperty)(Base.Broadcast, :materialize!)                                                                                                                                                                │
 │   %2  = k - 1                                                                                                                                                                                                            │
 │   %3  = n1 * %2                                                                                                                                                                                                          │
 │   %4  = 1 + %3                                                                                                                                                                                                           │
 │   %5  = n1 * k                                                                                                                                                                                                           │
 │   %6  = %4:%5                                                                                                                                                                                                            │
 │   %7  = (Base.dotview)(out, :, %6)  <------------------------------------ view                                                                                                                                                          │
 │   %8  = (Base.getproperty)(Base.Broadcast, :broadcasted)                                                                                                                                                                 │
 │   %9  = (Base.getindex)(x2, :, k)   <------------------------------------ Not a view!                                                                                                                                                                                    │
 │   %10 = (%8)(-, x1, %9)                                                                                                                                                                                                  │
 │   %11 = (%1)(%7, %10)                                                                                                                                                                                                    │
 └──       return %11                                                                                                                                                                                                       │
))))

kristoffer.carlsson · November 12, 2018, 4:05pm

Yes, I was commenting on “asking about a code without views” when the code had a view and “LHS slices don’t allocate,” when they do allocate.

mbauman · November 12, 2018, 4:20pm

Fair enough. I was missing the tree for the forest. But those allocations don’t necessarily need to occur — often broadcast manages to arrange things in a way that it doesn’t hit that penalty. You can see this if you make the RHS slice participate in the dot fusion:

julia> d = 5; n1 = 10; n2 = 30; x1 = rand(d, n1); x2 = rand(d, n2); out = zeros(d, n1* n2);

julia> function BMdiff!(out, x1, x2)
           # Broadcasted vector diff

           n1 = size(x1, 2)
           n2 = size(x2, 2)

           for k = 1:n2
               out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
           end
       end
BMdiff! (generic function with 1 method)

julia> @btime BMdiff!($out, $x1, $x2)
  2.790 μs (0 allocations: 0 bytes)

antoine-levitt · November 12, 2018, 4:27pm

Ah right, sorry, I was only considering array allocations, not views. So there are two kinds of allocations going on there: arrays and views. out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k] allocates twice, a copy on the RHS and a view on the LHS that gets allocated. The copy on the RHS is semantics and unlikely to change soon, the allocation on the LHS is a performance issue that looks like it might get solved soon. By contrast, out[:, 1+n1*(k-1):n1*k] .= x1 .- view(x2,:,k) allocates two views, both being known performance issues (but one being easier than the other to get rid of). Correct?

kristoffer.carlsson · November 12, 2018, 4:33pm

Interestingly, commenting out the getindex.( part and we are back to copyto! and allocations:

julia> function BMdiff!(out, x1, x2)
           # Broadcasted vector diff

           n1 = size(x1, 2)
           n2 = size(x2, 2)

           for k = 1:n2
               out[:, 1+n1*(k-1):n1*k] .= x1 # .- getindex.((x2,), axes(x2,1), k)
           end
       end
BMdiff! (generic function with 1 method)

julia> @btime BMdiff!($out, $x1, $x2)
  726.189 ns (30 allocations: 1.88 KiB)

Edit: With https://github.com/JuliaLang/julia/pull/29868 we have:

julia> @btime BMdiff!($out, $x1, $x2)
  422.608 ns (0 allocations: 0 bytes)

HenrikM · November 13, 2018, 1:16pm

Thank you for an enlightening discussion! I see that the issue is quite delicate. With your help I was able to make the Julia code competitive. Below are my timings and code for Matlab, C++ and Julia.

Language	Time (s)
C++ (Eigen)	12
Matlab (normal use)	14
Julia	18
Matlab (single thread)	20

C++ (Eigen)

#include <iostream>
#include <chrono>

#include <Eigen/Dense>

using namespace Eigen;

template <typename Derived1, typename Derived2>
MatrixXd MFdiff(
  DenseBase<Derived1> const & x1,
  DenseBase<Derived2> const & x2)
{
  int d = x1.rows();
  int n1 = x1.cols();
  int n2 = x2.cols();

  MatrixXd out(d, n1*n2);

  for (int c = 0, idx = 0; c != n2; c+=1, idx += n1)
    out.block(0, idx, d, n1) = x1.colwise() - x2.col(c);

  return out;
}

void test(VectorXi const & ds,
	  VectorXi const & n1s,
	  VectorXi const & n2s)
{
  auto start = std::chrono::high_resolution_clock::now();
  MatrixXd z1 = MatrixXd::Random(ds.maxCoeff(), n1s.maxCoeff());
  MatrixXd z2 = MatrixXd::Random(ds.maxCoeff(), n2s.maxCoeff());
  std::chrono::duration<double> timer1 =
    std::chrono::high_resolution_clock::now() - start;

  start = std::chrono::high_resolution_clock::now();
  
  double s = 0;
  int k = 0;
  
  for (int idx = 0; idx!=ds.rows(); idx++) {
    for (int jdx = 0; jdx!=n1s.rows(); jdx++) {
      for (int kdx = 0; kdx!=n2s.rows(); kdx++) {
	MatrixXd K = MFdiff(z1.topLeftCorner(ds(idx), n1s(jdx)),
			     z2.topLeftCorner(ds(idx), n2s(kdx)));
	s += K(0,0);
	k += 1;
      }
    }
  }
  std::chrono::duration<double> timer2 =
    std::chrono::high_resolution_clock::now() - start;

  std::cout << "Generate " << ds.maxCoeff() * n1s.maxCoeff() + ds.maxCoeff() * n2s.maxCoeff()
	    << " random numers: " << timer1.count() << " s.\n";
  std::cout << "Calculate " << k << " blocks: "
	    << timer2.count() << " s.\n";
  std::cout << "Sum: " << s << '\n';
}  

int main() {
  VectorXi ds = VectorXi::LinSpaced(5, 1, 5);
  VectorXi n1s = VectorXi::LinSpaced(100, 5, 500);
  VectorXi n2s = VectorXi::LinSpaced(100, 5, 500);
  test(ds, n1s, n2s);
}

Matlab

% Reference implementation of the MFdiff
% matlab -singleCompThread -nojvm -nodisplay -r test6

ds = 1:5;
n1s = 5:5:500;
n2s = 5:5:500;

z1 = randn(max(ds), max(n1s));
z2 = randn(max(ds), max(n2s));

%%
tic;
s = 0;
t = 0;
for idx = 1:numel(ds)
    for jdx = 1:numel(n1s)
        for kdx = 1:numel(n2s)
            K = MFdiff(z1(1:ds(idx), 1:n1s(jdx)),...
                z2(1:ds(idx), 1:n2s(kdx)));
            s = s + K(1,1);
            t = t + 1;
        end
    end
end
toc

function M = MFdiff(x1, x2)
% Computes a flattened multidimensional x1' - x2
% x1 is (d x n1)
% x2 is (d x n2)
% M  is (d x n1 * n2)

n1 = size(x1, 2);
n2 = size(x2, 2);

M = repmat(x1, 1, n2) - repelem(x2, 1, n1);
end

Julia

function toTest(ds, n1s, n2s, z1, z2)
    s = 0
    t = 0
    for idx = 1:length(ds)
        for jdx = 1:length(n1s)
            for kdx = 1:length(n2s)
                K = BMdiff(z1[1:ds[idx], 1:n1s[jdx]], z2[1:ds[idx], 1:n2s[kdx]])
                s = s + K[1,1]
                t = t + 1
            end
        end
    end
    return s, t
end


function BMdiff(x1, x2)
    # Broadcasted matrix diff

    d, n1 = size(x1)
    d, n2 = size(x2)

    out = similar(x1, d, n1 * n2)

    BMdiff!(out, x1, x2)

    return out
end

function BMdiff!(out, x1, x2)
    # Broadcasted matrix diff

    n1 = size(x1, 2)
    n2 = size(x2, 2)

    for k = 1:n2
        out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
    end
end

const ds = 1:5
const n1s = 5:5:500
const n2s = 5:5:500

const z1s = randn(maximum(ds), maximum(n1s))
const z2s = randn(maximum(ds), maximum(n2s))

println(toTest(ds, n1s, n2s, z1s, z2s))

antoine-levitt · November 13, 2018, 1:31pm

Profiling indicates that about 30% of the time is spent in similar, so it looks like julia’s allocator is slower than that of these other languages (although I’m not sure if that’s a problem in “real” code)

zymbuzz · November 13, 2018, 1:51pm

Related question. Is it safe to use .= with a matrix operation on a side?

For example,
a=rand(2,2)
b=rand(2,2)

c=Array{Float64}(undef,2,2)
for i=1:2
c[:,i].=a[:,i].-b[:,i] + a*b[:,i]
end

HenrikM · November 13, 2018, 1:54pm

As a learner, what should I read to get into the correct mindset? I am a bit surprised that going from

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] = x1 - x2[:, k]
end

to

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] .= x1 .- x2[:, k]    # Added dots
end

to

for k = 1:n2
    out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)
end

could make such a big difference. I think the semantics are exactly the same, only the syntax gets more complicated.

foobar_lv2 · November 13, 2018, 2:38pm

As a learner, I’d recommend the following:

Always count allocations. If the number of allocs scales with the size N of your testset, then you’re doing something wrong. If you have a handful of tiny allocs independently of N, this is normally benign (the allocator is fast; who cares about an us when your code takes ms). Look at @code_warntype.

Try writing out everything in loops first (no views, no broadcasts, no colons), excepting BLAS (matrix products). Once you have written everything in loops, remember that x[i,j,k] and x[i+1,j,k] correspond to x[n] and x[n+1] (for some n=n(size(x), i,j,k)) and these are adjacent in memory. This is good, and they should be used together (close in time). This tells you how to structure your loops and arrays (outer loop iterates over last index, innermost loop over first index), and permits you to spot terms that you can re-use (hoist one loop-level outwards). Also, spot temporary allocations that you can re-use. In julia, explicit loops are the fastest and easiest to understand way of writing things (excepting BLAS); this has the main disadvantage of being verbose and somewhat ugly in your source files.

Once you did that, try using slick broadcast syntax and views, until performance deteriorates. If you did this a couple of times then you get a feeling for what loop corresponds to what “vectorized” construct and can start writing one-liners that use broadcasting and views.

Remember not to time too small testsets, compared to your real application (branch prediction, call overhead, cache sizes, etc). Remember to profile your entire computation: If something is super cheap anyway then it is not worth optimizing, e.g. code executed only once for initialization, compared to some loop that eats 90% of your execution time.

HenrikM · November 19, 2018, 8:43am

Hello!

I’m still perplexed by this example. Below are many questions, I would much appreciate help to sort out my mental model of Julia. Thank you for your help!

The manual says that “a[i,j,...] is converted by the compiler to getindex(a, i, j, ...)”. So when I wrote x2[:, k] I should get getindex, right? My guess is that what I really want is getindex. (note the dot), and the compiler can’t give me that.

OK, let’s say I need to manually write getindex. But why do I need to write getindex.((x2,), ...) (note the tuple)? The manual says that getindex wants a collection. OK, I guess I must wrap the array in a tuple (sigh). But why then does the compiler accept

out[:, 1+n1*(k-1):n1*k] .= x1 .- getindex.((x2,), axes(x2,1), k)

must you not write

setindex!(out, x1 .- getindex.((x2,), axes(x2,1), k), axes(out,1), 1+n1*(k-1):n1*k)

or even

setindex!((out,), x1 .- getindex.((x2,), axes(x2,1), k), axes(out,1), 1+n1*(k-1):n1*k)

All three work, BTW.

Again, the manual says that setindex! wants a collection. Apparently it’s not the same type of collection that getindex wants.

mauro3 · November 19, 2018, 9:25am

First of all, I think there might be more rewarding parts of Julia to learn than the intricacies of broadcasting (metaprogramming and macros are a good rabbit-hole). I’ve been coding Julia for years and I must admit I don’t understand all of what was discussed in this thread and definitely only the surface of the broadcasting stuff. I think as a general rule you should follow: if broadcasting works as intended then use it, if not (without doing excessive gymnastics) just write the loop. (similar to @foobar_lv2’s statement above)

Anyway, the call getindex.((x2,), axes(x2,1), k) uses the . to broadcast over the (x2,) tuple, i.e. to apply getindex to each of the elements fo (x2,), i.e. it then just does a normal getindex on x2. This, together with axes, seems to be a trick to “[…] make the RHS slice participate in the dot fusion.” (@mbauman’s comment above; Do note that @mbauman is the master of broadcast!).

PS: have you watched An introduction to high performance custom arrays | Matt Bauman - YouTube?

Topic		Replies	Views
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1003	June 21, 2021
Array broadcasting slower than numpy? Performance	20	721	June 4, 2022
Trying to understand how to write good code (example) New to Julia	34	3006	February 16, 2017
Porting code from MatLab - performance tips New to Julia	18	425	June 26, 2024
Common allocation mistakes Performance memory-allocation	47	7123	August 21, 2023

Migrating code from Matlab: Accessing arrays without allocating

C++ (Eigen)

Matlab

Julia

Related topics