Passing sub-array by reference / performance

#1

In trying to pass subarrays by reference in a computer intensive code, I found the “view” function, which appeared to be the solution. However, I see important performance issues using that option relative to passing the actual arrays. The following example illustrates our observations:

The “distance” is computed with one of three functions. The first one receives the indexes and the whole vectors (thus their addresses), and is fast. The second receives a subarray (by copying), and is very slow. The third receives the vector ranges passed with the view function, and is faster than passing the subarray, but still much slower than the first one.

function dfunc_all(i,j,x,y)
  dist = (x[i,1]-y[j,1])^2 + (x[i,2]-y[j,2])^2 + (x[i,3]-y[j,3])^2
  return dist
end

function dfunc_subarray(x,y)
  dist = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
  return dist
end

function dfunc_view(x,y)
  dist = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
  return dist
end

function dist(n,x,y,type)
  for i in 1:n
    for j in 1:n
      if type == 1
        dist = dfunc_all(i,j,x,y)
      end
      if type == 2
        dist = dfunc_subarray(x[i,1:3],y[j,1:3])
      end
      if type == 3
        dist = dfunc_view(view(x,i,1:3),view(y,j,1:3))
      end
    end
  end
end

n = 10000
x = rand(n,3)
y = rand(n,3)

print("dfunc_all: ")
@time dist(n,x,y,1)
print("dfunc_subarray: ")
@time dist(n,x,y,2)
print("dfunc_view: ")
@time dist(n,x,y,3)

dfunc_all:   0.179748 seconds (93.56 k allocations: 4.416 MiB)
dfunc_subarray:   5.541814 seconds (200.00 M allocations: 20.862 GiB, 5.47% gc time)
dfunc_view:   1.602188 seconds (200.00 M allocations: 11.921 GiB, 8.95% gc time)

I could not find a solution to this problem, that is, to using a function that receives a vector as an argument while passing to it a subarray of a higher dimensional array.

Any help is appreciated.

#2

Try using x=rand(3,n) so that you access the data by column. Views get a speedup of a factor of 30 and memory allocation goes down to almost 0 (views still allocate).

4 Likes
#3

Impressive. I didn’t expect that to be that important, particularly for the difference between using view and the direct vector passing. It completely solves the problem, and view becomes even faster than the explicit indexing:

dfunc_all:   0.162120 seconds (101.27 k allocations: 4.682 MiB)
dfunc_subarray:   5.205288 seconds (200.00 M allocations: 20.862 GiB, 3.70% gc time)
dfunc_view:   0.102676 seconds (4 allocations: 160 bytes)
#4

Judging by (1) the fact that you are using @time to benchmark your versions, and (2) the large number of allocations in the dfunc_all version above, I would say that you’re probably not measuring what you think you are: there is probably some compilation happening in your tests in the dfunc_all version but not the dfunc_view version, which biases things.

This particular problem can be solved by using @btime (from the BenchmarkTools package) instead of @time.

Another problem might arise from the fact that you are doing nothing with the results of your computation, which allows the compiler to perform various optimizations, again leading you to benchmark things that are potentially not comparable. This problem can be solved by doing something (anything) with the results, for example accumulating them in a variable. A collateral benefit from this is that you get a way of catching errors in case all variants would compute different things.

This is what I get in the end: the implementation based on views performs as well as the implementation using the whole array.

julia> @btime dist($n,$x,$y,1)
  396.097 ms (0 allocations: 0 bytes)
4.971435615520719e7

julia> @btime dist($n,$x,$y,2)
  7.582 s (200000000 allocations: 20.86 GiB)
4.971435615520719e7

julia> @btime dist($n,$x,$y,3)
  397.164 ms (0 allocations: 0 bytes)
4.971435615520719e7

I also took the liberty of fixing a few stylistic issues in your code:

  • you don’t have to explicitly return from a function: the result of the last expression is taken as the return value of the function;
  • it is easy to use variable names that shadow function names (like your use of the name dist for many different things). In more complex situations, this could lead to hard-to-debug issues;
  • the same goes for type, which is a keyword of the language. I’m not sure it is that problematic to have a variable named like this, but at least it messes up the automatic indentation and syntax coloring in my editor.

Here is the complete code if you want to test it:

(click to unfold)
function dfunc_all(i,j,x,y)
    (x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2
end

function dfunc_subarray(x,y)
    (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dfunc_view(x,y)
    (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dist(n,x,y,typ)
    acc = zero(eltype(x))
    for i in 1:n
        for j in 1:n
            if typ == 1
                d = dfunc_all(i,j,x,y)
            end
            if typ == 2
                d = dfunc_subarray(x[:,i],y[:,j])
            end
            if typ == 3
                d = dfunc_view(view(x,:,i),view(y,:,j))
            end

            acc += d
        end
    end
    acc
end

n = 10000;
x = rand(3,n);
y = rand(3,n);

using BenchmarkTools
print("dfunc_all: ")
@btime dist($n,$x,$y,1)
print("dfunc_subarray: ")
@btime dist($n,$x,$y,2)
print("dfunc_view: ")
@btime dist($n,$x,$y,3)
5 Likes
#5

Just wanted to point out that they only allocate if they escape. For example, in @ffevotte’s code, they did not:

4 Likes
#6

I added summing the distances and returning the sum to the code. Using @btime, I get:

With the original (row-based vector):

dfunc_all:   348.380 ms (1 allocation: 16 bytes)
dfunc_subarray:   5.346 s (200000001 allocations: 20.86 GiB)
dfunc_view:   1.640 s (200000001 allocations: 11.92 GiB)

And with the better column-based vectors:

dfunc_all:   239.150 ms (1 allocation: 16 bytes)
dfunc_subarray:   5.433 s (200000001 allocations: 20.86 GiB)
dfunc_view:   193.442 ms (1 allocation: 16 bytes)

I really appreciate all the comments.

Two small doubts:

  1. Why some of you call the function with dist($n,$x,$y,3)? (that is, why the “$” there?).
  2. What views escaping means (in which case it allocates)?

Thank you.

#7

This is good practice when using BenchmarkTools. Quoting the documentation: “A good rule of thumb is that external variables should be explicitly interpolated into the benchmark expression


(I’m not necessarily the best person to answer this, so please bear with my handwavy explanation.
Also, for the sake of clarity I’m using the word “function” below, when I should really be speaking of “methods”, but I don’t think the difference matters much here).

This has to do with how the compiler works. You probably know that Julia uses a Just Ahead of Time (JAoT) compiler, which compiles the code of your functions specifically for the type of arguments they are called with, when these arguments are known. This means for example that your functions dfunc_subarray and dfunc_views are exactly the same: you could have only one dfunc function, which would be recompiled to produce different machine code depending whether it is called with x and y being array views or copied sub-arrays.

Now this dfunc function is so small that it probably gets inlined, that is to say its code is (more or less) injected in the place of the function call. When you are calling dist(n, x, y, 3), the compiler thus compiles a specific function which behaves (more or less) as if you had declared:

function dist3(n, x, y) # the 4th argument is a constant
  for i in 1:n
    for j in 1:n 
        # no more ifs: the tests are resolved statically since `type` is
        # known to be 3. This is known as "constants propagation".

        # arguments to dfunc_views
        xx = view(x, :, i)
        yy = view(y, :, j)

        # inlined call to dfunc_views
        d = (xx[1]-yy[1])^2 + (xx[2]-yy[2])^2 + (xx[3]-yy[3])^2
      end
    end
  end
end

And now the compiler is smart enough to compile this to (more or less) the same machine code as what gets produced by your first variant (where the full arrays are passed along with the corresponding indices).

Does this make sense?


EDIT:
I got lost somewhere and did not finish explaining the link between all this and the escape (or lack thereof) of views.

When the compiler sees things such as

xx = view(x, :, i)
xx[1] # equivalent to getindex(xx, 1)

it knows that xx is of a certain type, so that it uses a very specific version of the [] operator (or rather the getindex function), which is specialized for the type of view objects. This variant is defined in such a way that xx[1] above is equivalent to x[1,i]. And the compiler knows this statically. So no view object actually gets created in the end (hence no allocation, as highlighted by Elrod above). This is what is meant by “no views escape”. If the dfunc_views function had not been inlined, then view objects would have had to be instanciated, to be passed as arguments to dfunc_views: these views would have “escaped” the scope of dist. And these instanciations would have been revealed by the presence of (small) allocations.

3 Likes
#8

Fantastic! Thank you!

#9

Following up the performance analysis with a more realistic code:

function dfunc_view(x,y)
  return (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dsum(n,x,y,itype)
  dsum = 0.
  for i in 1:n
    for j in 1:n
      if itype == 1
        dsum = dsum + dfunc_view(view(x,1:3,i),view(y,1:3,j))
      end
      if itype == 2
        dsum = dsum + ((x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2)
      end
      if itype == 3
        dsum = dsum + (x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2
      end
    end
  end
  return dsum
end

n = 10000
x = rand(3,n)
y = rand(3,n)

print("view: ")
@btime dsum($n,$x,$y,1)
print("parenthesis: ")
@btime dsum($n,$x,$y,2)
print("no parenthesis: ")
@btime dsum($n,$x,$y,3)

Result:

view:   195.145 ms (0 allocations: 0 bytes)
parenthesis:   217.608 ms (0 allocations: 0 bytes)
no parenthesis:   308.934 ms (0 allocations: 0 bytes)

(in practice only one of the options persists, and the times are slightly reduced, but with the same trend).

I think I understand these results, and only wanted to register them.

#10

That is not really a good way to benchmark because you have conditionals in the loop which depend on run time values. It is possible that the compiler does loop duplication and move the conditional outside the loop, or perhaps Julia’s constant propagation takes care of it, but in general, it is good to have the benchmark code as “untainted” as possible.

#11

I have tested without the conditionals, the times are all reduced, but the trend is the same. (edit: In this case!)

#12

The results are, leaving only one of the options each time:

view:   147.059 ms (0 allocations: 0 bytes)
parenthesis:   175.737 ms (0 allocations: 0 bytes)
no parenthesis:   300.870 ms (0 allocations: 0 bytes)
#13

In this case yes, but not in general. There might be optimizations available for one of the algorithm that gets ruined by the conditional while not for others, so the advice still stands.

2 Likes
#14

They allocate if the compiler cannot prove that they don’t escape. This is an important distinction: In order to write fast julia code, you need to either understand the details of the compiler (I don’t) or at least have a gut feeling for what kind of properties the compiler is good at proving (how I work since the 0.6 -> 0.7 switch).

Unfortunately this affects API design: dfunc_view was a terrible design choice in 0.6 and has become viable since 0.7 due to compiler optimizations; but it is somewhat intransparent which APIs are viable today. This is one of my eternal gripes: Starting with 0.7, it is very hard (and undocumentable) to figure out what abstractions are zero-cost, depending on context, compiler version and phase of the moon.

Example:

julia> using BenchmarkTools
julia> N=1_000; a=rand(3, N); arr=[];


julia> @inline function dot_escape(a,b)
       @fastmath @inbounds s=a[1]*b[1]+a[2]*b[2]+a[3]*b[3]
       if rand()<1e-12
       global arr
       push!(arr, a)
       push!(arr, b)
       @assert false
       end
       return s
       end;

julia> @inline function dot_noescape(a,b)
       @fastmath @inbounds s=a[1]*b[1]+a[2]*b[2]+a[3]*b[3]
       if rand()<1e-12
       @assert false
       end
       return s
       end;

julia> @noinline function kloop(kernel, mat)
       s=0.0
       @inbounds for i=1:size(mat, 2)
       for j=1:size(mat, 2)
       s += kernel(view(mat, :, i), view(mat, :, j))
       end
       end
       s
       end;

julia> @btime kloop(dot_escape, a)
  32.272 ms (2000001 allocations: 91.55 MiB)
750662.4844627458

julia> @btime kloop(dot_noescape, a)
  3.610 ms (1 allocation: 16 bytes)
750662.4844627458

We see that the hypothetical (never executed! We didn’t see the assertion failure) code slowed down execution significantly and caused lots of extra allocations.

The above is easy to figure out: dot_escape has an execution path leading to escape, and the compiler is currently bad at deferring allocation until the last moment. Some people are working on that.

Somewhat harder to figure out is what exception-based (never taken) execution-paths lead to eager allocations in the current compiler.

1 Like
#15

You are missing a third option, which is to run the code and measure it.

Could you elaborate on what API you feel you can’t write because of performance limitations in Julia?

1 Like
#16

Such is the way with nearly any sufficiently advanced compiler. Measurement is indeed the answer, but Julia also has a fourth option: you can ask it for its intermediate work. It takes some time to learn, but if you start with small kernels like this and work your way up it’s quite doable. And it has been getting much better lately — now it’s at the point where you can just look at @code_typed and look for the %new. Then you can simply look for the function calls that use that object and try to avoid them in one way or another.

But yes, I agree it makes writing some algorithms much trickier — using views in a hot inner loop can be fraught with performance challenges if you don’t have control of absolutely everything from there on down.

1 Like
#17

Yes! That’s how you get a gut feeling. But at the moment of writing some code, you normally don’t have all the context worked out (i.e. you don’t have all the callers written before designing the callee). Nor do you necessarily know the types you will use. And context matters!

Something that took me embarrasingly long is to figure out how to combine stack-allocation and mutability (like stack-allocated C arrays).

The right API is to use immutable tuples, and for modification do the tup = modify(tup, params) dance, where

function modify(tup, params)
    rtup = Ref(tup)
    ptr = pointer_from_objref(rtup)
    #use unsafe_store! to modify rtup, depending on params
    return rtup[]
end

This works reasonably well. Alternatively, one can pass a reference to modify; if this gets inlined, then the allocation can be elided. So one has the choice whether the caller or callee creates the Ref; if the caller creates the mutable container then everything depends on inline-heuristics for performance.

Something that still annoys me is the following: Often, one wants wrapper-types. That is,

struct MyWrap{T}
parent::T 
end 

in order to override some dispatches of the parent. Overriding iterate is especially important: for-loops are nicer syntax than the iteration protocol (the while-dance with tuple unpacking and benign type-instability). Also, iterate, getfield and getproperty are afaik treated specially (special-cased inlining / inference).

But if parent is not bitstype, then this abstraction is not zero-cost: It creates an extra pointer indirection and allocation (just like views).

Hence, MyWrap should never be stored in an array or structure, and only be passed to @inline or very expensive functions (these can hoist the indirection out of loops), and the caller needs to think very hard about two possible strategies: Create MyWrap early and reuse it, or create it on-demand. But for efficient reuse, it should be mutable instead of immutable!

Therefore, the entire API and the kind of types depend massively on available compiler optimizations and inline heuristics.

It would be really nice if I could write efficient code / APIs that do not unpack and repack the MyWrap all the time. That is, if I could actually abstract away the fact that MyWrap has a parent, instead of having to do some mix & mash between passing parent and MyWrap around, and having to test every single use for whether inline heuristics and escape analysis succeed.

So basically the old problem that immutable types are not inlined in arrays, structures and tuples, i.e. the fact that e.g. String and struct foo a::String end have different layout (first is char *, second is char**). This is a terrible problem that makes a lot of abstractions very expensive, and is only partially alleviated by escape analysis (the escape analysis is really cool! But allocation-avoidance based on that is a semi-reliable compiler optimization, instead of a property of the data layout).

One specific example that I gave up on:

I want to store a densish graph in a bitmap. Small graphs work fine with something like Vector{NTuple{N, UInt64}} (i.e. size of graph rounded to multiple of 64 is part of the type). For larger graphs, this does not fly anymore. So neighbors(g, i) needs to return some kind of view into the backing contiguous Matrix{UInt64}. And iterate(neighbors(g, i)) needs to iterate over set bits. Next, many graph algorithms need to iterate over something like neighbors(g, i) & neighbors(g, j) & ~neighbors(g, k). And now we have a problem: This needs to run without allocation of temporaries. The broadcast machinery is good at that, so lazy.(neighbors(g, i) .& neighbors(g, j) .& (~).(neighbors(g, k))) is a sensible syntax (cf related discussion, with awesome future @: syntax by @tkf ).

I am too stupid to get the resulting Broadcasted-view-construction in a way that avoids allocations for typical operations (all, any, iterate, count). The view is a custom type, not a standard view (because it needs to use the logical indexing code for iteration).

3 Likes
#18

Nice discussion, but I am mostly lost :slight_smile:

For me what makes Julia a real alternative (to Fortran) is the possibility of NOT using all those tweaks and still getting a reasonably efficient code. For instance I can write a function that computes a scalar product “as is” and that is quite fine. The fact that we can write code in a natural way without having to appeal to specialized functions is what keeps, I think, many people in the numerical computing community using Fortran and away from interpreted languages. We are users of the language and what we want is to think about the problems instead of the tweaks of the code. To obtain the most efficient code, in any language, the understanding of the inner workings of the compilers and of the hardware are necessary anyway, but for the most important aspects of a numerical code, what really matters is the understanding of the problem to be solved.

6 Likes
#19

I apologize for the derail!

The TLDR was: always respect the memory layout (mat[i, j] and mat[i+1, j] are contiguous), and if you pass short-lived views, make the callee @inline. Then, views of the type view(mat, :, i) are convenient, fast and allocation-free (at least most of the time). Views of any other kind than view(mat, s:e, i) or view(mat, :, s:e) or view(mat, :, i) are often slower than copying a subarray: They are not contiguous in memory, and if you do more than a single pass over the data then the copy will probably amortize. These views are mainly useful if you need to mutate the input array, or only do a single pass over the data.

Also, benchmark early and often, and read @code_typed and maybe @code_native in order to develop a gut feeling for what constructs end up fast. Also, cthulhu is amazing.

6 Likes