# Passing sub-array by reference / performance

In trying to pass subarrays by reference in a compute-intensive code, I found the `view` function, which appeared to be the solution. However, I see significant performance issues using that option relative to passing the whole arrays. The following example illustrates my observations:

The “distance” is computed with one of three functions. The first receives the indices and the whole vectors (and thus their addresses), and is fast. The second receives a subarray (by copying), and is very slow. The third receives the vector ranges passed with the `view` function, and is faster than passing the subarray, but still much slower than the first.

```julia
function dfunc_all(i,j,x,y)
    dist = (x[i,1]-y[j,1])^2 + (x[i,2]-y[j,2])^2 + (x[i,3]-y[j,3])^2
    return dist
end

function dfunc_subarray(x,y)
    dist = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
    return dist
end

function dfunc_view(x,y)
    dist = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
    return dist
end

function dist(n,x,y,type)
    for i in 1:n
        for j in 1:n
            if type == 1
                dist = dfunc_all(i,j,x,y)
            end
            if type == 2
                dist = dfunc_subarray(x[i,1:3],y[j,1:3])
            end
            if type == 3
                dist = dfunc_view(view(x,i,1:3),view(y,j,1:3))
            end
        end
    end
end

n = 10000
x = rand(n,3)
y = rand(n,3)

print("dfunc_all: ")
@time dist(n,x,y,1)
print("dfunc_subarray: ")
@time dist(n,x,y,2)
print("dfunc_view: ")
@time dist(n,x,y,3)
```
```
dfunc_all:   0.179748 seconds (93.56 k allocations: 4.416 MiB)
dfunc_subarray:   5.541814 seconds (200.00 M allocations: 20.862 GiB, 5.47% gc time)
dfunc_view:   1.602188 seconds (200.00 M allocations: 11.921 GiB, 8.95% gc time)
```

I could not find a solution to this problem, that is, how to use a function that receives a vector as an argument while passing it a subarray of a higher-dimensional array.

Any help is appreciated.

Try using x=rand(3,n) so that you access the data by column. Views get a speedup of a factor of 30 and memory allocation goes down to almost 0 (views still allocate).
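To see why the layout matters, here is a small illustrative check (my own sketch, not from the thread): Julia stores arrays column-major, so the elements of one column are adjacent in memory, and a whole-column slice walks memory with stride 1.

```julia
# Column-major layout: with x = rand(3, n), the points x[:, i] are contiguous;
# with x = rand(n, 3), the coordinates of one point are n elements apart.
x_cols = rand(3, 5)
x_rows = rand(5, 3)

@assert strides(x_cols) == (1, 3)   # stepping down a column moves 1 element
@assert strides(x_rows) == (1, 5)   # stepping across a row moves 5 elements
```

This is why `rand(3, n)` lets each `view(x, :, i)` read a contiguous block, while `view(x, i, 1:3)` on `rand(n, 3)` has to gather strided elements.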


Impressive. I didn’t expect that to be so important, particularly for the difference between using `view` and passing the vector directly. It completely solves the problem, and `view` becomes even faster than the explicit indexing:

```
dfunc_all:   0.162120 seconds (101.27 k allocations: 4.682 MiB)
dfunc_subarray:   5.205288 seconds (200.00 M allocations: 20.862 GiB, 3.70% gc time)
dfunc_view:   0.102676 seconds (4 allocations: 160 bytes)
```

Judging by (1) the fact that you are using `@time` to benchmark your versions, and (2) the large number of allocations in the `dfunc_all` version above, I would say that you’re probably not measuring what you think you are: there is probably some compilation happening in your tests in the `dfunc_all` version but not the `dfunc_view` version, which biases things.

This particular problem can be solved by using `@btime` (from the `BenchmarkTools` package) instead of `@time`.

Another problem might arise from the fact that you are doing nothing with the results of your computation, which allows the compiler to perform various optimizations, again leading you to benchmark things that are potentially not comparable. This problem can be solved by doing something (anything) with the results, for example accumulating them in a variable. A collateral benefit from this is that you get a way of catching errors in case all variants would compute different things.

This is what I get in the end: the implementation based on views performs as well as the implementation using the whole array.

```julia
julia> @btime dist($n,$x,$y,1)
  396.097 ms (0 allocations: 0 bytes)
4.971435615520719e7

julia> @btime dist($n,$x,$y,2)
  7.582 s (200000000 allocations: 20.86 GiB)
4.971435615520719e7

julia> @btime dist($n,$x,$y,3)
  397.164 ms (0 allocations: 0 bytes)
4.971435615520719e7
```

I also took the liberty of fixing a few stylistic issues in your code:

• you don’t have to explicitly `return` from a function: the result of the last expression is taken as the return value of the function;
• it is easy to use variable names that shadow function names (like your use of the name `dist` for many different things). In more complex situations, this could lead to hard-to-debug issues;
• the same goes for `type`, which used to be a keyword of the language. I’m not sure it is that problematic to have a variable named like this, but at least it messes up the automatic indentation and syntax coloring in my editor.

Here is the complete code if you want to test it:

```julia
function dfunc_all(i,j,x,y)
    (x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2
end

function dfunc_subarray(x,y)
    (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dfunc_view(x,y)
    (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dist(n,x,y,typ)
    acc = zero(eltype(x))
    for i in 1:n
        for j in 1:n
            if typ == 1
                d = dfunc_all(i,j,x,y)
            end
            if typ == 2
                d = dfunc_subarray(x[:,i],y[:,j])
            end
            if typ == 3
                d = dfunc_view(view(x,:,i),view(y,:,j))
            end

            acc += d
        end
    end
    acc
end

n = 10000;
x = rand(3,n);
y = rand(3,n);

using BenchmarkTools
print("dfunc_all: ")
@btime dist($n,$x,$y,1)
print("dfunc_subarray: ")
@btime dist($n,$x,$y,2)
print("dfunc_view: ")
@btime dist($n,$x,$y,3)
```

Just wanted to point out that views only allocate if they escape. For example, in @ffevotte’s code above, they did not.


I added summing the distances and returning the sum to the code. Using `@btime`, I get:

With the original (row-based vector):

```
dfunc_all:   348.380 ms (1 allocation: 16 bytes)
dfunc_subarray:   5.346 s (200000001 allocations: 20.86 GiB)
dfunc_view:   1.640 s (200000001 allocations: 11.92 GiB)
```

And with the better column-based vectors:

```
dfunc_all:   239.150 ms (1 allocation: 16 bytes)
dfunc_subarray:   5.433 s (200000001 allocations: 20.86 GiB)
dfunc_view:   193.442 ms (1 allocation: 16 bytes)
```

I really appreciate all the comments.

Two small doubts:

1. Why do some of you call the function as `dist($n,$x,$y,3)`? (that is, why the “$” there?)
2. What does `views` escaping mean (and in which case does a view allocate)?

Thank you.

This is good practice when using `BenchmarkTools`. Quoting the documentation: “A good rule of thumb is that external variables should be explicitly interpolated into the benchmark expression.”
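As an illustrative sketch (variable names are mine, not from the thread), interpolation keeps the benchmark from paying the cost of looking up a non-constant global on every evaluation:

```julia
using BenchmarkTools

x = rand(1000)

# Without $, `x` is treated as a non-constant global inside the benchmark
# expression, so type inference fails and the lookup overhead is measured too.
@btime sum(x)

# With $, the value of `x` is interpolated into the expression, so the
# benchmark measures only the call to `sum`.
@btime sum($x)
```

The difference is most visible for cheap operations, where the global-lookup overhead can dominate the timing.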

(I’m not necessarily the best person to answer this, so please bear with my handwavy explanation.
Also, for the sake of clarity I’m using the word “function” below, when I should really be speaking of “methods”, but I don’t think the difference matters much here).

This has to do with how the compiler works. You probably know that Julia uses a Just Ahead of Time (JAoT) compiler, which compiles the code of your functions specifically for the types of the arguments they are called with, when these arguments are known. This means, for example, that your functions `dfunc_subarray` and `dfunc_view` are exactly the same: you could have a single `dfunc` function, which would be recompiled to produce different machine code depending on whether it is called with `x` and `y` being array views or copied sub-arrays.
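As a quick sketch of that point (the name `dfunc` is hypothetical, following the paragraph above), one generic definition serves both cases, and Julia compiles a distinct specialization for each concrete argument type:

```julia
# One generic definition; Julia specializes it per concrete argument type.
dfunc(x, y) = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2

a = rand(3, 4)
b = rand(3, 4)

# Called with copies (Vector{Float64}) and with views (SubArray): identical
# results, but two distinct compiled method instances behind the scenes.
r_copy = dfunc(a[:, 1], b[:, 2])
r_view = dfunc(view(a, :, 1), view(b, :, 2))
@assert r_copy == r_view
```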

Now this `dfunc` function is so small that it probably gets inlined, that is to say its code is (more or less) injected in the place of the function call. When you are calling `dist(n, x, y, 3)`, the compiler thus compiles a specific function which behaves (more or less) as if you had declared:

```julia
function dist3(n, x, y) # the 4th argument is a constant
    for i in 1:n
        for j in 1:n
            # no more ifs: the tests are resolved statically since `typ` is
            # known to be 3. This is known as "constant propagation".

            # arguments to dfunc_view
            xx = view(x, :, i)
            yy = view(y, :, j)

            # inlined call to dfunc_view
            d = (xx[1]-yy[1])^2 + (xx[2]-yy[2])^2 + (xx[3]-yy[3])^2
        end
    end
end
```

And now the compiler is smart enough to compile this to (more or less) the same machine code as what gets produced by your first variant (where the full arrays are passed along with the corresponding indices).

Does this make sense?

EDIT:
I got lost somewhere and did not finish explaining the link between all this and the escape (or lack thereof) of views.

When the compiler sees things such as

```julia
xx = view(x, :, i)
xx[1] # equivalent to getindex(xx, 1)
```

it knows that `xx` is of a certain type, so it uses a very specific version of the `[]` operator (or rather the `getindex` function), which is specialized for view objects. This variant is defined in such a way that `xx[1]` above is equivalent to `x[1,i]`. And the compiler knows this statically, so no view object actually gets created in the end (hence no allocation, as highlighted by @Elrod above). This is what is meant by “no views escape”. If the `dfunc_view` function had not been inlined, then view objects would have had to be instantiated in order to be passed as arguments to `dfunc_view`: these views would have “escaped” the scope of `dist`. And these instantiations would have been revealed by the presence of (small) allocations.
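This can be checked directly (my own sketch; names are hypothetical, and it assumes a reasonably recent Julia where non-escaping views are elided):

```julia
# A view created and consumed inside an inlined callee never escapes,
# so its construction is elided and no allocation occurs.
@inline first_of_col(v) = v[1]

function sum_firsts(m)
    s = 0.0
    for i in 1:size(m, 2)
        s += first_of_col(view(m, :, i))   # view never escapes this loop
    end
    return s
end

const M = rand(3, 100)
sum_firsts(M)                     # warm up: compile before measuring
measure() = @allocated sum_firsts(M)
measure()
@assert measure() == 0            # the views were never materialized
```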


Fantastic! Thank you!

Following up the performance analysis with a more realistic code:

```julia
function dfunc_view(x,y)
    return (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2
end

function dsum(n,x,y,itype)
    dsum = 0.
    for i in 1:n
        for j in 1:n
            if itype == 1
                dsum = dsum + dfunc_view(view(x,1:3,i),view(y,1:3,j))
            end
            if itype == 2
                dsum = dsum + ((x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2)
            end
            if itype == 3
                dsum = dsum + (x[1,i]-y[1,j])^2 + (x[2,i]-y[2,j])^2 + (x[3,i]-y[3,j])^2
            end
        end
    end
    return dsum
end

n = 10000
x = rand(3,n)
y = rand(3,n)

print("view: ")
@btime dsum($n,$x,$y,1)
print("parenthesis: ")
@btime dsum($n,$x,$y,2)
print("no parenthesis: ")
@btime dsum($n,$x,$y,3)
```

Result:

```
view:   195.145 ms (0 allocations: 0 bytes)
parenthesis:   217.608 ms (0 allocations: 0 bytes)
no parenthesis:   308.934 ms (0 allocations: 0 bytes)
```

(in practice only one of the options persists, and the times are slightly reduced, but with the same trend).

I think I understand these results, and only wanted to register them.

That is not really a good way to benchmark, because you have conditionals in the loop which depend on runtime values. It is possible that the compiler does loop duplication and moves the conditional outside the loop, or perhaps Julia’s constant propagation takes care of it, but in general it is good to keep the benchmark code as “untainted” as possible.

I have tested without the conditionals, the times are all reduced, but the trend is the same. (edit: In this case!)

The results are, leaving only one of the options each time:

```
view:   147.059 ms (0 allocations: 0 bytes)
parenthesis:   175.737 ms (0 allocations: 0 bytes)
no parenthesis:   300.870 ms (0 allocations: 0 bytes)
```

In this case yes, but not in general. There might be optimizations available for one of the algorithms that get ruined by the conditional but not for the others, so the advice still stands.


They allocate if the compiler cannot prove that they don’t escape. This is an important distinction: in order to write fast Julia code, you need to either understand the details of the compiler (I don’t) or at least have a gut feeling for what kinds of properties the compiler is good at proving (how I work since the 0.6 -> 0.7 switch).

Unfortunately this affects API design: `dfunc_view` was a terrible design choice in 0.6 and has become viable since 0.7 due to compiler optimizations; but it is somewhat opaque which APIs are viable today. This is one of my eternal gripes: starting with 0.7, it is very hard (and undocumentable) to figure out which abstractions are zero-cost, depending on context, compiler version, and phase of the moon.

Example:

```julia
julia> using BenchmarkTools

julia> N=1_000; a=rand(3, N); arr=[];

julia> @inline function dot_escape(a,b)
           @fastmath @inbounds s = a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
           if rand()<1e-12
               global arr
               push!(arr, a)
               push!(arr, b)
               @assert false
           end
           return s
       end;

julia> @inline function dot_noescape(a,b)
           @fastmath @inbounds s = a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
           if rand()<1e-12
               @assert false
           end
           return s
       end;

julia> @noinline function kloop(kernel, mat)
           s=0.0
           @inbounds for i=1:size(mat, 2)
               for j=1:size(mat, 2)
                   s += kernel(view(mat, :, i), view(mat, :, j))
               end
           end
           s
       end;

julia> @btime kloop(dot_escape, a)
  32.272 ms (2000001 allocations: 91.55 MiB)
750662.4844627458

julia> @btime kloop(dot_noescape, a)
  3.610 ms (1 allocation: 16 bytes)
750662.4844627458
```

We see that the hypothetical escape path (never executed: we didn’t see the assertion failure) slowed down execution significantly and caused lots of extra allocations.

The above is easy to figure out: `dot_escape` has an execution path leading to escape, and the compiler is currently bad at deferring allocation until the last moment. Some people are working on that.

Somewhat harder to figure out is which exception-based (never-taken) execution paths lead to eager allocations in the current compiler.


You are missing a third option, which is to run the code and measure it.

Could you elaborate on what API you feel you can’t write because of performance limitations in Julia?


Such is the way with nearly any sufficiently advanced compiler. Measurement is indeed the answer, but Julia also has a fourth option: you can ask it for its intermediate work. It takes some time to learn, but if you start with small kernels like this and work your way up it’s quite doable. And it has been getting much better lately — now it’s at the point where you can just look at `@code_typed` and look for the `%new`. Then you can simply look for the function calls that use that object and try to avoid them in one way or another.
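As a concrete sketch of that workflow (the kernel name is hypothetical), you can ask the compiler for its typed IR and scan the output for `%new` nodes, which mark actual object construction:

```julia
# Inspect the typed IR of a small kernel that creates a view.
first_of_col(m, i) = view(m, :, i)[1]

code = code_typed(first_of_col, Tuple{Matrix{Float64}, Int})
ci, rettype = first(code)    # a CodeInfo => return-type pair
println(ci)                  # scan the printed IR for `%new(...)` nodes
@assert rettype == Float64   # inference resolved the element access
```

If no `%new` appears for the `SubArray`, the view's construction was optimized away; if one does, look at which calls consume it and try to restructure them.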

But yes, I agree it makes writing some algorithms much trickier — using views in a hot inner loop can be fraught with performance challenges if you don’t have control of absolutely everything from there on down.


Yes! That’s how you get a gut feeling. But at the moment of writing some code, you normally don’t have all the context worked out (i.e. you don’t have all the callers written before designing the callee). Nor do you necessarily know the types you will use. And context matters!

Something that took me embarrassingly long to figure out is how to combine stack allocation and mutability (like stack-allocated C arrays).

The right API is to use immutable tuples, and for modification do the `tup = modify(tup, params)` dance, where

```julia
function modify(tup, params)
    rtup = Ref(tup)
    ptr = pointer_from_objref(rtup)
    # use unsafe_store! to modify rtup, depending on params
    return rtup[]
end
```

This works reasonably well. Alternatively, one can pass a reference to `modify`; if this gets inlined, then the allocation can be elided. So one has the choice whether the caller or callee creates the `Ref`; if the caller creates the mutable container then everything depends on inline-heuristics for performance.
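A concrete (and slightly more defensive) instance of this dance might look like the following sketch. The helper name `setindex_tuple` is mine, not from the thread, and I've added `GC.@preserve` around the raw-pointer store, which the outline above elides:

```julia
# Overwrite element `i` of an isbits tuple "in place" via a Ref, returning
# the modified tuple. The Ref's storage holds the tuple inline, so a raw
# store into it updates one slot without allocating a new tuple per element.
function setindex_tuple(tup::NTuple{N,T}, val::T, i::Integer) where {N,T}
    rtup = Ref(tup)
    GC.@preserve rtup begin                     # keep rtup rooted during the store
        ptr = Ptr{T}(pointer_from_objref(rtup))
        unsafe_store!(ptr, val, i)              # 1-based slot index
    end
    return rtup[]
end

t = (1.0, 2.0, 3.0)
t = setindex_tuple(t, 9.0, 2)
@assert t == (1.0, 9.0, 3.0)
```

When the whole thing inlines, the `Ref` never escapes and the compiler can keep everything on the stack, which is the point of the exercise.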

Something that still annoys me is the following: Often, one wants wrapper-types. That is,

```julia
struct MyWrap{T}
    parent::T
end
```

in order to override some dispatches of the parent. Overriding `iterate` is especially important: `for`-loops are nicer syntax than the iteration protocol (the `while`-dance with tuple unpacking and benign type-instability). Also, `iterate`, `getfield` and `getproperty` are afaik treated specially (special-cased inlining / inference).

But if `parent` is not a bitstype, then this abstraction is not zero-cost: it creates an extra pointer indirection and allocation (just like views).

Hence, `MyWrap` should never be stored in an array or structure, and only be passed to `@inline` or very expensive functions (these can hoist the indirection out of loops), and the caller needs to think very hard about two possible strategies: Create `MyWrap` early and reuse it, or create it on-demand. But for efficient reuse, it should be mutable instead of immutable!

Therefore, the entire API and the kind of types depend massively on available compiler optimizations and inline heuristics.

It would be really nice if I could write efficient code / APIs that do not unpack and repack the `MyWrap` all the time. That is, if I could actually abstract away the fact that `MyWrap` has a `parent`, instead of having to do some mix & mash between passing `parent` and `MyWrap` around, and having to test every single use for whether inline heuristics and escape analysis succeed.

So basically the old problem that immutable types are not inlined in arrays, structures and tuples, i.e. the fact that e.g. `String` and `struct foo a::String end` have different layout (first is `char *`, second is `char**`). This is a terrible problem that makes a lot of abstractions very expensive, and is only partially alleviated by escape analysis (the escape analysis is really cool! But allocation-avoidance based on that is a semi-reliable compiler optimization, instead of a property of the data layout).

One specific example that I gave up on:

I want to store a densish graph in a bitmap. Small graphs work fine with something like `Vector{NTuple{N, UInt64}}` (i.e. size of graph rounded to multiple of 64 is part of the type). For larger graphs, this does not fly anymore. So `neighbors(g, i)` needs to return some kind of view into the backing contiguous `Matrix{UInt64}`. And `iterate(neighbors(g, i))` needs to iterate over set bits. Next, many graph algorithms need to iterate over something like `neighbors(g, i) & neighbors(g, j) & ~neighbors(g, k)`. And now we have a problem: This needs to run without allocation of temporaries. The broadcast machinery is good at that, so `lazy.(neighbors(g, i) .& neighbors(g, j) .& (~).(neighbors(g, k)))` is a sensible syntax (cf related discussion, with awesome future `@:` syntax by @tkf ).

I am too stupid to get the resulting `Broadcasted`-view-construction in a way that avoids allocations for typical operations (`all`, `any`, `iterate`, `count`). The `view` is a custom type, not a standard view (because it needs to use the logical indexing code for iteration).


Nice discussion, but I am mostly lost. For me, what makes Julia a real alternative (to Fortran) is the possibility of NOT using all those tweaks and still getting reasonably efficient code. For instance, I can write a function that computes a scalar product “as is”, and that is quite fine. The fact that we can write code in a natural way, without having to appeal to specialized functions, is, I think, what keeps many people in the numerical computing community using Fortran and away from interpreted languages. We are users of the language, and what we want is to think about our problems instead of the tweaks of the code. To obtain the most efficient code in any language, an understanding of the inner workings of the compiler and of the hardware is necessary anyway; but for the most important aspects of a numerical code, what really matters is understanding the problem to be solved.


I apologize for the derail!

The TL;DR was: always respect the memory layout (`mat[i, j]` and `mat[i+1, j]` are contiguous), and if you pass short-lived views, make the callee `@inline`. Then views of the form `view(mat, :, i)` are convenient, fast, and allocation-free (at least most of the time). Views of any form other than `view(mat, s:e, i)`, `view(mat, :, s:e)`, or `view(mat, :, i)` are often slower than copying the subarray: they are not contiguous in memory, and if you make more than a single pass over the data, the copy will probably amortize. Such views are mainly useful if you need to mutate the input array, or only make a single pass over the data.
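The contiguity distinction can be checked with `strides` (my own illustration, not from the thread): a whole-column view of a column-major matrix is a stride-1 block, while a row view is not.

```julia
m = rand(4, 5)

# Column view: consecutive elements are 1 apart in memory (contiguous).
@assert strides(view(m, :, 2)) == (1,)

# Row view: consecutive elements are 4 apart (one full column each step).
@assert strides(view(m, 2, :)) == (4,)
```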

Also, benchmark early and often, and read `@code_typed` and maybe `@code_native` in order to develop a gut feeling for which constructs end up fast. Also, Cthulhu.jl is amazing.
