Allocations when constructing a matrix from columns

I’ve tried my best to distill my problem into a minimal working example to illustrate exactly what’s wrong, instead of dumping my entire 2k+ line project here; have I been wrong in this? I feel pretty confident that I’ll be able to write my program to be efficient, but it’s hard when I don’t understand why Julia is generating very slow code on operations I know can be written very fast.

(Likewise the Zygote issue, gradient(identity, 1) works pretty well!)

I see you’ve replied on the issue as well, but I’ll save a click to other readers by saying, no, my problem wasn’t that I wanted to differentiate the identity function and got stuck :wink:

I wasn’t suggesting that this isn’t a bottleneck for you, simply that focusing on the allocations per se (and keeping other things unchanged) may not be the best approach.

This is the quote I was referencing:

if this is really a bottleneck in your code then

Looking at it again, I might have been too rash, but at the time it looked like you were saying “If this is really the bottleneck in your code, then [you should do something completely different]”. My bad if this wasn’t what you meant :slight_smile:

E.g. if the xs are small, you can write very clean, idiomatic code using StaticArrays (as suggested by others), which should be rather fast, and Zygote will just work fine with it.

xs is indeed very small, and I tried the StaticArrays route, but see the issue linked by @mcabbott.

The Julia ecosystem has evolved some very efficient techniques for dealing with the issues you are facing, but it is very hard to help without context.

My issue was that the simplest way I could think of to define a single matrix of 9 elements ended up costing 13 allocations, which is almost 150% of the number of elements. This was very surprising to me, and having spent some time in my life looking at compiler output, I do have a feeling for when the compiler is doing the Wrong Thing, and in this case it 100% was.

Now, while the broadcasting version suggested by @DNF did fix it, I would like to better understand why Julia generated sub-par code before, so that I don't have to turn over every stone in my program or run a benchmark every time I write code just to avoid bad code generation. Furthermore, since I've already run into these problems with the optimizer for the primal, and I'm ADing this, I'd like to make the compiler's job as simple as possible so that it can do what it does best. This includes not allocating unless you really need to, even if you can't immediately spot it as a hot spot in a profile.
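For concreteness, here is a minimal sketch of the kind of broadcasting rewrite being referred to (not necessarily the exact code @DNF posted); a single fused broadcast writes the 3×3 result directly, so essentially only the output matrix is heap-allocated:

# hedged sketch of a broadcasting version; shapematrix_bc is an invented name
shapematrix_bc(xs) = @views xs[:, 1:3] .- xs[:, 4]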

I’m not sure if this is a fruitful approach, but if mutation is the only way to get the performance you need, then perhaps the optimal strategy is to use something like FiniteDiff.jl for calculating the pullbacks of the mutating code, and then use Zygote’s custom adjoint machinery to embed that, so that the rest of your code can be handled by Zygote.
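A hedged sketch of what that could look like, assuming an invented mutating kernel (shape!/shape are hypothetical names) and FiniteDiff's finite_difference_jacobian for the vector–Jacobian product:

using Zygote, FiniteDiff

# hypothetical mutating kernel that Zygote cannot trace directly
function shape!(out, xs)
    @inbounds for j in 1:3, i in 1:3
        out[i, j] = xs[i, j] - xs[i, 4]
    end
    return out
end
shape(xs) = shape!(similar(xs, 3, 3), xs)

# treat `shape` as a primitive: build its Jacobian with finite differences
# and return the vector–Jacobian product as the pullback
Zygote.@adjoint function shape(xs)
    y = shape(xs)
    function shape_pullback(Δ)
        J = FiniteDiff.finite_difference_jacobian(
            v -> vec(shape(reshape(v, size(xs)))), vec(xs))
        return (reshape(J' * vec(Δ), size(xs)),)
    end
    return y, shape_pullback
end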

Thanks! I’ll definitely look into it.


Oh dear, I just realized I’m completely butchering the reply system on Discourse.

No offense, but looking at the original code, it’s pretty obvious that it has to make a lot of allocations. Every slice is an allocation, every addition of slices is an allocation, and then the final assembly. So, that’s 9 allocations before the concatenation part. I can’t really say if one really needs four allocations to concatenate three vectors, but it doesn’t shock me.
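For illustration, a hedged reconstruction of the kind of slice-based code being described (the original isn't quoted in this reply):

# hypothetical slice-based construction: each xs[:, k] copies a column
# (6 allocations), each subtraction allocates a temporary (3 allocations),
# and the [a b c] syntax then concatenates the three columns
shapematrix_slices(xs) = [xs[:, 1] - xs[:, 4]  xs[:, 2] - xs[:, 4]  xs[:, 3] - xs[:, 4]]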

1 Like

The key advantage of Julia is that it supports fast composable generic code. The complicated machinery you are discovering here (eg for broadcasting) is necessary for that.

Obviously in some situations, having extra information about your data, you can do better. This does not mean that the compiler is doing anything wrong here. It is apparent that you consider yourself very experienced in programming, but you may have to adjust to Julia a bit to use it efficiently.

No, you did the right thing, but sometimes having a bit more context is useful, which is why people ask for it.

In any case, I have the impression that while you are asking for advice here, you take suggestions which do not conform to your expectations as an insult to your abilities or intelligence. This was not intended on my part, and I don’t think the other participants intended this either.

Nevertheless, I don’t find contributing further very rewarding while you maintain this kind of attitude. I hope you find the solution you are looking for.

What about the explicit version where I supply all 9 elements without any slicing or broadcasting? That yields the same number of allocations.

I assumed Julia would be able to do some liveness analysis of the vectors and remove the heap allocations, but apparently that’s not the case. It doesn’t have to do the extra allocations at all, since just writing it differently removes most of them.

Looking at it again, there still seem to be 3 allocations where only one should be needed, but maybe there are some Julia internals that require multiple allocations for single values (e.g. the matrix is one allocation, the matrix data is another, and the third is for good measure :wink: ).


Discourse didn’t let me write more replies since my account is new, so I’ll have to reply to you here, @Tamas_Papp.

The key advantage of Julia is that it supports fast composable generic code. The complicated machinery you are discovering here (eg for broadcasting) is necessary for that.

Generic code is only fast if you can specialize on the concrete types. This is exactly what I’m trying to do, and it obviously didn’t work, since the number of allocations is of a different magnitude than what you’d get from equivalent code in e.g. C.

Obviously in some situations, having extra information about your data, you can do better. This does not mean that the compiler is doing anything wrong here.

I’ve tried to supply the concrete types, and nothing changed. As far as I understand Matrix internals, there’s no need for the extra allocations, and that absolutely does mean that the compiler is doing something wrong here.

It is apparent that you consider yourself very experienced in programming

This is condescending.

you may have to adjust to Julia a bit to use it efficiently.

That’s the goal of this post; I don’t understand why Julia is generating bad code, and I would like to understand why it generates the code that it does. So far, there has been no explanation, only a mitigation.

you take suggestions which do not conform to your expectations as an insult to your abilities or intelligence

I definitely don’t feel this way, so maybe we’re having trouble understanding each other. What I will say is that getting told that you’re solving the wrong problem, getting offered solutions to problems you don’t have, and getting told that your problem isn’t really a problem, when I know that this is, and will continue to be if left unchecked, a problem, is very tiresome. I’m sorry if I’ve come across as rude because of this, but I really thought my initial problem was a small and isolated issue that could be explained briefly, without any context.

Nevertheless, I don’t find contributing further very rewarding while you maintain this kind of attitude. I hope you find the solution you are looking for.

Thanks anyways.

[Apologies in advance for the unwanted feedback] @Tamas_Papp, for what it’s worth, I also felt some of your posts came across as condescending (although I’m sure you didn’t mean it). To be frank, the whole discussion seems needlessly argumentative to me… especially considering the “This is the first time mht has posted — let’s welcome them to our community!” banner at the top.

But let’s focus on the core question… I also would love to understand why

A = rand(3,4)
@btime B = [$A[1] 2 3 4; 5 6 7 8; 9 10 11 12]

results in 16 allocations… It’s rather counterintuitive.

(Edited to use @btime instead of @time).

2 Likes

It’s actually 16 allocations (remember to use BenchmarkTools for stuff like this, the @time macro just isn’t suited for microbenchmarks.)

But, yes, I don’t know why you get a large number of allocations for ‘literal’ constructions like this, and the example that @mht had.

@DNF right, thanks for the correction, I edited my post.

For this particular one, it’s one allocation if A = rand(Int, 3,4). So my guess is that it’s hitting one of these “iterate, and widen the type when you have to” steps, which copy to a new array.

julia> @btime collect((1,2,3,4,5,6));
  37.975 ns (1 allocation: 128 bytes)

julia> @btime collect((1,2,3,4.0,5,6)); # one Float64 element
  176.427 ns (8 allocations: 480 bytes)

@mcabbott interesting, it’s even worse the other way:

julia> @btime collect((1.0,2.0,3.0,4,5.0,6.0)); # one Int64 element
  170.537 ns (15 allocations: 592 bytes)

And not to forget @mht’s problem:

julia> x = rand(3,4);

julia> f(xs) = [ xs[1,1]-xs[1,4]  xs[1,2]-xs[1,4]  xs[1,3]-xs[1,4] ;
                 xs[2,1]-xs[2,4]  xs[2,2]-xs[2,4]  xs[2,3]-xs[2,4] ;
                 xs[3,1]-xs[3,4]  xs[3,2]-xs[3,4]  xs[3,3]-xs[3,4] ];

julia> g(xs) = [ xs[1,1]  xs[1,2]  xs[1,3] ;
                 xs[2,1]  xs[2,2]  xs[2,3] ;
                 xs[3,1]  xs[3,2]  xs[3,3] ];

julia> @btime f(x);
  111.391 ns (10 allocations: 304 bytes)

julia> @btime g(x);
  106.538 ns (10 allocations: 304 bytes)

So there is still roughly one allocation per element even when the element type is uniform.

I don’t have any good insight here, but this one is pretty nice:

julia> X = rand(3, 4);

julia> function shapematrix6(xs)
           return @SMatrix [ xs[1,1]-xs[1,4]  xs[1,2]-xs[1,4]  xs[1,3]-xs[1,4] ;
             xs[2,1]-xs[2,4]  xs[2,2]-xs[2,4]  xs[2,3]-xs[2,4] ;
             xs[3,1]-xs[3,4]  xs[3,2]-xs[3,4]  xs[3,3]-xs[3,4] ]
       end

julia> @btime shapematrix6($X);
  4.568 ns (0 allocations: 0 bytes)

Or, if you need mutability:

julia> function shapematrix7(xs)
           return @MMatrix [ xs[1,1]-xs[1,4]  xs[1,2]-xs[1,4]  xs[1,3]-xs[1,4] ;
             xs[2,1]-xs[2,4]  xs[2,2]-xs[2,4]  xs[2,3]-xs[2,4] ;
             xs[3,1]-xs[3,4]  xs[3,2]-xs[3,4]  xs[3,3]-xs[3,4] ]
       end
shapematrix7 (generic function with 1 method)

julia> @btime shapematrix7($X);
  15.587 ns (1 allocation: 80 bytes)

3 Likes

yuyichao and Keno’s comments in this thread may help; I found them very interesting and informative.

Zygote is focused on machine learning, where 1 microsecond of overhead per operation is negligible in the face of gigantic array operations. I’ve often found ForwardDiff to be a better choice when doing lots of small operations (especially with StaticArrays, which it plays nicely with). Have you considered it?
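As a hedged example of that route, with an invented scalar objective built on the shape-matrix above:

using ForwardDiff

loss(xs) = sum(abs2, @view(xs[:, 1:3]) .- @view(xs[:, 4]))  # hypothetical objective

x = rand(3, 4)
g = ForwardDiff.gradient(loss, x)  # 3×4 matrix of partial derivatives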

2 Likes

Right, and perhaps my vector examples are orthogonal to this. Your f and g call hvcat, which is (in my understanding) an extremely flexible function that makes all kinds of concatenations of rows containing arrays of varying sizes look easy… but it is not fast. I’m not sure why it can’t call something like g_shape when all inputs are numbers:

julia> h(xs) = [xs[1,1], xs[1,2], xs[1,3]]; # no *cat functions

julia> @btime h($x);
  33.016 ns (1 allocation: 112 bytes)

julia> g_shape(xs) = reshape([xs[1,1], xs[2,1], xs[3,1], xs[1,2], xs[2,2], xs[3,2]], (2,3))
g_shape (generic function with 1 method)

julia> @btime g_shape($x)
  69.588 ns (2 allocations: 192 bytes)
2×3 Array{Float64,2}:
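As a sanity check, the literal syntax does lower to hvcat; you can see it with Meta.@lower (output elided here):

julia> Meta.@lower [1 2; 3 4]   # the lowered code calls Base.hvcat((2, 2), 1, 2, 3, 4)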

Thanks! I found (through trial and error) that forward passes seem to work way better for me, and as far as I can tell, Zygote’s `forwarddiff` uses ForwardDiff. Maybe I’ll end up switching all the way, but I don’t really understand the difference between forward and reverse mode well enough to make an informed decision just yet :slight_smile:

I found the MIT lecture notes on the topic to be pretty good:

I think it’s well worth the time invested in learning the differences between forward and reverse diff and how those apply to one’s target goal. And, if you have a good understanding of the math behind it, you can mix and match where you need to.

The same goes for autodiff frameworks. Although Zygote is very flexible, it doesn’t support mutation, and that can be a real thorn for high-performance linear algebra work, as you’ve found. However, as far as I know, all autodiff platforms only work on a subset of the language they’re embedded in, and Zygote’s non-mutation subset is among the least restrictive. In the end, like all things, it depends on your application.

4 Likes

If you view your function as a series of operations:

f(g(h(x)))

by the chain rule, the derivative is

f'(g(h(x))) * g'(h(x)) * h'(x)

In forward mode, you go forward:

hx = h(x)
dhx = h'(x)
gx = g(hx)
dgx = g'(hx) * dhx
fx = f(gx)
dfx = f'(gx) * dgx
return dfx

In reverse mode, you first go forward, and then backwards:

# forward
hx = h(x)
gx = g(hx)
fx = f(gx)
# reverse
pfx = f'(gx)
pgx = pfx * g'(hx)
phx = pgx * h'(x)
return phx

If you pretend that everything is a dense Jacobian (in reality, it’ll probably be a lot more efficient than that), you can then speculate on performance as a function of input → output sizes. E.g., if h is a 100 → 50 function, g maps 50 → 10, and f maps 10 → 1, then forward mode would require multiplying a 100 by 50 matrix with a 50 by 10 matrix, and then multiplying the resulting 100 by 10 matrix with a 10 by 1 vector. Reverse mode would instead multiply the 10 by 1 vector with the 50 by 10 matrix, and then multiply the 50 by 1 product with the 100 by 50 matrix. The latter involves many fewer operations: roughly (100*50*10 + 100*10*1)*2 ops for forward mode vs (50*10*1 + 100*50*1)*2 for reverse.

Assuming your function has just a single return value, reverse mode therefore scales much better as a function of the number of inputs. However, it is more complicated to implement, and generally has higher overhead – you need to figure out and execute the reverse pass somehow.
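As a concrete, hedged illustration of that scaling for a scalar-valued function of many inputs:

using ForwardDiff, Zygote

sqnorm(x) = sum(abs2, x)  # invented scalar function of length(x) inputs

x = rand(1000)
ForwardDiff.gradient(sqnorm, x)   # forward mode: cost grows with the number of inputs
Zygote.gradient(sqnorm, x)[1]     # reverse mode: one forward pass plus one reverse pass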

Using ForwardDiff from Zygote would let you use mixed mode (mixing the two).
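A hedged sketch of that (the inner function is invented; Zygote.forwarddiff runs the wrapped call under ForwardDiff while the surrounding code stays in reverse mode):

using Zygote

inner(x) = sum(abs2, x)                      # hypothetical small kernel
outer(x) = 2 * Zygote.forwarddiff(inner, x)  # this block is differentiated forward

Zygote.gradient(outer, rand(3))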

3 Likes