Allocations when constructing a matrix from columns

Hey all!

I’m constructing a matrix from another matrix like this:

shapematrix(xs) = [xs[:,1].-xs[:,4]  xs[:,2].-xs[:,4]  xs[:,3].-xs[:,4]]

This is more or less in an inner loop of my program, and I would like to reduce the number of allocations. @time reports 13 allocations here, when I would have thought a single one should suffice, and no matter how I rewrite the function, the allocation count stays in the double digits.

Is there a way of writing this that makes Julia a little more clever about its memory management?

I’m also using Zygote.jl, which doesn’t support mutating Arrays, so I can’t make a zero array and fill in the entries afterwards.

I’m new-ish to Julia, so maybe there’s something obvious going on here?

Note first that you can use broadcasting to rewrite this:

shapematrix2(xs) = xs[:, 1:3] .- xs[:, 4]

In order to reduce allocations, you can use views:

shapematrix3(xs) = @views xs[:, 1:3] .- xs[:, 4]

Let’s benchmark, remembering to use BenchmarkTools for this rather than the @time macro (although the allocation counts look similar):

julia> X = rand(1000, 4);

julia> @btime shapematrix($X);
  11.534 μs (14 allocations: 95.05 KiB)

julia> @btime shapematrix2($X);
  5.226 μs (5 allocations: 54.97 KiB)

julia> @btime shapematrix3($X);
  1.895 μs (4 allocations: 23.63 KiB)

I haven’t used Zygote, but this sounds odd. All of these array operations initialize arrays and then modify them afterwards, even if you don’t see it explicitly in the top-level code.
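
For instance, the [a b] syntax lowers to hcat, which under the hood does something roughly like this (a simplified sketch of the idea, not Base’s actual implementation; my_hcat is just an illustrative name):

function my_hcat(a::AbstractVector, b::AbstractVector)
    out = similar(a, length(a), 2)  # allocate the output...
    out[:, 1] = a                   # ...then mutate it in-place
    out[:, 2] = b
    return out
end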

Thanks! I’ve tried using views before, both on each column separately and as written in the OP, but it didn’t help at all, so I figured that wasn’t the problem.

I’d still prefer not to use broadcasting excessively, since I find it utterly unreadable, but it’ll do for now.

I don’t know anything about Zygote’s internals, but an educated guess is that if only internal procedures do the mutation, then Zygote has a well-defined boundary within which it can reason about the operations done to the matrices and calculate gradients based on that.

I know for a fact that the mutation doesn’t work, because if you try you get the error ERROR: Mutating arrays is not supported :smile:
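
For example, even a straightforward zero-then-fill version of shapematrix blows up (fillmatrix is just an illustrative name, and the exact message may vary with your Zygote version):

using Zygote

function fillmatrix(xs)
    out = zeros(eltype(xs), size(xs, 1), 3)
    for j in 1:3
        out[:, j] = xs[:, j] .- xs[:, 4]  # setindex! — this is the mutation Zygote rejects
    end
    return out
end

julia> Zygote.gradient(x -> sum(fillmatrix(x)), rand(3, 4))
ERROR: Mutating arrays is not supported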

Really? It exists to make code easier to read and write, that’s sort of the point. Comparing

[xs[:,1].-xs[:,4] xs[:,2].-xs[:,4] xs[:,3].-xs[:,4]]
# and 
xs[:, 1:3] .- xs[:, 4]

I find the former really hard to read, and the latter very clear.

(Stream of consciousness: “The first line is, hmm a jumble of expressions, oh, it’s horizontal concatenation. And it’s hard to see the pattern, because the indices are jumping, and oh, it’s minus between each term, is it? yes, and 1, 4, 2, something, oh, 1,4,2,4,3,4, ah, 1,2,3 minus the fourth. ok.” The latter is just “Subtract the 4th column from the other three.”)

I’d be interested in hearing what you don’t like about it. Maybe there’s some little mental block that can be cleared?


In addition to @DNF’s excellent suggestions, if this is really a bottleneck in your code then you could write a micro-optimized mutating version and then define the adjoint manually. See the Zygote manual for details. (I am just pointing out the technical possibility, but I would guess that it is overkill here).
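
A rough sketch of what that could look like (untested, and the names shapematrix! / shapematrix_fast are made up; the point is that the hand-written pullback just routes the cotangent back to the right columns):

using Zygote

# Micro-optimized mutating primal:
function shapematrix!(out, xs)
    @inbounds for j in 1:3, i in axes(xs, 1)
        out[i, j] = xs[i, j] - xs[i, 4]
    end
    return out
end
shapematrix_fast(xs) = shapematrix!(similar(xs, size(xs, 1), 3), xs)

# y[i, j] = xs[i, j] - xs[i, 4], so columns 1:3 of the gradient receive dy
# unchanged and column 4 receives minus the row sums of dy:
Zygote.@adjoint shapematrix_fast(xs) =
    shapematrix_fast(xs), dy -> (hcat(dy, -sum(dy, dims=2)),)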

That said, if you are new to Julia then focusing on allocations like this may not be the best allocation (heh) of your time.

It’s shorter, no doubt, but it’s also way denser. Originally, I wrote out the 9 terms without any use of :, and I think that’s the version I ultimately prefer, but I had to go back and forth a bit, in case there were performance tricks that would only kick in with a specific syntax (and wouldn’t you know!)

Here’s my stream of consciousness:

[ xs[1,1]-xs[1,4]  xs[1,2]-xs[1,4]  xs[1,3]-xs[1,4] ;
  xs[2,1]-xs[2,4]  xs[2,2]-xs[2,4]  xs[2,3]-xs[2,4] ;
  xs[3,1]-xs[3,4]  xs[3,2]-xs[3,4]  xs[3,3]-xs[3,4] ]

oh, it’s a 3x3 matrix, and A[i,j] = …uuh let’s see … xs[i,j] - xs[i,4].

[xs[:,1].-xs[:,4] xs[:,2].-xs[:,4] xs[:,3].-xs[:,4]]

Okay so it’s an array, and the elements are, oh no wait, xs[:,1] is a vector, so we’ll end up with a matrix where … uuh … the vectors here become the columns?, and then we broadcast xs[:,4] which is also a vector, so the broadcasting is really just element wise, okay.

xs[:, 1:3] .- xs[:, 4]

Okay so we index the matrix, oh with ranges in both places, okay, I guess we’ll get the submatrix? and then we broadcast-subtract xs[:,4] which is a vector, and subtraction of a vector needs another vector, so probably this will go across the columns? uuh wait xs is 3 high and so xs[:,4] is 3 long or uuh high, and so we’ll have to go across the columns in xs.


Whether I write the second one, or eventually get to the last one, isn’t really important to me, since I find the first version way clearer. In addition, I would think that spelling it out explicitly would make Julia’s job of generating good code for this easier, so that I don’t have to rewrite all of my functions when I realize that bad code is being generated. Somehow I got it completely backwards this time?

Anyways, thanks for the help! :slight_smile:

This single change cut down 1/3 of my allocations for my test case, so this is absolutely a bottleneck.

and then define the adjoint manually.

I could, but the entire reason I’m using Zygote is so that I don’t have to write out derivatives manually. It doesn’t make much sense for me to spend time learning the ins and outs of Zygote just to write more code by hand, when the whole point of using it is to avoid that :smile:

That said, if you are new to Julia then focusing on allocations like this may not be the best allocation (heh) of your time.

I appreciate that you’re trying to help, but I have a simulation that runs orders of magnitude slower than expected because Julia generates bad code (due to, it seems, me writing bad Julia), so this is definitely worth my time. I think it’s unfortunate that any time anyone asks about making things faster, people (and not to pick on you in particular!) jump in screaming premature optimization :frowning_face:

You might get a lot of performance benefit from reusing memory: pass the matrix along from iteration to iteration and update it in-place. If your matrices are this small, you might also see a lot of benefit from using StaticArrays from StaticArrays.jl (in which case you would not update in-place). I’m on my phone so I won’t write more here, but there are lots of examples on these forums of making these changes and getting big performance improvements.

I think you are mistaken about this.

Having AD does not mean that you never, ever have to code derivatives, just that you can focus on which ones you want to code manually (note that you also have a choice here, if you stick to the non-modifying version), and that the chaining will be organized nicely for you.

I am not sure I understand you. No one is screaming here.

It is very likely that there are alternative solutions depending on the context of this function (which we do not have, so it is hard to help), and focusing on optimizing this particular form may not be the best solution. This is natural if you are new to the language, and pointing this out should not be taken personally.

This was my first idea too, but I’m using Zygote to differentiate the functions, which complains when I mutate arrays, as written above :frowning_face:

Having AD does not mean that you never, ever have to code derivatives, just that you can focus on which ones you want to code manually

I’ve already written some adjoints manually due to excessive allocations in other places; in the limit, it looks like I’ll have to write my primals carefully to be 100% sure that Julia generates decent code, and write the adjoints manually to make sure that Zygote understands my primals. At that point, it doesn’t seem to me that using Zygote, or even Julia, makes a lot of sense. Yet, as far as I understand the goals of both projects, this is textbook usage. It’s frustrating, and I don’t understand if I’m just not getting it, or what’s going on.

I am not sure I understand you. No one is screaming here.

No, but you were suggesting that this really wasn’t a bottleneck, and that this might not be the best use of my time, which, at best, is presumptuous. The reason I’m out here asking for help is, of course, that it is a problem. (Again, I didn’t mean to pick on you, but this is what you always get when asking for help with performance-related things pretty much anywhere online, and it’s a pet peeve of mine :wink:).

It is very likely that there are alternative solutions depending on the context of this function (which we do not have, so it is hard to help), and focusing on optimizing this particular form may not be the best solution.

That might be, but I intentionally posted a very small sub-problem, since it was a small, self-contained part of Julia that I had (and still have!) trouble properly understanding.


If 3x4 is the size, then the above options are:

julia> @btime shapematrix(xs) setup=(xs=randn(3,4));
  413.700 ns (13 allocations: 1.23 KiB)

julia> @btime shapematrix3(xs) setup=(xs=randn(3,4));
  64.027 ns (3 allocations: 272 bytes)

These are dwarfed by the cost of the gradient… partly because every indexing operation first makes a new zero(xs), then writes into that. Writing even a pretty crude gradient really does pay:

julia> @btime Zygote.gradient(sum∘shapematrix3, xs) setup=(xs=randn(3,4));
  6.862 μs (122 allocations: 4.70 KiB)

julia> Zygote.@adjoint shapematrix4(xs) = shapematrix4(xs), dys -> (dys .- [0,0,0,1]' .* sum(dys, dims=2),)

julia> @btime Zygote.gradient(sum∘shapematrix4, xs) setup=(xs=randn(3,4));
  773.673 ns (13 allocations: 848 bytes)

But for such small arrays, as others have mentioned, you probably want to do something more like this:

julia> using ForwardDiff, StaticArrays

julia> @btime ForwardDiff.gradient(sum∘shapematrix4, xs) setup=(xs=randn(3,4));
  598.392 ns (5 allocations: 4.02 KiB)

julia> @btime ForwardDiff.gradient(sum∘shapematrix4, xs) setup=(xs=@SArray randn(3,4));
  181.366 ns (2 allocations: 1.34 KiB)

You can also do Zygote.gradient(xs -> sum(Zygote.forwarddiff(shapematrix4, xs)), xs) but this seems to have more overhead here.


Maybe you’re new here, but this board is full of almost obsessive micro-optimization enthusiasts. You can hardly drop a single line of code here without a hailstorm of optimization advice raining down on you, solicited or unsolicited.

So, if you have a different impression, this must be an off day.


The function I’m taking the gradient of is much bigger than my toy example. I’m trying to move things to SMatrix, but ran into some problems with Zygote and StaticArrays (which I posted in Zygote’s issue tracker).

Thanks anyways :smile:

I am indeed new here! :wink:

OK, but then it’s hard to guess what will help. Care to write a more representative toy problem?

(Likewise for the Zygote issue — gradient(identity, 1) works pretty well!)


I wasn’t suggesting that this isn’t a bottleneck for you, simply that focusing on the allocations per se (and keeping other things unchanged) may not be the best approach.

E.g. if the xs are small, you can write very clean, idiomatic code using StaticArrays (as suggested by others), which should be rather fast, and Zygote will just work fine with it.
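
For example, something along these lines (a sketch assuming 3×4 inputs; shapematrix_static is a made-up name):

using StaticArrays

# Fixed-size and non-mutating; SMatrix values live on the stack,
# so this should not heap-allocate at all:
shapematrix_static(xs::SMatrix{3,4}) =
    @SMatrix [xs[i, j] - xs[i, 4] for i in 1:3, j in 1:3]

julia> shapematrix_static(@SMatrix rand(3, 4));  # returns a 3×3 SMatrix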

The Julia ecosystem has evolved some very efficient techniques for dealing with the issues you are facing, but it is very hard to help without context.

I’m not sure if this is a fruitful approach, but if mutation is the only way to get the performance you need, then perhaps the optimal strategy is to use something like FiniteDiff.jl to calculate the pullbacks of the mutating code, and then use Zygote’s custom adjoint machinery to embed that, so the rest of your code can be handled by Zygote.

I tried to make a toy example, but I don’t have enough information about your actual problem, and I also wasn’t too sure about the correct way to mix Zygote.jl with FiniteDiff.jl. Maybe someone here can cook up a good example.
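
Very roughly, I imagine the shape of it would be something like this (a completely untested sketch; slow_mutating_f is a hypothetical stand-in for the real mutating code, and the pullback is built from a dense finite-difference Jacobian, which only makes sense for small inputs):

using Zygote, FiniteDiff

# Stand-in for the fast, mutating primal:
function slow_mutating_f(xs)
    out = similar(xs, size(xs, 1), 3)
    for j in 1:3
        out[:, j] .= xs[:, j] .- xs[:, 4]
    end
    return out
end

# Custom adjoint: get the pullback from a finite-difference Jacobian,
# so Zygote never has to trace through the mutation itself.
Zygote.@adjoint function slow_mutating_f(xs)
    y = slow_mutating_f(xs)
    function pullback(dy)
        J = FiniteDiff.finite_difference_jacobian(
            x -> vec(slow_mutating_f(reshape(x, size(xs)))), vec(xs))
        (reshape(J' * vec(dy), size(xs)),)  # vector-Jacobian product
    end
    return y, pullback
end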