Idea for setting both mutables and immutables without allocation

Here’s a crazy proposal for working with mutables and immutables more generically. Sorry it ended up a bit long.
One great feature of Julia is the ability to perform generic programming: write just one function method that works for a large variety of inputs, including some that maybe you weren’t considering when you first wrote it. So long as the input types conform to the operators and function you call, all will work well. For instance, a Taylor series evaluation function for numbers would probably work perfectly well for square matrices.

So let’s say we had a Taylor expansion evaluation (note, this might not be optimal):

function taylor_exp(x, coeffs)
    tmp = one(x)
    out = zero(x)
    for c in coeffs
        out = out + c*tmp
        tmp = tmp*x            
    end
    return out
end

Here, coeffs can be any iterable collection, and the method works fine for x being either a number or a square matrix (or any other type supporting +, *, zero and one), however while for numbers this is close to optimal there is a big problem if x is of a mutable (or non-isbits immutable) type.

The problem is that every call to out = out + c*tmp would be allocating, making my code (possibly quite signficantly) slower than it needs to be. We can fix this in certain cases, e.g. using fused mutating broadcast, so that out .= out .+ c*tmp but then this won’t work with immutable numbers (or the immutable arrays from StaticArrays.jl). I admit, there is also the problem of c*tmp being allocating and tmp = tmp*x being allocating (I have a better example below) - but this highlights a bigger, structural problem, that in certain cases I can’t write generic and performant code for both immutable and mutable types. Adding to the pain is that it is not easy to write just two methods - one for heap-allocated, mutable objects and another for stack-allocated, mutable objects - because it is difficult to catch the distinction at dispatch (I would have to write a bunch of methods for types that I know are immutable, and another bunch of methods for types that I know are mutable, or some such madness).

So how to fix this? Let’s take a better example than the above, for which (at work) I have already written optimal code for both cases. Say we have a collection of 3-vectors representing points in space, and we want to find the 3x3 “covariance” matrix for how they are scattered in space (say, because its Eigenvalues tell us something interesting about the distribution of points - whether they are linear, planar or spread out in 3D). If we didn’t have to worry about allocation, then I would write this code (for simplicity I also assume the mean of the points is at the origin and ignore normalization):

function covariance_immutable(points)
    cov = zero(first(points)*first(points)')
    for p in points
        cov = cov + p*p'
    end
    return cov
end

In fact, this code works perfectly well if points is a collection of static vectors from StaticArrays.jl, such as a Vector{SVector{3,Float64}}, and this function is much, much slower using Base.Array because it reallocates cov many, many times in the loop (and it is just a small array, so allocation and garbage collection costs dominate the actual computations).

To remove the allocations we can use:

function covariance_mutable(points)
    cov = zero(first(points)*first(points)')
    for p in points
        cov .= cov .+ p.*p'
    end
    return cov
end

Let’s assume for a moment that we have inline non-isbits immutables so that p' is a non-allocating view of p. Then broadcast loop fusion would eliminate all the allocation within the loop, and only the output matrix cov would be heap-allocated in the first line. Then this function becomes competitive to the above when p is a Base.Vector or StaticArrays.MVector (or more importantly, when cov is a Base.Matrix or StaticArrays.MMatrix), but it no longer works for SVector.

While I could implement the first method as default and implement the second wherever I know it works, it would be preferable if I could just write one method. Here I propose one idiom for assigments which are overwriting for mutables and work for immutables, as an alternative to x = y or x .= y (or the new x <- y):

  • Introduce := such that x := y is lowered to x = Base.set(x,y), where types can implement setting as they need to (via some function like Base.set!). The idea of x := y is that the memory allocated to x is overwritten in place (whether that is the stack or the heap) - if that is not possible then an error would be thrown (e.g. arrays of incompatible size).
  • For immutable x we would simply have set(x,y) = y.
  • For mutable x we would do the equivalent of set(x,y) = (set!(x, y); return x) (but it would be up to the types to implement this correctly, e.g. composite types/structs could do this too, not just arrays, but the default behavior of set! could be broadcast! or maybe setindex!).
  • We could go futher and implement broadcast_set and do dot-call fusion with :=, e.g. x := y .+ z.

Then we could end up with:

function covariance_generic(points)juliua
    cov = zero(first(points)*first(points)')
    for p in points
        cov := cov .+ p.*p'
    end
    return cov
end

which works well in both cases!

Here’s a possible implementation of set:

@generated function Base.set(x, y)
    if x.mutable
        return :(Base.set!(x,y); return x)
    else
        return :y
    end
end

where it is the resposibility of type owners to define a specialized method for Base.set! if their type is mutable (maybe with a fallback of broadcast! or setindex! (or vice-versa), or some generated function for structs).

(Please note: I don’t think the lack of genericism with mutable/immutable objects is limited to just geometry and static arrays as I’ve focussed on in the examples - I’m bringing this up because I feel that this could be useful more broadly in the language. It is also possible that better compiler technology could re-use allocated memory in cases like shown in the example, making the first “functional” approach work well also for mutable structs and arrays, but that sounds tricky when e.g. there are no static guarantees as to the size of the array being compatible).

8 Likes

Yes, this sounds really good. I recently noted how DifferentialEquations.jl has two implementations for each method for this very reason:

If the := syntax could really work, it would be a major win for packages which provide performant yet general code.

:relieved: mission critical infrastructure :wine_glass:

You could use a macro, right? I’m not a big fan of ascii syntax:

using MacroTools

set(es...) = "unimplemented"

assignment_pieces(e) = @match e begin
    (a_ = b_) => (a, b)
    a_ => error("Expecting an assignment")
end

macro set(e)
    left, right = assignment_pieces(e)
    esc( :( $left = set($left, $right) ) )
end

a = 1
@set a = 2

This is far too common of a thing to obfuscate with a macro. The broadcasting already gives the math elegance and readability, := is a small addition that then makes it all work in one code. I’ll take.

1 Like

For mutables, doesn’t the right hand side still allocate, before it copies that temporary into cov? That is, doesn’t set! need to be a higher order function in order for it to be equivalent to broadcast!?

Isn’t this a similar idea to:

https://github.com/JuliaLang/julia/issues/4176

?

I think that historically there has been significant objections to having mutable number types.

@bramtayl A macro is a good idea for developing the concept and see how it plays out (we can do that with a package), but I would hate to see a very large number of lines of code beginning with @set.

For mutables, doesn’t the right hand side still allocate, before it copies that temporary into cov? That is, doesn’t set! need to be a higher order function in order for it to be equivalent to broadcast!?

@danielmatz You’re right - I think I touched on this in the OP, but yes we need p' to be a RowVector (#19670), we’d need to “inline” immutables with references to mutables so that result doesn’t allocate (a long discussed issue in Julia, though I can’t find an open issue right now), and we’d definitely need full integration with broadcast dot-call fusion so it lowers to cov = set_broadcast((x,y,z) -> x+y*z, cov, cov, p, p') with set_broadcast() behaving a bit like set() (e.g. calling broadcast or broadcast! as appropriate).

Firstly, a minor nitpick the use of @generated is almost certainly overkill: you should be able to simply use

isimmutable(x) ? (Base.set!(x,y); return x) : y

I think perhaps a better solution might be to make .= more flexible. One other gripe I have is that it can only be used for broadcasting. In other words, I can’t do

y .= A*x

to get an inplace matrix-vector multiplication.

One idea would be to have this lowered to something like:

y = inplace!(y, *, A, x)

(this was something I originally tried doing with InplaceOps.jl, though that code is quite old and predates the 0.5 functions changes).

5 Likes

OK - nice package. And yes, if isimmutable is handled specially by inference that that would be great (or maybe there is a @pure function for isimmutable(typeof(x))?).

Having a nifty way of ending up with daxpy, dgemv, dgemm and so-on would be nice. While this works for *, it gets a little tricky for

y := alpha*A*x + beta*y

to be converted to dgemv with no run-time overhead when y::Vector{Float64}, beta::Float64, alpha::Float64, A::Matrix{Float64} and x::Vector{Float64}. Seems challenging. To do this we probably need some kind of (isbits) immutable expression so we can analyse it statically and dispatch on combinations of expression patterns and types.

And, sure, if we can co-opt .= to work for immutables while remaining close to backwards-compatible, that would be awesome.

If x .= ... would be lowered to x = broadcast!(contructed_lambda, x, ...) instead of broadcast!(contructed_lambda, x, ...) and broadcast!(f, x, ...) would be changed to forward to broadcast(f, ...) if x is immutable, that would already be quite helpful, wouldn’t it?
Alas, that would change the semantics of broadcast!(f, x, ...) from “updates x” to “returns computed result, may change x”. Instead, we could introduce a new wrapper function for this which gets used for .= lowering. That would be a relatively small, non-breaking change, and would already be quite useful for StaticArrays, IIUC. Thoughts?

3 Likes

Instead, we could introduce a new wrapper function for this which gets used for .= lowering.

Right - that function sounds a bit like set (and set_broadcast), above.

Indeed it does. Just that set would not be needed, being replaced by set_broadcast(identity, ...).

1 Like

I totally agree we need to address this kind of generic programming and this class of performance issues. Unfortunately doing it right is difficult. In this example:

function covariance_generic(points)juliua
    cov = zero(first(points)*first(points)')
    for p in points
        cov := cov .+ p.*p'
    end
    return cov
end

if I understand the proposal, we’d call set!(cov, cov .+ p.*p'). It looks to me like that allocates a new array for cov .+ p.*p', then copies the data to cov. That doesn’t save any allocations.

The key feature of this example is that cov is a newly-allocated object with only one reference (ownership of which is passed to .+, then passed back). If cov were an argument instead, this would not work the way we want.

I think a full solution would require a combination of compiler analysis and runtime checks to ensure reference uniqueness, plus some standard way to define in-place methods. For example, given

x = y .+ z

if we know this overwrites the last reference to x, we can instead call

x = inplace!(x, .+, y, z)

with fallback definition

inplace!(x, f, y, z) = f(y,z)

Then you can define inplace! for other functions and types. I feel like this idea needs some tweaking though, since you don’t want to write a lot of redundant definitions.

3 Likes

Actually, you would need broadcast to fuse together with set to do what I was suggesting. Really, the idea is to bring the same (or similar) broadcast fusion syntax to we have now with .= to also work with immutable arrays, etc. Using a := ..., or set()/inplace!() is telling the compiler that it is free to trash whatever used to be in the memory of the variable - “I don’t need it anymore”.

1 Like

It’s still a mutating operation; calling it a compiler hint doesn’t help. If people really want the “mutate if mutable” behavior, we could add it to x .= y by changing the lowering to x = broadcast!(..., x, y), and saying that broadcast! is allowed to be non-mutating on immutable arguments.

I think a full solution should also work for BigInts. And I do not want to tell people, to get fast BigInt code, carefully look to make sure there are no shared references, then change all your =s to .=s. That is ugly and very error-prone. Plus we also want to be able to overwrite intermediate results in x + y + z + w, where there is no assignment. Ideally the compiler would know whether z + w returns a new object, and then tell the next + to overwrite it if possible.

3 Likes

As far as BigInt (and BigFloat) are concerned, I think the only way to really get good performance is to move away from trying to use a type that points at C managed memory and to change things to use an immutable type, with Julia managed memory.
See [WIP] implement BigInt (and BigFloat?) with native Array by rfourquet · Pull Request #17015 · JuliaLang/julia · GitHub and Latest (WIP) version of BigFlt (to replace BigFloat) module, with Flt type · GitHub.

I completely agree, but we would get even better performance adding the ability to safely overwrite numbers instead of allocating new ones when possible, on top of that.

2 Likes

Oh, yes, I was not at all saying that that part wasn’t critical! If BigInt and BigFloats were truly immutable, you wouldn’t have to worry about shared references.
It’s interesting, decimal floating point numbers via DecFP (which are stored as immutables) are currently quite a bit faster than similar precision BigFloats [and take much less space as well]).

The problem is that if an object is immutable, you want to be able to have shared references to it so you can pass it without the overhead of copying. Then of course the flip side is that mutating it “behind the scenes” is not safe. You have to pick your poison: either copy on assignment and be able to mutate, or avoid copies but not be able to mutate. AFAIK the only way to improve on that unhappy choice is to somehow track which objects are shared. I think there’s a good chance we should use an object tag bit for this purpose.

The problem with GMP is that it internally reallocates numbers while it works. We don’t (yet?) have a way to avoid it putting a pointer to new malloc’d storage in a BigInt that needs to be freed via a finalizer. I guess MPFR doesn’t have this problem, since the precision is fixed?

2 Likes