Loops, allocations and helping the coder

One of the performance pitfalls in Julia is allocation, especially in loops, for instance when lots of arrays are created inside a long loop.

This can be handled by creating “workspaces” and then filling/overwriting the arrays inside the loop. Unfortunately, that often leads to code that looks clumsy, and it’s certainly not straightforward for users with a non-CS background (scientists, say).
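
For concreteness, here is a minimal (made-up) sketch of the two styles:

# allocating style: a fresh array is created in every iteration
function simulate_allocating(v, T)
    s = 0.0
    for t in 1:T
        X = v .* t        # allocates a new array each iteration
        s += sum(X)
    end
    return s
end

# workspace style: the array is created once and overwritten in place
function simulate_workspace(v, T)
    s = 0.0
    X = similar(v)        # the "workspace", allocated once
    for t in 1:T
        X .= v .* t       # overwrites the existing array
        s += sum(X)
    end
    return s
end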

Therefore, my question: could “compiler” development possibly help users with this? For instance, by letting local X inside a loop mean that X is created in the first iteration, and that any subsequent X = ... mutates the existing X?

6 Likes

I wouldn’t really recommend it to beginners, but my package GitHub - MasonProtter/Bumper.jl: Bring Your Own Stack is specifically designed for speeding up this pattern of allocating intermediate arrays, using them, then deallocating them soon afterwards.
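
Roughly, the usage pattern looks like this (a sketch based on the alloc/@no_escape API that is also referenced later in this thread; see the README for the authoritative version):

using Bumper

function double_sum(v)
    buf = default_buffer()
    @no_escape buf begin
        X = alloc(Float64, buf, length(v))  # bump-allocated, freed at end of block
        X .= v .* 2
        sum(X)                              # X must not escape the block
    end
end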

6 Likes

There is a set of optimizations a sufficiently smart optimizer can do. We are sometimes able to hoist an allocation out of a loop, or sink it out of the loop.

This currently only happens for Julia objects and not arrays, and it happens fairly late in the pipeline (in LLVM passes).

I do believe that we should be able to do a whole host of memory-based optimizations in the compiler, especially now that we have effect and escape analysis at the higher level, but we may still need alias analysis.

I have an old and stale PR up for loop-invariant code motion (LICM), but we probably need much better infrastructure for working with loops in the compiler first. At the time I got frustrated by how annoying it was to insert a basic block.

I do think the situation has much improved now and implementing LICM may be a good project to learn more about how the Julia compiler works.
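
For illustration, this is the kind of transformation LICM-style allocation hoisting would do automatically; here it is done by hand (work!, m, and n are placeholders):

m, n = 10, 100
work!(tmp, i) = (tmp .+= i)   # stand-in for the real loop body

# before: the allocation is loop-invariant but happens every iteration
for i in 1:m
    tmp = zeros(n)
    work!(tmp, i)
end

# after hoisting: one allocation, reused across iterations
tmp = zeros(n)
for i in 1:m
    fill!(tmp, 0.0)   # re-establish the state the loop body expects
    work!(tmp, i)
end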

11 Likes

This particular proposal would break backwards compatibility, which is a no-go.

1 Like

This seems like one possible feature that could help, in addition to the already existing features like the allocation profilers.

It’s not a proposal (as I am not qualified to make one). Rather, just an attempt to explain a general idea/question. (Of course we cannot change local, so let’s call it local2.)

2 Likes

The problem is that this isn’t just compiler development; it actually changes language semantics in a breaking way. You can get around that with a new keyword like local2, but you really shouldn’t add irreversible changes to a language if you don’t have to. In this case, we really don’t have to, not even with a macro. Allocating X before the loop and writing mutation code in the loop is clearer and less effort than patching allocation code with keywords and hoping the compiler knows what you intend it to do.

Forgetting about the manual keyword, escape analysis could hypothetically help with heap reuse if it recognizes that each X does not outlive the iteration’s scope. It’s easy for us to intend that, but it’s hard to enforce it throughout our method calls, and hard for the compiler to prove efficiently. And if the arrays assigned to X don’t have a fixed size, allocations still need to happen for some resizings.
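
For example, even a manually reused vector with varying lengths has to go through resize!, which can itself allocate when growing past the current capacity (a minimal sketch):

buf = Float64[]
for n in (100, 50, 200)
    resize!(buf, n)    # may allocate when growing beyond the current capacity
    fill!(buf, 0.0)
    # ... use buf ...
end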

Not always, especially when you have to calculate the types for a container.

I am sympathetic to @Paul_Soderlind’s problem, as I often encounter it when solving economic models. Specifically, a lot of calculations for heterogeneous agent models require work arrays which I can reuse for many calls of a function, eg

function calculate_moments_for_agent_type(model, type::Integer, work_buffers)
    # I need work buffers from here on
    ...
    # and I don't care what happens to them after
end

function calculate_aggregate_moments(model)
    work_buffers = allocate_work_buffers(model) # ideally this could be outside
    mapreduce(i -> calculate_moments_for_agent_type(model, i, work_buffers),
              add_moments, model.agent_types)
end

The problem is most acute when I want to use type-based AD (ForwardDiff), parallel threads (so that each thread uses one work buffer at a time), and buffers whose lifetime does not coincide with a function call. It is of course manageable, but a pain in the neck.
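
To illustrate the AD part: a buffer preallocated for Float64 cannot hold the dual numbers that ForwardDiff pushes through the same code (a toy example, not from my actual models):

using ForwardDiff

function objective!(buf, x)
    buf .= x .^ 2      # fails when x contains ForwardDiff.Dual numbers
    return sum(buf)
end

buf = Vector{Float64}(undef, 3)
objective!(buf, [1.0, 2.0, 3.0])                                # fine
ForwardDiff.gradient(x -> objective!(buf, x), [1.0, 2.0, 3.0])  # errors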

Bumper.jl looks promising, though; I just haven’t had a chance to evaluate it in large codebases. But if it works, it would solve my problem.

8 Likes

I think I’m misunderstanding this, because if the array in every iteration doesn’t have the same element type, then it would be impossible to refactor to reuse one array across iterations. That doesn’t sound like an issue for a bump allocator, which just allocates and frees each time.

I’m confused because your phrasing sounds like you are disagreeing with Tamas, but then you seem to just be restating his point: just preallocating an array and re-using it can be really awkward if you want to use it to store different types in different places.

2 Likes

In practice it is restricted to a few types (eg Float64, and the corresponding ForwardDiff.Dual{nightmare}), or maybe BigFloat when I am doing numerical debugging. Within the same evaluation, it is the same type.

Just a side note if you’re trying to evaluate whether Bumper.jl will work for you: only isbits types are supported in Bumper.jl. You could use something like MultiFloats.jl though, since those are isbits, or just replace calls to alloc(BigFloat, buf, size...) with calls to Array{BigFloat}(undef, size...) while testing.

I see now I didn’t mention that anywhere in the README, so I’ll fix that now.

4 Likes

Oh, I was just confused about why it was mentioned, because the use case sounded impossible to do with the array-reuse scheme I was talking about, whether it’s manual or transformed from allocation-per-iteration code. But it seems it’s not actually what I imagined.

1 Like

You can do it with array re-use; you just need a reinterpret step. In fact, I had a version of Bumper.jl which worked just fine using only reinterpret and view, but it was too slow because of inefficiencies in working with reinterpreted arrays. It was still faster than heap-allocating new arrays for a lot of use-cases, but it was a lot slower than using PtrArrays.
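
Schematically, that approach looks like this (the names and bookkeeping are made up; the real package also handles offsets, alignment, and so on):

bytes = Vector{UInt8}(undef, 2^16)   # one reusable byte buffer

# hand out an n-element array of type T backed by the start of the buffer
borrow(bytes, ::Type{T}, n) where {T} =
    reinterpret(T, view(bytes, 1:n * sizeof(T)))

a = borrow(bytes, Float64, 100)      # no fresh heap allocation for the data
a .= 1.0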

1 Like

On one project, I have a long async pipeline that needs arrays that flow all the way through. I created an “array manager” channel that allocates a new array when you call take on it; when you call put(manager, arr), it empties the array and offers it to the next call to take.

It had me wondering if there was any way for the GC to hook into the array allocator so that you don’t need to allocate from scratch every time: instead of freeing arrays, repurpose the allocated memory, with some kind of heuristic for freeing “stale” allocations if they aren’t needed after a while.
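
A rough sketch of that pattern (the type and function names are mine, not from the actual project):

# a pool of reusable vectors behind a Channel
struct VectorPool{T}
    ch::Channel{Vector{T}}
end
VectorPool{T}(capacity::Int) where {T} = VectorPool{T}(Channel{Vector{T}}(capacity))

# hand out a recycled vector if one is available, else allocate a fresh one
# (a real multi-task version would need to handle the isready/take! race)
take_vec!(p::VectorPool{T}, n) where {T} =
    isready(p.ch) ? resize!(take!(p.ch), n) : Vector{T}(undef, n)

# "empty" the vector and offer it to the next take_vec! call
function put_vec!(p::VectorPool, v)
    empty!(v)
    put!(p.ch, v)
end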

3 Likes

Why does it have to allocate every time? Why not store the array in the channel and just mutate it?

Just to give you another perspective. My background is engineering, not CS.

Not sure I agree. I think it is a matter of learning how to write performant code. The same lesson could apply in writing performant C (allocating an array once, passing the pointer around, then refilling as needed). In my experience, writing code this way is not particularly clumsy compared to any other style.

If you want to write performant code, I think you have to learn some of the basic best practices of writing good code: knowing a little bit about the software & hardware, and why allocation is slow. This post comes to mind.

1 Like

Sorry, I was unclear! It allocates a new array only if none have been returned yet when take is called. Once arrays start getting returned, it just serves as a queue of preallocated arrays! Since I have multiple jobs working in parallel, there are usually something like ntasks allocated arrays, but it depends on the timing of when one task is finishing and the next is starting.

4 Likes

The problem in practice is that Julia’s type system is orders of magnitude richer than C’s, and you run into this if your code uses packages that make use of it. Specifically, the same lines of source code will not necessarily run with the same types.

Idiomatic Julia code relies on semi-automatic type calculations and these do not mesh well with pre-allocated arrays. It can be made to work, but it is not as easy as you imagine.
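
A small example of what I mean by type calculations: the element type of a work buffer can depend on the input types, e.g.

# element type of x * y entries, computed without running the multiplication
make_buffer(x, y, n) = Vector{Base.promote_op(*, eltype(x), eltype(y))}(undef, n)

make_buffer([1.0], [2], 5)    # Vector{Float64}
# under ForwardDiff, the same call would produce a Vector of Dual numbers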

3 Likes

Julia’s garbage collector already does that.

julia> length(unique(x -> pointer(zeros(10_000)), 1:10_000))
576

If I allocate 10,000 arrays in a row, each of which has 10,000 entries, there are only 576 unique pointers. This is called object pooling, and it’s generally only worth it for big arrays; e.g., here’s what happens when I do length 100:

julia> length(unique(x -> pointer(zeros(100)), 1:10_000))
10000

But there’s a lot of other stuff involved in garbage collection that makes things slow besides just the allocation of an array. In fact, allocation is the fastest part; it’s setting things up so that memory can be tracked and freed that is generally slow.

8 Likes