Triple buffering

Hi, I’m curious if anyone has tried triple buffering. This is a generally useful framework, but I’m specifically thinking allocation-free MCMC (or maybe more accurately, iteration-independent allocation count).

I think it’s something like this:

function triple_buffer!(init, write!, log!, num_iterations)
    reading = init
    writing = deepcopy(init)
    logging = deepcopy(init)

    # Fill the pipeline
    write!(writing, reading)
    (reading, writing, logging) = (writing, logging, reading)
    write!(writing, reading)

    for i in 1:(num_iterations-2)
        (reading, writing, logging) = (writing, logging, reading)

        @sync begin
            @async write!(writing, reading)
            @async log!(logging)
        end       
    end

    # Empty the pipeline
    (reading, writing, logging) = (writing, logging, reading)
    log!(logging)
    (reading, writing, logging) = (writing, logging, reading)
    log!(logging)

    return nothing
end

Here write!(dest, src) takes a step in the whatever space, e.g. an HMC step, and log! outputs the results. I’m assuming we might not need to store the entire state at each step, and also that write! and log! might be closures, so they could have some fixed “scratch space” set aside to use for intermediate computations as needed. Oh, and both could themselves be parallel as well, of course.

Any thoughts? Too simple? Too complicated?

Especially interested if performance-oriented folks like @Elrod and @mohamed82008 think this approach could get good performance, or if there’s generally a better way to go about this.

Why the @async?
This seems rather general.
I think I’d have to try it out before I can make more informative comments.
My first comment would be that I think you could generally do better than 3x copies of init.
But anything heavy should of course be in the write! closure (or whatever the appropriate closure is).

For HMC I had just one, and copied the contents into the saved samples. The need to copy is the price you pay for having zero-allocation trees; you can’t just store the last state if that state is part of a preallocated buffer and about to be over written.
I think this is generally a lower cost with HMC, especially when sampling chains from multiple threads.
I think this is a similar idea.

1 Like

The idea is to keep reading, writing, and logging from slowing each other down. Separating them makes it so we can do all three at the same time. For example, if I only had one copy I’d have to take a step and then wait for information to be copied out. Double-buffering should do better. Triple might be a bit extreme, guess it may depend whether we end up bound by IO or computation.

I haven’t done much with parallel programming in Julia (mostly OpenMP years ago), but the @async is just there to allow the parallelism.