Save complex state in function and reuse on next function call

I have the following function makebatch that creates batches for my model from a text file. The file is processed line by line, and sometimes processing a line causes the output to exceed the desired batch size. In that case I want to reuse the overflow the next time I call makebatch. The output of the function is a vector of integers of length BATCH_SIZE.

Currently I have a working version which returns the finished batch and the part to be reused for the next batch as a tuple. I see that I could also declare a global next_batch which would then be used at the next function call. However, I think this would come with a performance penalty(?), and it is also kind of ugly.

I saw In Julia, how to create a function that saves its own internal state? but I am not exactly sure how to apply it, because the state I save is somewhat more complex (again, a vector of integers like the output, just not as long as BATCH_SIZE).

Very grateful for any tips on how to handle this!

function makebatch(IO::IOStream, VOCAB::Vocabulary, BATCH_SIZE::Int, next_batch::Vector{Int} = Vector{Int}())
    batch = next_batch  # start from whatever was left over last time
    while length(batch) < BATCH_SIZE
        line = readline(IO)
        idcs = words2idcs(line, VOCAB)
        # skip lines that could not be converted; append! avoids splatting costs
        isnothing(idcs) ? continue : append!(batch, idcs)
    end

    # if the last line pushed us past BATCH_SIZE, move the overflow
    # into next_batch for the following call
    if length(batch) > BATCH_SIZE
        next_batch = Vector{Int}()
        while length(batch) > BATCH_SIZE
            push!(next_batch, pop!(batch))
        end
    end
    return (batch, next_batch)
end

Reading the post, I’m not sure what your goal is. Do you want faster code, or is your goal a more elegant implementation?
Your current design already looks quite straightforward.
(As you said, using a global here is probably worse.)

If you have more data to pass on to the next iteration, you could also consider using NamedTuple (and UnPack.jl).
Alternatively, maybe something like Python’s yield would be useful, e.g. Manual · Continuables.jl
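For the closure approach from the post you linked, a minimal sketch might look like this (make_batcher and its vector input are made up for illustration; your real version would read from the file instead):

```julia
# A closure created inside a function keeps `leftover` alive between calls.
function make_batcher(batch_size::Int)
    leftover = Int[]  # this is the state that survives across calls
    return function (items::Vector{Int})
        batch = vcat(leftover, items)
        if length(batch) > batch_size
            leftover = batch[batch_size+1:end]  # stash the overflow
            resize!(batch, batch_size)
        else
            leftover = Int[]
        end
        return batch
    end
end

batcher = make_batcher(4)
batcher([1, 2, 3, 4, 5, 6])  # → [1, 2, 3, 4], keeps [5, 6]
batcher([7, 8])              # → [5, 6, 7, 8]
```

One caveat for your performance goal: a captured variable that is reassigned (like leftover here) gets boxed by Julia’s closure lowering, which can itself cost performance, so explicitly passing the state around as you do now may well be faster.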

Sorry if I was a bit unclear. I want the former (faster code) and was hoping to get the latter (more elegant implementation) along the way.

I’m taking a look at both of your suggestions now and will make sure to report back :slight_smile:


Just a small thing. I think you can replace

    if length(batch) > BATCH_SIZE
        next_batch = Vector{Int}()
        while length(batch) > BATCH_SIZE
            push!(next_batch, pop!(batch))
        end
    end

with

next_batch = batch[BATCH_SIZE+1:end]
resize!(batch, BATCH_SIZE)

That’s at least shorter :slight_smile: maybe even faster.
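To make the behavior concrete, with a made-up batch and BATCH_SIZE = 3:

```julia
batch = [10, 20, 30, 40, 50]
BATCH_SIZE = 3

next_batch = batch[BATCH_SIZE+1:end]  # copies out the overflow: [40, 50]
resize!(batch, BATCH_SIZE)            # truncates in place:      [10, 20, 30]
```

This version also hands the leftover elements over in their original order, whereas the pop!/push! loop reverses them.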


Maybe as a follow-up for future reference:

I have stuck with the implementation that returns next_batch, stores it in a global, and passes it back in as a function argument on the next call. The performance seems good enough for my use case, and I don’t lose type stability, because the state stored in the global is handed back to the function as an argument rather than accessed as a global inside it.

(Btw, resize! was just as fast as the original approach for the small batch sizes that I use and tested with, but it is a little more concise, which I liked.)
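For future readers, here is a self-contained sketch of that call pattern. The makebatch here is a stub that consumes plain vectors of indices instead of a file and vocabulary, so everything except the threading of next_batch is hypothetical:

```julia
# Stub version of makebatch: takes a chunk of already-converted indices
# instead of reading from a file, but threads next_batch the same way.
function makebatch_stub(chunk::Vector{Int}, batch_size::Int, next_batch::Vector{Int})
    batch = vcat(next_batch, chunk)       # prepend last call's overflow
    next_batch = batch[batch_size+1:end]  # copy out the new overflow
    resize!(batch, min(length(batch), batch_size))
    return batch, next_batch
end

function collect_batches(chunks, batch_size)
    next_batch = Int[]  # the state lives in the caller, not in a global
    batches = Vector{Vector{Int}}()
    for chunk in chunks
        batch, next_batch = makebatch_stub(chunk, batch_size, next_batch)
        push!(batches, batch)
    end
    return batches
end

collect_batches([[1, 2, 3], [4, 5, 6, 7], [8]], 3)  # → [[1, 2, 3], [4, 5, 6], [7, 8]]
```

(Unlike the real makebatch, the stub does not keep reading until a batch is full, so the last batch may come out short; it only illustrates how the overflow is carried from call to call.)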