Simulate Nested Array Without Allocating New Memory Every Loop?

To start off, here is some necessary information

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Reproducible Example: Start off each simulation on the same number and generate random integers until reaching an even number. Store that run into an array and repeat 5 times.

Goal: Accomplish this correctly (correctly being defined in the problem below) while minimizing memory allocation.

Problem:

  • a) Minimize memory allocation by pre-allocating the array, but the simulation is ruined because the array is passed by reference (more precisely, every call shares the same mutable array), so once the first run ends on an even number, all subsequent runs see that even number at the end and immediately break from the simulation, or
  • b) Correctly simulate all runs but destroy run time by allocating a new starting array for each simulation.

Note: This is a reprex of a much larger problem I am working on now. So while the time & allocation difference in this example is minimal, it is having a stronger effect in my code.

To start off, I’ll show you the code from problem part (a):

example01.jl

function gen_array(seq::Array{Int64,1})
    while true
        seq[end] % 2 == 0 && break

        num = rand(Int64)
        push!(seq, num)
    end
    return seq
end

function simulate_chains(n::Int64, seq::Array{Int64,1})
    map(i -> gen_array(seq), 1:n)
end

function main()
    N::Int64 = 5

    START = Array{Int64,1}(undef, 1)
    fill!(START, 1233)

    simulation = simulate_chains(N, START)
end

Here is me loading the script into the REPL and timing it twice, so that the second measurement excludes compilation:

julia> include("example01.jl")
main (generic function with 1 method)

julia> @time main();
  0.048245 seconds (220.16 k allocations: 11.701 MiB)

julia> @time main()
  0.000004 seconds (9 allocations: 480 bytes)
5-element Array{Array{Int64,1},1}:
 [1233, -4418181421764378021, -6736001003188972884]
 [1233, -4418181421764378021, -6736001003188972884]
 [1233, -4418181421764378021, -6736001003188972884]
 [1233, -4418181421764378021, -6736001003188972884]
 [1233, -4418181421764378021, -6736001003188972884]

Notice how the output is the same for each simulation. While this may be possible, it is certainly not probable. As mentioned in problem part (a), the START array is remaining “persistent” across simulations. This is not desired.
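The aliasing is easy to confirm with `===`, which tests object identity. A minimal sketch mimicking the `map` call from example01.jl (names here are stand-ins, not from the original script):

```julia
# The map in example01.jl returns the same vector object five times;
# `===` tests object identity, so this reproduces the aliasing directly.
seq = [1233]
runs = [seq for _ in 1:5]   # stand-in for map(i -> gen_array(seq), 1:5)
push!(seq, 42)              # mutating "one" run...
all(r === seq for r in runs)  # ...mutates all of them: true
```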

Now for the code from problem part (b)

example02.jl

function gen_array(seq::Array{Int64,1}=[1233])
    while true
        seq[end] % 2 == 0 && break

        num = rand(Int64)
        push!(seq, num)
    end
    return seq
end

function simulate_chains(n::Int64)
    map(i -> gen_array(), 1:n)
end

function main()
    N::Int64 = 5

    simulation = simulate_chains(N)
end

Notice how I’ve changed the code by giving gen_array() a default value and removing my pre-allocated array from main(). Here I am in a new REPL session running the code:

julia> include("example02.jl")
main (generic function with 1 method)

julia> @time main();
  0.061380 seconds (256.65 k allocations: 13.881 MiB)

julia> @time main()
  0.000004 seconds (15 allocations: 1008 bytes)
5-element Array{Array{Int64,1},1}:
 [1233, 1117329081981021236]
 [1233, -6637213868465549306]
 [1233, 4235110922391529754]
 [1233, 6888918319811551361, 8043136562908430093, 3767038082823508588]
 [1233, 6019512490854919842]

As you can see, I've allocated more memory in this version (although the time hasn't changed much here). In my real code, this change alone makes a significant time difference.

Is there a way to have my cake and eat it too?

You are constructing one array in your main and passing it by reference to the gen_array function. Instead, construct a new array inside gen_array and pass in a starting value. See the following:

function gen_array(START_VALUE)
    seq = Array{Int64, 1}()
    push!(seq, START_VALUE)
    while true
        seq[end] % 2 == 0 && break

        num = rand(Int64)
        push!(seq, num)
    end
    return seq
end

function simulate_chains(n::Int64, START_VALUE)
    map(_ -> gen_array(START_VALUE), 1:n)
end

function main()
    N::Int64 = 5

    #START = Array{Int64,1}(undef, 1)
    #fill!(START, 1233)

    simulation = simulate_chains(N, 1233)
end

edit: oh I see. I am not sure how you would get rid of the allocations, since you don't know the size of the array to begin with.

In your first code example, simulate_chains just gets the same vector, seq, in each iteration. The first time, it fills it up until the last value is even (you can just write while isodd(seq[end]) btw). All the other times, it just gets the same vector over and over, checks that the last value is even and bails out immediately.

In the second code example, you give it a new starting vector for each iteration, so that issue disappears.

Sorry, I don’t have time to look at the performance at the moment.

Yes, that much is clear to me. My question is more along the lines of how can I keep the speed/mem allocation of my first example while getting the right answer of my second example.

I think I’d pass a cache vector into gen_array! (use a ! in the name when mutating arguments). The cache vector could have some reasonable length to make it unlikely to be too short (e.g. with length 20, the risk of overflowing is roughly 1 in a million). Then fill it up and finally return a slice.

If you get too many elements, start pushing or use resize!.
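The resize! option looks like this (a minimal sketch; note that the new slots are uninitialized for `Int64`, so they must be written before being read):

```julia
# Grow the cache in one step rather than element by element.
v = collect(1:20)
resize!(v, 40)   # existing elements preserved; slots 21:40 uninitialized
v[21] = 0        # write before reading any of the new slots
```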

Something along these lines (warning: untested, and hastily written):
Edit: the code did have (at least) one bug. Fixed now.

function gen_array!(cache)
    for i in 2:length(cache)
        num = rand(eltype(cache))
        cache[i] = num
        iseven(num) && return cache[1:i]  # slices make copies
    end
    num = 1  # initialize
    while isodd(num)
        num = rand(eltype(cache))
        push!(cache, num)
    end
    return copy(cache)  # remember to copy
end

BTW, you should definitely use the BenchmarkTools package to get better timing and memory estimates. Especially for fast functions like the ones you have posted, the timings from @time are almost meaningless, because its precision is too poor.
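For example (assuming the BenchmarkTools package is installed, e.g. via `] add BenchmarkTools`; `$` interpolates arguments so that setup cost isn't timed). The toy function here is a stand-in; with example02.jl loaded you would benchmark `main()` the same way:

```julia
using BenchmarkTools

# Toy stand-in for main(), to make this snippet self-contained.
gen(n) = [rand(Int64) for _ in 1:n]

@btime gen(100)          # robust timing over many samples
b = @benchmark gen(100)  # full statistics, including allocation counts
```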

No, that’s the point. You cannot keep that speed, because it’s misleading. It’s only that fast because seq[end] % 2 == 0 is always true immediately (except for the very first time), so then it just bails out.

Ahh I see what you’re saying. I can’t believe that didn’t dawn on me. I am currently testing your code above.

Is there anything special about a “cache vector”, or is that just a normal pre-allocated array like the one I made in my first example?

Nothing special; it’s just an ordinary vector that gets reused across calls to gen_array!.
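Concretely, the reuse might look like this (a hypothetical driver; `simulate_chains!` and the seeding convention are my own names, not from the thread, and the first block is a condensed copy of the gen_array! definition above so the snippet runs standalone):

```julia
# Condensed copy of gen_array! from above.
function gen_array!(cache)
    for i in 2:length(cache)
        num = rand(eltype(cache))
        cache[i] = num
        iseven(num) && return cache[1:i]  # slices make copies
    end
    num = 1  # odd sentinel so the loop runs at least once
    while isodd(num)
        num = rand(eltype(cache))
        push!(cache, num)
    end
    return copy(cache)
end

# Hypothetical driver: allocate the cache once, seed cache[1] with the
# start value, and reuse the same buffer for every run.
function simulate_chains!(cache, n)
    map(_ -> gen_array!(cache), 1:n)
end

cache = Vector{Int64}(undef, 20)
cache[1] = 1233
simulation = simulate_chains!(cache, 5)
```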


Perfect, your code works very well. I like eltype; I hadn’t thought to use it before. Do we return a copy so as to not actually update the cache vector on the heap?

Or are we constantly rewriting over the array after updating it every time? Looking at the code longer, I think this is what we are doing.

We are updating the cache vector, but we need a copy (or slice) in order to make sure that each return vector is independent.
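The difference is easy to see in a toy example: a slice (or `copy`) detaches the returned vector from the cache, while returning `cache` itself does not:

```julia
cache = [1233, 4, 7, 8]

aliased  = cache         # no copy: same object as the cache
snapshot = cache[1:2]    # slice: an independent copy

cache[2] = 99
aliased[2]   # 99 — follows the mutation of the cache
snapshot[2]  # 4  — unaffected by later mutation
```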

Apologies for coming back to this later; I’ve been thinking about this code all day. Instead of resizing the array if it gets too big, could we potentially use recursion to repeat until it hits an even value?

Something along these lines:

function gen_array!(cache)
    for i in 2:length(cache)
        num = rand(eltype(cache))
        cache[i] = num
        iseven(num) && return cache[1:i]  # slices make copies
    end
    return gen_array!(cache)  # recurse until an even value appears
end

My reasoning being: if we hit the size limit of the vector, the loop will have stopped without returning, so we go again, and since we only ever return cache[1:i], we never have to worry about what is left over in the array.

It depends on what output you desire: this way you don’t get the whole sequence of random numbers, only the sequence modulo the cache vector length. Do you need the whole sequence at all?

@DNF’s solution is the preferable one, for reasons that are only apparent if you know the implementation in array.c.

push! sometimes needs to resize a vector. In addition to their length, vectors have a capacity: the amount of extra memory reserved. When you push! and the capacity is reached, the vector is grown by a large factor (typically 2x; details in array.c), the old contents are copied over, and the old buffer is freed (no garbage collection needed). Even in the cheap case (capacity left over), push! is still surprisingly expensive, due to compiler/linker limitations (a foreign function call that cannot be inlined, 10 cycles down the drain).

From this, you learn:

  1. push! is slower than setindex!, even if nothing gets allocated (enough capacity);
  2. push! into a vector with enough capacity is medium-cheap, so make sure to have capacity; and
  3. if you ever push! into a vector, it will waste up to 2x the memory in spare capacity, in order to make subsequent push! fast.
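Point 2 can be acted on with `sizehint!`, which reserves capacity up front so a subsequent run of push! calls stays on the cheap path (a small sketch; the exact growth strategy is an implementation detail of array.c):

```julia
# Reserve capacity before a push!-heavy loop to avoid reallocation.
v = Int64[]
sizehint!(v, 1_000)   # capacity reserved; length is still 0
for i in 1:1_000
    push!(v, i)       # cheap path: no buffer growth needed
end
```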

Watch how @DNF’s code takes all of these facts into account: it has only a single cache object (nit: name it buffer) that gets dynamically resized, and the return value is always a copy/slice, i.e. a tight allocation (no spare capacity that would likely be wasted, since you don’t plan to push! into your runs after construction). Further, cache retains its size between subsequent runs, replacing push! calls with faster setindex! (the number of push! calls equals the length of the longest run and does not scale with the number of runs).
