Hi, I’ve enjoyed using Julia on my first research project quite a lot, and I’m now thinking of how to set up my code for my second research project. The computational bottleneck in this project lies in performing many computationally cheap simulations; I would like to make sure that I set this up in an efficient way, so let me begin by explaining what I will need to do:

I will need to simulate N independent time-series. Each of these simulations will consist of an ex-ante unknown number of simulation steps (we repeatedly simulate the time-series from time-step i to (i+1) until some condition is met). For subsequent calculations, I will need to have access to each step of each time-series.

Q1. I assume that initialising a large array/expanding an array to fit the various time-series as they are being simulated is computationally expensive, so I guess it is better to proceed differently? Would something like a list of arrays (each list entry is 1 time series, each array contains the value of the process at the various time-points) be a good idea? Or would it be better to use a list of lists of arrays (each outer list entry is 1 time series, each inner list entry is 1 time-step, each array is the value of the process at this time-step)? Or is yet another idea better?
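For concreteness, the "list of arrays" layout from the question could look like this in Julia (a minimal sketch; the names and lengths are illustrative only):

```julia
# One growable vector per time-series; the outer vector holds all N series.
# Vector{Vector{Float64}} is the "list of arrays" layout described above.
series = Vector{Vector{Float64}}()

for n in 1:3                      # N = 3 series, just for illustration
    path = Float64[]              # one time-series, length unknown in advance
    for i in 1:(n + 1)            # pretend each series has a different length
        push!(path, Float64(i))   # push! amortizes the cost of growing
    end
    push!(series, path)
end

series[2][1]   # value of series 2 at its first recorded step
```

Since `push!` grows vectors with amortized reallocation, appending step-by-step is usually much cheaper than it looks.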

Q2. I will need to simulate many different combinations of processes, so in my previous project, I’ve been building small simulation blocks for different processes, patching them together as I needed them. This means that several functions (‘layers’) need to be called in sequence to simulate any given process from step i to step i+1. Is this (modulo implementation) fine, or would it be better to somehow ‘compile’ the various ‘layers’ into one function that performs the simulation from step i to step i+1 directly?

I’m curious to hear your thoughts, and of course I can clarify/elaborate where necessary!

I think your description is still too vague to really give good recommendations. Could you perhaps provide an MWE/mock-up of the actual structure of the code? It does not necessarily need to be runnable, but that would be nice. Bonus points if the computations have realistic characteristics.


Hi, I can certainly type out some code for this tomorrow! For now, let me give a brief description of the algorithm to see if that conveys my intentions clearly. I’m very open to setting up the code in a way that pays off in the long run: large parts of my old code will need a massive overhaul for this project anyway, so I want to make as good use of the rewrite as possible. This is also why I don’t have a good piece of code to post as an MWE yet!

A typical simulation looks like this, where we have a 2-dimensional process (X,M) to keep track of. We will need to repeat the entire procedure N times.

Step 0: initial values are used, say X[0] = 0 and M[0] = 0.

Step i + 1: Function 1: Simulate X[i+1] = X[i] + Standard_normal. Function 2: M[i+1] = M[i] + 1 if X[i+1] > X[i], M[i] otherwise. When M[i+1] = 10, stop simulating this path.

Effectively, we are simulating a 1-dimensional random walk with Gaussian steps, where we track the number of up-steps and stop after stepping up 10 times. As the number of steps taken before reaching 10 up-steps can be arbitrarily large, I foresee problems in initializing a 3-dimensional array to store all my simulations in.
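A minimal sketch of one such path in Julia (the layer names, `simulate_path`, and keyword are my own invention; only the rule "Gaussian steps, stop after 10 up-steps" is taken from the description):

```julia
using Random

# Simulate one path of the walk X with up-step counter M,
# stopping once M reaches `max_up` up-steps (10 in the description).
function simulate_path(rng::AbstractRNG; max_up::Int = 10)
    X = [0.0]            # X[0] = 0 (Julia arrays are 1-based, so this is X[1] here)
    M = [0]              # M[0] = 0
    while M[end] < max_up
        x_new = X[end] + randn(rng)                  # Function 1: Gaussian step
        m_new = M[end] + (x_new > X[end] ? 1 : 0)    # Function 2: count up-steps
        push!(X, x_new)
        push!(M, m_new)
    end
    return X, M
end

X, M = simulate_path(MersenneTwister(42))
M[end]   # == 10: the path stopped at exactly `max_up` up-steps
```

The vectors grow step-by-step with `push!`, so no up-front size guess is needed.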

I’ve illustrated that in step i+1, I’m calling two different functions: in another simulation, I might simulate two random walks X and Y, and e.g. let M count how often we have seen X[i] > Y[i] so far. For ease of reference, I’ll call any such sequential calls to functions to simulate a step ‘layers’; this is what my Q2 refers to.

If necessary/preferred, I’m happy to send a more elaborate example tomorrow, but I hope this gets the gist across?

For Q2: Composing functions from simpler pieces sounds fine; e.g., `f2 ∘ f1` is essentially “compiled into one function” by the Julia compiler.
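As a small illustration (the layer names and the named-tuple state are made up, and the step is deterministic just to keep the example checkable), per-step "layers" compose cleanly with `∘`:

```julia
# Two illustrative "layers" acting on a named-tuple state.
# Layer 1 advances X (deterministically here); Layer 2 increments M
# whenever the step went up, mirroring the two functions in the example.
advance_X(s) = (x_old = s.x, x = s.x + s.dx, m = s.m, dx = s.dx)
count_up(s)  = (x_old = s.x_old, x = s.x, m = s.m + (s.x > s.x_old), dx = s.dx)

# `∘` composes right-to-left: advance_X runs first, then count_up.
step = count_up ∘ advance_X

s = (x_old = 0.0, x = 0.0, m = 0, dx = 1.0)
s = step(s)   # x == 1.0, m == 1
s = step(s)   # x == 2.0, m == 2
```

Once the compiler specializes `step` for the state type, the composed call is as fast as a hand-written single function.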

For Q1: One option might be to hook into the iterator interface, i.e., effectively making your simulation an iterator of unknown size and relying on `collect` being efficiently implemented for such iterators already. There might be better options, though …
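A hedged sketch of what this could look like (`UpStepWalk` and the deterministic increment are my inventions; the point is the interface): declaring `Base.IteratorSize` as `SizeUnknown()` tells `collect` that it cannot preallocate and must grow the result as it goes.

```julia
# An iterator over the steps of one path, with length unknown in advance.
struct UpStepWalk
    max_up::Int
end

# State is the (x, m) pair; stop once m reaches max_up.
function Base.iterate(w::UpStepWalk, state = (0.0, 0))
    x, m = state
    m >= w.max_up && return nothing
    x_new = x + 1.0                   # stand-in for a random increment
    m_new = m + (x_new > x)
    return (x_new, m_new), (x_new, m_new)
end

Base.IteratorSize(::Type{UpStepWalk}) = Base.SizeUnknown()
Base.eltype(::Type{UpStepWalk}) = Tuple{Float64, Int}

path = collect(UpStepWalk(10))   # collect handles the unknown length
length(path)                     # 10 steps before the stopping rule fires
```

The same pattern works with a random increment; the iterator just stops returning values once the condition is met.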

Thank you for your suggestions!

For Q1, I hadn’t thought of setting it up as an iterator yet; this may certainly work, in particular if I can run multiple simulations in parallel and fill up the iterator as it goes. If I understand you correctly, you would then get an iterator (walking through simulations) over iterators (walking through steps), right?

One follow-up question: in my analysis I will then need to efficiently access the data, with the typical queries being:

- iterate through the simulations, performing some routine at a given step;
- extract step n for all simulations that have at least n steps, e.g. to perform some kind of regression.

I think the former works nicely with an iterator; what are your thoughts on the latter?
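For the second query, if the collected results end up in a "vector of vectors" layout, a comprehension with a length guard is one straightforward option (a sketch; `sims` and `n` are placeholders):

```julia
# Three mock simulations of unequal length.
sims = [[0.1, 0.2, 0.3],
        [0.4, 0.5],
        [0.6, 0.7, 0.8, 0.9]]

n = 3
# Step n for every simulation that has at least n steps.
cross_section = [sim[n] for sim in sims if length(sim) >= n]
```

This walks every simulation once, so it is O(N) per query; if such cross-sections are needed often, caching them or padding into a matrix with a missing-value sentinel may be worth considering.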

Is a more detailed (pseudo-)code or MWE still desirable, or is the outline of my problem clear at this point?