Hi, I’ve enjoyed using Julia on my first research project quite a lot, and I’m now thinking about how to set up the code for my second research project. The computational bottleneck in this project is running a large number of individually cheap simulations, so I would like to set this up efficiently. Let me begin by explaining what I will need to do:
I will need to simulate N independent time-series. Each of these simulations will consist of an ex-ante unknown number of simulation steps (we repeatedly simulate the time-series from time-step i to (i+1) until some condition is met). For subsequent calculations, I will need to have access to each step of each time-series.
Q1. I assume that initialising a large array/expanding an array to fit the various time-series as they are being simulated is computationally expensive, so I guess it is better to proceed differently? Would something like a list of arrays (each list entry is 1 time series, each array contains the value of the process at the various time-points) be a good idea? Or would it be better to use a list of lists of arrays (each outer list entry is 1 time series, each inner list entry is 1 time-step, each array is the value of the process at this time-step)? Or is yet another idea better?
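To make those options concrete, here is roughly what I have in mind (just a sketch with placeholder sizes, not actual project code):

```julia
N = 1000   # number of independent time-series (placeholder)

# Option A: a vector of vectors, one growing vector per time-series
paths = [Float64[] for _ in 1:N]            # push!(paths[n], x) at every step

# Option B: a vector of vectors of small arrays, one array per time-step
paths_b = [Vector{Float64}[] for _ in 1:N]  # push!(paths_b[n], [x, m]) at every step
```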
Q2. I will need to simulate many different combinations of processes, so in my previous project, I’ve been building small simulation blocks for different processes, patching them together as I needed them. This means that several functions (‘layers’) need to be called in sequence to simulate any given process from step i to step i+1. Is this (modulo implementation) fine, or would it be better to somehow ‘compile’ the various ‘layers’ into one function that performs the simulation from step i to step i+1 directly?
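Purely schematically, the ‘layers’ I have in mind look like this (made-up toy functions, not the real simulation blocks):

```julia
# each layer advances part of the state from step i to step i+1
layer1(state) = (x = state.x + randn(), m = state.m)        # simulate the driving process
layer2(state) = (x = state.x, m = state.m + (state.x > 0))  # update a derived quantity

# one simulation step is then several layer calls in sequence
step(state) = layer2(layer1(state))

state = (x = 0.0, m = 0)
state = step(state)
```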
I’m curious to hear your thoughts, and of course I can clarify/elaborate where necessary!
I think your description is still too vague to really give good recommendations. Could you perhaps provide some MWE/mock-up of the actual structure of the code? It does not necessarily need to be runnable, but that would be nice. Bonus points if the computations have realistic characteristics.
Hi, I can certainly type out some code for this tomorrow! For now, let me give a brief description of the algorithm to see if that conveys my intentions clearly. I’m very open to setting up the code in a way that pays off in the long run: large parts of my old code will need a massive overhaul for this project anyway, so I want to make the most of that opportunity! This is also why I don’t have a good piece of code to post as an MWE yet!
A typical simulation looks like this, where we have a 2-dimensional process (X,M) to keep track of. We will need to repeat the entire procedure N times.
Step 0: initial values are used, say X[0] = 0 and M[0] = 0.
Step i + 1: Function 1: simulate X[i+1] = X[i] + Standard_normal. Function 2: set M[i+1] = M[i] + 1 if X[i+1] > X[i], and M[i+1] = M[i] otherwise. When M[i+1] = 10, stop simulating this path.
Effectively, we are simulating a 1-dimensional random walk with Gaussian steps, where we track the number of up-steps and stop after stepping up 10 times. As the number of steps taken before reaching 10 up-steps can be arbitrarily large, I foresee problems in initializing a 3-dimensional array to store all my simulations in.
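In (pseudo-)Julia, one path of this mock example would look roughly like this (function and variable names made up on the spot):

```julia
function simulate_path()
    X = [0.0]                                       # the random walk
    M = [0]                                         # running count of up-steps
    while M[end] < 10
        xnew = X[end] + randn()                     # Function 1: advance the walk
        mnew = xnew > X[end] ? M[end] + 1 : M[end]  # Function 2: count up-steps
        push!(X, xnew)
        push!(M, mnew)
    end
    return X, M
end

paths = [simulate_path() for _ in 1:1000]           # N = 1000 as a placeholder
```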
I’ve illustrated that in step i+1, I’m calling two different functions: in another simulation, I might simulate two random walks X and Y, and e.g. let M count how often we have seen X[i] > Y[i] so far. For ease of reference, I’ll call any such sequential calls to functions to simulate a step ‘layers’; this is what my Q2 refers to.
If necessary/preferred, I’m happy to send a more elaborate example tomorrow, but I hope this gets the gist across?
For Q2: Composing functions from simpler pieces sounds fine; e.g., f2 ∘ f1 is essentially “compiled into one function” by the Julia compiler.
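A toy example of what I mean (hypothetical layers):

```julia
f1(x) = x + 1.0          # 'layer' 1
f2(x) = 2x               # 'layer' 2
g = f2 ∘ f1              # composed step function

g(3.0)                   # 8.0, same as f2(f1(3.0))
# @code_llvm g(3.0)      # typically shows both layers inlined into one compiled function
```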
For Q1: One option might be to hook into the iterator interface, i.e., effectively make your simulation an iterator of unknown size and rely on collect already being implemented efficiently for such iterators. There might be better options though …
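For your random-walk example, a rough sketch of that idea (untested, and the details will of course differ for your real processes):

```julia
# one path as an iterator of a priori unknown length:
# yields X[i] until 10 up-steps have occurred
struct UpStepWalk
    target::Int
end

Base.IteratorSize(::Type{UpStepWalk}) = Base.SizeUnknown()
Base.eltype(::Type{UpStepWalk}) = Float64

function Base.iterate(w::UpStepWalk, state = (0.0, 0))
    x, m = state
    m >= w.target && return nothing      # stopping condition reached
    xnew = x + randn()                   # Function 1: advance the walk
    mnew = xnew > x ? m + 1 : m          # Function 2: count up-steps
    return xnew, (xnew, mnew)
end

path = collect(UpStepWalk(10))           # Vector{Float64}, length unknown beforehand
```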
Thank you for your suggestions!
For Q1, I hadn’t thought of setting it up as an iterator yet; this may certainly work, in particular if I can run multiple simulations in parallel and fill up the results as they come in. If I understand you correctly, you would then get an iterator (walking through simulations) over iterators (walking through steps), right?
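(For the parallel part, I was thinking of something as simple as this, reusing your UpStepWalk sketch:)

```julia
N = 1000
paths = Vector{Vector{Float64}}(undef, N)
Threads.@threads for n in 1:N            # each path is independent
    paths[n] = collect(UpStepWalk(10))
end
```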
One follow-up question: in my analysis I will then need to efficiently access the data, with the typical queries being:
- iterate through the simulations and perform some routine at a given step,
- extract step n for all simulations that have at least n steps, e.g. to perform some kind of regression.
I think the former works nicely with an iterator; what are your thoughts on the latter?
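For concreteness, assuming each path ends up stored as a plain vector, I imagine the latter query as something like:

```julia
n = 25                                                    # example step index
step_n = [path[n] for path in paths if length(path) >= n]
```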
Is more detailed (pseudo-)code or an MWE still desirable, or is the outline of my problem clear at this point?