My specific question is whether TypedTables will continue to be maintained and whether using it in my project is a risk. The question comes at the bottom, after I explain why it seems advantageous for my use case (borne out by some preliminary benchmarking with a 2M-row array).
I have changed a group-based simulation of COVID to an individual-level model. In the group model, I have groups for 5 age groups by 8 disease conditions by 25 lags (days sick, which must be tracked). The group model is crazy fast: it is fast to add/subtract a few hundred or thousand people from a group when their status changes. But there are some limitations to the scenarios and policy "models" that can be run with groups; "test and trace" is tricky, for example.
An individual-level model (the original work stems from livestock infection simulations but is "state of the art" for elaborate epidemiological modelling) allows grouping of individuals by many characteristics. Test and trace is easier because we can make sure someone doesn't get tested every day, for example. When we model decline in immunity, we can track how long it has been since someone was vaccinated or recovered. And much can be done with groups by filtering on a trait that many individuals share. But the tradeoff is running slower: updating the status of 5000 people means creating a filter/index and updating all the matching rows, versus adding 5000 to a number.
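To make the cost contrast concrete, here is a minimal sketch (the array shapes and the `status`/`lag` columns with their condition codes are made up for illustration, not taken from my actual model):

```julia
# Group model: moving 5000 people between disease conditions is one
# subtraction and one addition on a small counts array.
counts = zeros(Int, 5, 8, 25)            # age group × condition × lag
counts[2, 1, 10] = 20_000
counts[2, 1, 10] -= 5_000                # leave "condition 1, lag 10"
counts[2, 2, 1]  += 5_000                # enter "condition 2, lag 1"

# Individual-level model: the same change means scanning columns,
# building an index of matching rows, and writing every one of them.
const INFECTED, RECOVERED = 1, 2         # hypothetical condition codes
status = fill(INFECTED, 2_000_000)       # one row per person
lag    = fill(10, 2_000_000)
idx = findall(i -> status[i] == INFECTED && lag[i] == 10, 1:5_000)
status[idx] .= RECOVERED                 # update each matching row
lag[idx]    .= 1
```

The group update is O(1) regardless of how many people move; the individual update is O(rows scanned) plus an allocation for the index.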
The population data is all integer. All the data is program-generated, so we don't need categorical arrays, etc.; the original encoding already provides the categorizations. It's not that hard to keep track of and keep consistent because it's all programmatic.
This is a very different use case from working with real-world live data and all of its lovely messiness. Everything must be mutable: each day of a simulation can change anything (except age group). So I saw little benefit to DataFrames. Indexing strategies also don't help because any column may change for an arbitrary number of rows, so an index must be recalculated each "day". I do precalculate the age-group index (groups of 20 years). Even though there are birthdays all the time, I can safely ignore the age change when someone goes from 59 to 60; there are already so many approximations…
I have the individual-level model running at 5x the time of the group model, which actually seems pretty good. A few more optimizations are possible: either switch to pre-allocated bit arrays to save allocations (the values change each day, but the size won't, since it must match the size of the entire population), or find some way to pre-allocate linear indices. Linear indices to rows are much more convenient to work with because they are smaller and easy to iterate, with the disadvantage of lots of allocations: each "day" the index will be a different size, because a different number of people are affected.
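The two options I'm weighing look roughly like this (a sketch with placeholder sizes and hypothetical condition codes, not my actual update logic):

```julia
pop_size = 1_000_000
status = [mod1(i, 8) for i in 1:pop_size]   # hypothetical condition codes 1..8

# Option 1: a pre-allocated BitVector mask, reused every "day".
# One bit per person, so its size never changes; refilling it with a
# fused broadcast (`.=`) allocates nothing.
mask = falses(pop_size)
mask .= status .== 3                        # who matches today's filter
for i in eachindex(mask)
    mask[i] && (status[i] = 4)              # update matching rows in place
end

# Option 2: linear indices. Smaller and easier to iterate, but `findall`
# allocates a fresh Vector{Int} each "day" because its length varies
# with the number of people affected.
idx = findall(==(5), status)
status[idx] .= 6
```

The mask trades a full-population scan every day for zero allocation; the index trades a per-day allocation for touching only the affected rows afterward.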
Still, the algorithms for the individual model are much more obvious, and I've made some algorithm improvements that got me within that 5x.
The question: it seems like TypedTables offers some performance advantages:
I don't use many columns: currently 13, but most of the work is done with 4. I could easily make a second population array to hold the less frequently accessed columns.
Rows (i.e., people) never go away and are never rearranged. Someone recovers or dies simply by a change to a column value. So the index to "who" is stable across the entire simulation, and the index to traits is stable within a single day (at least the "before" indices are).
The population matrix can be large for a large locale (8.3 million rows for NY, for example). Even with all-integer data, the memory accesses needed to gather multiple values (across columns) for the individuals who meet a filter are a real cost, despite the consistent stride. With TypedTables, it is very fast to reference the row tuple values.
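For anyone following along: as I understand it, TypedTables.Table is a thin struct-of-arrays wrapper where each column is its own contiguous vector and each row comes back as a NamedTuple. A Base-only sketch of the same layout (column names here are hypothetical, and `row` is my own helper, not a TypedTables function):

```julia
# Struct-of-arrays layout, the pattern TypedTables.Table wraps:
# each column is its own contiguous vector, so a pass touching only
# `status` and `lag` never loads the other columns into cache.
pop = (status = fill(1, 1_000),
       lag    = zeros(Int, 1_000),
       vaxday = zeros(Int, 1_000))

# A "row" is just a NamedTuple of the values at one index;
# `map` over a NamedTuple of columns preserves the field names.
row(t, i) = map(col -> col[i], t)

r = row(pop, 42)
r.status                       # type-stable field access on the row tuple

# Contrast with a Matrix{Int}: Julia is column-major, so gathering one
# person's values jumps `nrows` elements between consecutive columns.
popmat = zeros(Int, 1_000, 3)
popmat[42, 1]                  # same value, but a wide stride per column
```

Because the column types are in the table's type, row access compiles to plain vector loads with no dynamic dispatch, which matches the fast row-tuple references I'm seeing.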
I did a bit of benchmarking: every operation I perform either takes equal time between arrays of Int and TypedTables, or the latter can be 10x faster. I am going to do a branch that uses TypedTables and see what the real gains are.
My concern is that the package seems close to abandonment. I realize that there has been tons of progress on different kinds of table storage and APIs, so Julia is making great strides in this area. Is it risky to use something that may lose support? I am not complaining about performance or about there being less interest in the discrete-simulation use case; analyzing real-world data is a vastly more significant use case. One isn't better than the other, just more or less common. I just don't want to commit to something that might not be maintained.