Low usage of TypedTables.jl

My specific question is whether TypedTables will continue to be maintained and whether it is a risk to use it for my project. The question comes at the bottom, after I explain why TypedTables seems advantageous for my use case (borne out by some preliminary benchmarking with a 2M row array).

I have changed a group-based simulation of COVID to an individual level model. In the group model, I have groups for 5 age groups by 8 disease conditions by 25 lags (days sick, which must be tracked). The group model is crazy fast: it is fast to add/subtract a few hundred or thousand people from a group when their status changes. But, there are some limitations to the scenarios and policy “models” that can be run with groups–“test and trace” is tricky, for example.

An individual level model (the original work stems from livestock infection simulations but is “state of the art” for elaborate epidemiological modelling) allows grouping of individuals by many characteristics. Test and trace is easier because we can make sure someone doesn’t get tested every day, for example. When we model decline in immunity, we can track how long it has been since someone has been vaccinated or has recovered. And much can be done with groups by filtering on a trait that many individuals will have. But, the tradeoff is running slower. To update the status of 5000 people means creating a filter/index and updating all the matching rows vs. adding 5000 to a number.
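
To make the tradeoff concrete, here is a rough sketch of the two update styles (all names and shapes are made up for illustration, not taken from the actual model code):

# Group model: moving people between states is a couple of scalar updates on a small array.
grp = zeros(Int, 5, 8, 25)        # age groups x disease conditions x lag days
grp[2, 3, 1] += 5000              # 5000 people enter condition 3 at lag 1 for age group 2
grp[2, 2, 1] -= 5000              # ...and leave condition 2

# Individual model: the same change means building a filter and touching every matching row.
pop = (status = rand(1:3, 2_000_000), cond = zeros(Int, 2_000_000))   # toy population
idx = findall(pop.status .== 2)   # re-computed every simulated day
for p in idx
    pop.cond[p] = 5               # update each matching person
end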

The population data is all integer. All the data is program generated so we don’t need categorical arrays, etc. The original encoding already provides for the categorizations, etc. Not that hard to keep track of and keep consistent because it’s all programmatic.

This is a very different use case than working with real-world live data and all of its lovely messiness. Everything must be mutable. Each day of a simulation can change anything (except agegroup). So I saw little benefit to DataFrames. Indexing strategies also don't help because any column may change for an arbitrary number of rows, so an index must be re-calculated each "day". I do precalc the agegroup index (groups by 20 years). Even though there are birthdays all the time, I can safely ignore age changes when someone goes from 59 to 60. There are already so many approximations…

I have the individual level model running at 5x the time of the group model, which seems pretty good actually. There are a few more optimizations possible: either switching to pre-allocated bit arrays to save allocations (the values change each day but the size won't, since the mask must match the size of the entire population), or finding some way to pre-allocate linear indices. Linear indices to rows are much more convenient to work with because they are smaller and easy to iterate, with the disadvantage of lots of allocations, because each "day" an index will be a different size as a different number of people are affected.
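
Roughly what those two strategies look like (a sketch with stand-in names, not the real code):

pop = (status = rand(1:4, 2_000_000), lag = zeros(Int, 2_000_000))   # toy population
infectious = 2                                                        # stand-in status code

# Option 1: a pre-allocated BitVector mask; no per-day allocation, but always full length.
mask = falses(length(pop.status))        # allocated once, before the simulation loop
mask .= pop.status .== infectious        # refreshed in place each "day"

# Option 2: a linear index; smaller and easier to iterate, but findall allocates a new
# vector every day because the number of matching people keeps changing.
idx = findall(pop.status .== infectious)
for p in idx
    pop.lag[p] += 1
end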

Still, the algorithms for the individual model are much more obvious, and I've made some algorithm improvements that got me within 5x.

The question: it seems like TypedTables offers some performance advantages:

  1. I don’t use many columns: currently 13, but most of the work is done with 4. I could easily make a 2nd population array to hold the less frequently accessed columns.

  2. Rows (e.g., people) never go away and are never re-arranged. Someone recovers or dies simply by changing a column value. So, index to “who” is stable across the entire simulation. Index to traits is stable within a single day (at least the “before” indices are).

  3. The population matrix can be large for a large locale (NY at 8.3 million rows, for example). Even with all integer data, the memory accesses to get multiple values (across columns) for the individuals who meet a filter are a real cost despite consistent stride. With TypedTables, it is very fast to reference the row tuple values (see the sketch just after this list).

  4. I did a bit of benchmarking: every operation I perform either takes equal time with arrays of Int and with TypedTables, or the latter is up to 10x faster. I am going to do a branch that uses TypedTables and see what the real gains are.
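
As a concrete illustration of point 3: indexing a TypedTables Table by row gives back a fully typed NamedTuple, so pulling several column values for one person is cheap, and the columns themselves are plain vectors (toy values here, just to show the access pattern):

using TypedTables

tbl = Table(status = [1, 2, 1], agegrp = [3, 1, 5], cond = [0, 6, 0], lag = [0, 4, 0])

row = tbl[2]             # a NamedTuple: (status = 2, agegrp = 1, cond = 6, lag = 4)
row.cond, row.lag        # field access is type-stable and essentially free
tbl.status[2] = 4        # columns are ordinary vectors, so single-cell updates are cheap too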

My concern is that the package seems close to abandonment. I realize that there has been tons of progress on different kinds of table storage and APIs, so Julia is making tons of progress in this area. Is it risky to use something that may lose support? I am not complaining about performance or about less interest in the discrete simulation use case. Analyzing data from the real world is a vastly more significant use case. One isn't better than another: just more common or less common. I just don't want to commit to using something that might not be maintained.

Thoughts?

I think you should give data frames a shot, particularly

  1. Using GroupedDataFrame as a dictionary to look up patients. A grouped data frame is basically a data.table with a sorted lookup.

  2. The new transform and select functions make the typed-ness of TypedTables less important, since there is a function barrier that gets the types right before a function on columns is applied.

Thanks.

The big advantage of DataFrames is the ability to have dissimilar column types. All this data is Integer.
Not sure GroupedDataFrame is applicable because the data changes every cycle of the simulation. But if the groups can be calculated quickly it could help. The big groups are: who is infected, and the age groups.

Type stability isn’t a huge thing in this case for any approach because all the data is Int.

I did the implementation in TypedTables and I save about 20%. It’s not huge. The changes are decently clean code. I don’t think TypedTables represents a particularly general solution but it’s “ok” for this use case because it is really just an array under the hood.

Show us the code! I’m sure we can help optimize.

w.r.t. the grouping, I mean that you should group the data so that each observation is one group. I could be wrong, but the lookup is fast, even for many, many groups. You can treat this like a dictionary.

I did an implementation using DataFrames. It was twice as slow. Not so good.

Happy to show the code though I don’t want to chop it up into minimum viable solution. It is on github and there are two functions that run in a hot loop.

I am not understanding why I'd group. That would mean that every row is in a group by itself. Are you saying that accessing a row from a dict would be faster than the array row access? It might be. Dicts are really fast. Let me know if this is what you mean.

Here are a couple of benchmarks of setting a single cell using different DataFrame syntaxes:

julia> @btime $locdat[1,1] = 1;
  19.304 ns (0 allocations: 0 bytes)

julia> @btime $locdat.status[1] = 4;
  23.627 ns (0 allocations: 0 bytes)

Here is setting a single cell using TypedTables:

julia> @btime $locdat2.status[1] = 4;
  1.461 ns (0 allocations: 0 bytes)

That’s sort of game over at 16x for a very atomic operation that happens a lot in the hot loop.

What am I missing? Again, I acknowledge updating a big matrix in a discrete event simulation is not nearly the high volume use case that analyzing data from the real world is, which is perhaps the use case where DataFrames really excels and should prioritize.

What was cool is that I only needed to change 6-10 lines of code in very obvious ways.

I must also point out that I do not use DataFrames for the "history" data of simulation runs. I use 2d arrays in a Dict. The challenge is that every simulated day I update the data in the history series (several of them). This is efficiently done by pre-allocating the arrays for the entire simulation run upfront (we know how many days a given run will be…) and then updating in place. I haven't tried this as a TypedTables Table, but my hunch is that it won't be worth it. TypedTables updates quickly for some updates, but not others: in particular, if an update spans many columns it doesn't do as well as a pure array. The history series is a case of updating across many columns (e.g., a row).

I can probably optimize that somewhat by making sure I iterate across the array in column-major order.
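
A rough sketch of that history pattern (sizes and names are illustrative only):

n_days, n_conditions = 180, 8                       # known before the run starts
history = Dict(
    "infected"  => zeros(Int, n_days, n_conditions),
    "recovered" => zeros(Int, n_days, n_conditions),
)

for day in 1:n_days
    daily_counts = rand(0:100, n_conditions)        # stand-in for the real tallies
    history["infected"][day, :] .= daily_counts     # in-place update; nothing is reallocated
end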

A good exercise and nothing wrong with the Julia data ecosystem: just horses for courses…

I am not understanding why I'd group. That would mean that every row is in a group by itself. Are you saying that accessing a row from a dict would be faster than the array row access? It might be. Dicts are really fast. Let me know if this is what you mean.

What I mean is that if you are looking up details for a patient based on some variable, say, patient_id, then you can either do

df[df.patient_id .== "id123", :]

which is slow. However you can cache that lookup with

gd = groupby(df, :patient_id)
gd[(; patient_id = "id123")]

which will be faster.

But maybe I interpreted your question wrong and this isn’t what you are trying to do.

Without an MWE it’s hard to know how to improve your code. I see what you mean about updating individual entries in a data frame. I think the cause of your problem is that Julia doesn’t know the type of locdat.status, so it’s calling an inefficient method. Plus there is some stuff under the hood to make sure that :status is a column in the data frame.

TypedTables doesn’t have this problem because both the names and types of the table are inferrable.

The solution is to do what DataFramesMeta does and wrap your operations in functions that take in the vectors. Maybe this doesn't work for your use-case, but if you make a function

function replace_ind!(x)
    x[1] = 400
end

then replace_ind!(df.a) should be just as fast as working with a normal vector.

There are other utilities that can help in a tight loop; Tables.namedtupleiterator creates a type-stable iterator through rows, for example.
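
For example, a minimal sketch of that iterator (the column names here are made up):

using DataFrames, Tables

df = DataFrame(status = [1, 2, 1], lag = [0, 3, 0])

# each element is a fully typed NamedTuple, so the loop body is type-stable
for row in Tables.namedtupleiterator(df)
    row.status == 2 && println(row.lag)   # rows are read-only; writes still go through the column vectors
end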

No: you understand. The two major functions that do the work start with a filter.

If the groupby will execute quickly, then this could be worth doing. And your example suggests that it is the caching that helps, because patient_id would be unique to each person. I'll try it out.

What is the type of gd in your example?

I effectively do the same thing with a findall to create a linear index to all the rows satisfying the filter and then I iterate over the filter because I will touch each “qualifying” person.

It looks like I'd need a "replacement function" for each column I want to access; that's not prohibitive because there are only 13 columns. A bit odd, but not really a problem.

It would have to be a meaningful performance gain w.r.t. TypedTables though, as there really is no big payoff to DataFrames for this use case. I might see if some of your suggestions can work somehow with the TypedTables approach. I'll admit that TypedTables is a somewhat odd package that might not meet a lot of people's needs (and can't be performant with "zillions" of columns), but it fits this use case and is nicely lightweight and mostly uses array syntax.

Thanks for all of your suggestions and following up.

It's a GroupedDataFrame; you can see that I do groupby(df, :patient_id), which returns a grouped data frame.

No, I'm not saying that, exactly. I'm saying that it makes sense to isolate the entirety of your performance-critical code into a function that takes in only vectors.

You can also put all your performance-critical data into a named tuple of vectors or something similar. My point is ultimately that DataFrames is a very convenient package, and there are lots of ways to "leave" DataFrames when you need to for your performance-critical code but keep it for all the other more generic table operations.

Thanks for offering this help. I've pasted the critical segments/functions below. Let me know if a gist would be better. Or you can clone the repository (ugh–too much work). I am pasting the DataFrames version. Note that the TypedTables version is trivially different: the data structure is different and indexing into it has slightly different syntax. I would say that the TypedTables approach corresponds pretty closely to your suggestion: it is a named tuple of columns. Usually, it's pretty fast. In one place (not highlighted below, but it is in the code) I continued to use an array: for some reason updating 2 columns (part of a row) was surprisingly slow.

Here is where the population data matrix is created:

"""
Pre-allocate and initialize population data for one locale in the simulation.
"""
function pop_data(pop; age_dist=age_dist, intype=T_int[], cols="all")

    if cols == "all"
        status = fill(intype(unexposed), pop) # Array{Int,1}(undef, popsize)
        parts = apportion(pop, age_dist)
        agegrp = reduce(vcat,[fill(i, parts[i]) for i in agegrps])
        cond = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        lag = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        recov_day = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        dead_day = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        cluster = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        vax = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        vax_day = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        test = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        test_day = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        quar = zeros(intype, pop)
        quar_day = zeros(intype, pop)

        # dat = hcat(status, agegrp, cond, lag, cluster, recov_day, dead_day, vax, 
        #     vax_day, test, test_day, quar, quar_day)

        # use DataFrame
        dat = DataFrame(status=status, agegrp=agegrp, cond=cond, lag=lag, cluster=cluster, 
            recov_day=recov_day, dead_day=dead_day, vax=vax, vax_day=vax_day, 
            test=test, test_day=test_day, quar=quar, quar_day=quar_day)
    elseif cols == "track"
        status = fill(intype(unexposed), pop) # Array{Int,1}(undef, popsize)
        parts = apportion(pop, age_dist)
        agegrp = reduce(vcat,[fill(i, parts[i]) for i in agegrps])
        cond = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 
        lag = zeros(intype, pop)  # Array{Int,1}(undef, popsize) 

        # dat = hcat(status, agegrp, cond, lag)
        dat = Table(status=status, agegrp=agegrp, cond=cond, lag=lag)
    else
        @error "Wrong choice of cols in pop_data: $cols"
    end    

    return dat       
end

Note that the only change to use TypedTables is to replace “DataFrame” with “Table”.

There are two critical functions in the simulation loop:
spread!: people who are infected may spread the virus to others. The infected folks contact others. These other people may be unexposed, infected or recovered. Many of the contacts are fleeting; some are consequential (assume > 15 minutes and < 6 feet). So, we calculate which contacts are touches. Of the touches, not all result in virus transmission: the person touched must be unexposed and a binomial prob. draw determines if that touch transmits the virus. The result is that an unexposed person has become newly infected. spread! is complicated by the option to run cases that split the population into people complying and not complying with social distancing. So, I show the inner function _spread! that actually implements this process.

transition!: People who have been infected with the virus advance through some disease stages (nil, mild, sick, severe) on various paths/times and result in dying or recovering.

Again, thanks for your offer of suggestions,
Lewis

Here is _spread!:

function _spread!(locdat, spread_idx, contactable_idx, contact_factors, touch_factors, riskmx, shape, density_factor)

    # how many contacts?

    # println("size spread_idx $(size(spread_idx, 1))")
    # error("that's all....")

    # init = zeros(Int, size(spread_idx,1),2) # second column for lag of the spreader
    # spreaders_to_contacts = Table(nc = init[:,1], lag = init[:,2])
    spreaders_to_contacts = zeros(Int, size(spread_idx,1), 2) # second column for lag of the spreader

    @inbounds @views for i in 1:size(spread_idx, 1)  # for each spreader  # size(spreaders_to_contacts, 1)
        # cond = locdat[spread_idx[i], cpop_cond]-4
        # agegrp = locdat[spread_idx[i], cpop_agegrp]
        p = spread_idx[i]
        # p_tup = locdat[p]
        p_tup = locdat[p,:]

        thiscond = p_tup.cond - 4  # map 5-8 to 1-4
        thisagegrp = p_tup.agegrp
        thislag = p_tup.lag

        scale = density_factor * contact_factors[thiscond, thisagegrp]

        # for typedtable implementation
        # spreaders_to_contacts.nc[i] = round(T_int[],rand(Gamma(shape, scale))) # cnt of contacts for 1 spreader
        # spreaders_to_contacts.lag[i] = thislag

        # for array implementation
        spreaders_to_contacts[i, 1] = round(Int,rand(Gamma(shape, scale))) # cnt of contacts for 1 spreader
        spreaders_to_contacts[i, 2] =  thislag
    end

    # n_contacts = sum(spreaders_to_contacts.nc)     # sum(spreaders_to_contacts[:,1])   sum(spreaders_to_contacts.nc) 
    n_contacts = sum(spreaders_to_contacts[:,1])  

    n_contactable = size(contactable_idx, 1)


    # assign the contacts 
    n_target_contacts = min(n_contacts, n_contactable)

    contact_people = sample(contactable_idx, n_target_contacts, replace=false)


    # which contacts are consequential touches? which touched get infected?
    n_touched = 0
    n_newly_infected = 0

    stop = 0
    @inbounds for i in 1:size(spread_idx,1)  # nc=numContacts, lag=lag of spreader

        # nc = spreaders_to_contacts.nc[i]   # spreaders_to_contacts.nc[i]  spreaders_to_contacts[i,1]
        # lag = spreaders_to_contacts.lag[i]  #  spreaders_to_contacts.lag[i]  spreaders_to_contacts[i,2]

        nc = spreaders_to_contacts[i,1]
        lag = spreaders_to_contacts[i,2]

        start = stop + 1; stop = stop + nc
        # stop = stop > n_target_contacts ? n_target_contacts : stop  
        # @assert stop <= n_target_contacts

        # println("start $start    stop $stop  nc $nc")

        @inbounds @views for person in contact_people[start:stop]
            # person's characteristics
            # p_tup = locdat[person]
            p_tup = locdat[person,:]

            status = p_tup.status  # TODO below crap needs to be fixed
            agegrp = p_tup.agegrp
            cond = p_tup.cond
            lookup = if status == unexposed
                        1  # row 1
                     elseif status == recovered
                        2
                     else
                        cond - 2  # 5:8 - 2 -> 3:6
                     end

            # touch outcome
            touched = rand(Binomial(1, touch_factors[lookup, agegrp]))
            n_touched += touched

            # infection outcome
            if (touched == 1) && (status == unexposed)    # (characteristic == unexposed)
                prob = riskmx[lag, agegrp]
                newly_infected = rand(Binomial(1, prob))
                if newly_infected == 1
                    locdat.cond[person] = nil # nil === asymptomatic or pre-symptomatic
                    locdat.status[person] = infectious
                    # lag remains zero because person was unexposed; transition! function updates lag
                end
                n_newly_infected += newly_infected
            end
        end
    end

    return n_contacts, n_touched, n_newly_infected
end

Here is transition!

"""
    transition!(dat, locale, dt_dict)

People who have become infectious transition through cases from
nil (asymptomatic) to mild to sick to severe, depending on their
agegroup, days of being exposed, and some probability. Finally,  
they move to recovered or dead.

Works for a single locale.
"""
function transition!(dat, locale::Int, dt_dict)

    locdat = locale == 0 ? dat : dat[locale]

    dt = dt_dict["dt"]  # decision tree for illness changes over time to recovery or death

    @inbounds for p in findall(locdat.status .== infectious)            
        # p_tup = locdat[p]  # returns named tuple of the columns for row at p (for person)
        p_tup = locdat[p,:]  # returns named tuple of the columns for row at p (for person)

        dtkey = (p_tup.lag, p_tup.cond)

        node = get(dt[p_tup.agegrp], dtkey, ())

        @inbounds if isempty(node)  
            # @assert p_tup.lag < laglim "Person made it to last day and was not removed:\n     $p_tup\n"
            locdat.lag[p] += 1
        else
            choice = rand(Categorical(node["probs"]), 1)
            tocond = node["outcomes"][choice][]

            if tocond in (dead, recovered)  # change status, leave cond and lag as last state before death or recovery
                
                locdat.status[p] = tocond  # change the status
                
                if tocond == dead  # set the day of the status change
                    locdat.dead_day[p] = ctr[:day]
                else
                    locdat.recov_day[p] = ctr[:day]
                end

            else   # change disease condition
                locdat.cond[p] = tocond   # change the condition
                locdat.lag[p] += 1  
            end    
        end

    end  # for p
end

Thanks for this,

But this is a lot of code for me to read through.

It doesn't look like you implemented my advice above; you are still indexing into a DataFrame a lot when your functions could be operating on vectors.

Additionally, note that df[5, :] is not a very efficient object: it is a DataFrameRow, not a NamedTuple, and accessing its fields is not type-stable. Fortunately it can be modified, while a NamedTuple cannot.

I never said I did implement it. Just that TypedTables passes a collection of vectors.

I told you it was a bit hard and this is a tiny subset.

A dataframe row is indeed inefficient.

I wanted to see what your additional suggestions were.

So, I need to convert that dataframe into a collection of vectors each pass through the loop? Or will they alias the dataframe columns so I can do it once? (if so, then that’s a pretty good strategy…)

It looks like you are only ever using a few columns at a time, right? For example, in transition!, you seem to only use the columns :lag, :cond, :status, :agegrp, :dead_day, and :recov_day. That's 6 columns.

Write a function that takes in 6 vectors and performs the loop on those 6 vectors. Call the function on

transition!(df.lag, df.cond, df.status, df.agegrp, df.dead_day, df.recov_day)

updating all the vectors in-place.
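
Something like this minimal sketch (the body and names are placeholders; the real transition! also needs the decision tree and the day counter):

# the hot loop only ever sees plain vectors, so Julia generates fully typed code
function transition_barrier!(status::Vector{Int}, lag::Vector{Int}, infectious::Int)
    @inbounds for p in findall(status .== infectious)
        lag[p] += 1                   # placeholder for the real per-person logic
    end
    return nothing
end

status = [1, 3, 3, 2]                 # toy data
lag    = [0, 4, 9, 0]
transition_barrier!(status, lag, 3)   # with a DataFrame: transition_barrier!(df.status, df.lag, infectious)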

That’s where I am headed. Can I still pass in df and have the function reference the columns? Or must the caller do that? Seems like you are saying that the receiving functions should not be aware of the dataframe–only of vectors.

Yes, I do need to update the "rows": that's kinda what a simulation does. And yes, it is slower for both DataFrames and TypedTables to work with rows. Question: is there really any extra price to doing df.cond[person_id] = x; df.lag[person_id] = y; etc.? Seems like I index each vector to the desired location (no longer a row). That should be really cheap (a single-digit number of nanoseconds), so there is no way to save that indexing time even though it is the same position in each vector.

BTW, a moderately efficient way to treat a row as a row is to use arrays, which is where I started. But even though going across a row of an array is a simple calculation of the strides (under the hood) to get arr[i, :], it still takes time, which is why the vector approach is a bit better, especially for a lot of rows.

I will say that the column referencing syntax is identical to typedtables so the code will tend to converge.

Definitely have a function barrier; the function should not know about the DataFrame.

The important part is not the cost of the updating, but the ability for Julia to generate efficient code in general, which it can't do if passed only a data frame.

Can I simplify my life a little by passing a named tuple of vectors? Can’t change the NT, but I can change the container the NT.v refers to. Now I only use 7 columns but in more complex scenarios (people recover and get sick again; people get tested and vaccinated) I’ll use 17 columns. It’s about a nanosecond to deref an NT with . notation.

I am going to try these things since I am on the right path and won’t keep asking. Will report back.

Yes, that is another good idea. You can do Tables.columntable(df) to get a named tuple. I mentioned that above.

OK. That was easy. I turn the DataFrame into a "column table" once before the simulation loop starts. Because the vectors of actual data are aliased across the data structures, we only need to do this once. Updating the values in the vectors updates the DataFrame.
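
For anyone else reading, the conversion is roughly this (toy columns; the point is that the named tuple holds the DataFrame's own column vectors, not copies):

using DataFrames, Tables

df = DataFrame(status = [1, 1, 3], lag = [0, 0, 5])

locdat = Tables.columntable(df)     # NamedTuple of the DataFrame's column vectors
locdat.status === df.status         # true: same vector, so updates are visible in both
locdat.status[1] = 3                # hot-loop style update through the named tuple
df.status[1]                        # 3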

I then pass the namedtuple of vectors to the spread! and transition! functions. Never access (or set) anything from a representation of a row. Just index into the appropriate vector.

The timings for the DataFrames approach with this modification are identical to the TypedTables implementation. At the level of the functions in the hot loop where the real work is done, they are essentially the same: both are named tuples of vectors. Actually, I was initially off a little in the TypedTables implementation because I was still extracting a named tuple per row (faster than a DataFrameRow, but still slow), which this exercise enabled me to catch.

In fact, the syntax is identical:

x = dat.column_name[row_idx] # get a value
dat.column_name[row_idx] = x # set a value

This is a good performance enhancement for using DataFrames with changing values. If I were directly using the DataFrame then this would be great. I don't; I could–I just don't use the data in such a way. But the flexibility is there.

Learned some important things. Not sure anyone else will by wading through this, but it was well worth it.

Thanks @pdeffebach
