Manipulating Panel Data in Julia

Hey everyone,

I was looking for a straightforward way to do common manipulations in panel data within Julia. For instance, let’s say I have a panel of many countries and years. I would like to calculate the growth rate for a variable country by country. What is the most convenient way to do this? Is there a way to do this using packages such as Query.jl?

I just found something that seems pretty close in R: A new package for panel data analysis in R | R-bloggers

Would be great if Julia offered something similar!

Thanks a lot!

Ilja

Is there any reason you would rather chose panelr over using dplyr directly? For Julia, I would look into DataFramesMeta.

At the moment I mainly need a lead/lag operator that respects the panel structure, i.e., applies the lag operator for each id in the panel separately, generates NA where necessary, and puts everything back together. If that is possible with DataFramesMeta directly, that would be great! I guess that is also the added convenience of panelr over deplyr.

julia> using DataFrames, ShiftedArrays

julia> df = DataFrame(stock = ["a", "a", "a", "b", "b", "b"], time = repeat(1:3, 2), price = rand(100:120, 6))
6Γ—3 DataFrame
 Row β”‚ stock   time   price
     β”‚ String  Int64  Int64
─────┼──────────────────────
   1 β”‚ a           1    108
   2 β”‚ a           2    105
   3 β”‚ a           3    107
   4 β”‚ b           1    101
   5 β”‚ b           2    106
   6 β”‚ b           3    109

julia> transform!(groupby(df, :stock), :price => (x -> log.(x) .- log.(lag(x))) => :return)
6Γ—4 DataFrame
 Row β”‚ stock   time   price  return
     β”‚ String  Int64  Int64  Float64?
─────┼───────────────────────────────────────
   1 β”‚ a           1    108  missing
   2 β”‚ a           2    105       -0.0281709
   3 β”‚ a           3    107        0.0188685
   4 β”‚ b           1    101  missing
   5 β”‚ b           2    106        0.0483186
   6 β”‚ b           3    109        0.0279088
2 Likes

Thanks, that worked :slight_smile:

Just one last question: This method does not work when there are gaps in the variable time, correct?

Well, it assumes that subsequent rows represent one time step. If you have missing observations which aren’t in the data (i.e. not represented by missing but absent altogether) you can construct the range of time steps first (something like minimum(df.date):Day(1):maximum(df.date)) and then leftjoin your data onto that, which will generate the missing observations. Any return (in my example) where one of the two days used to compute it is missing will then be missing.

3 Likes

At the dangers of over-promoting, but I’ve written a package that takes care of some the issues you mention @IljaK91: PanelDataTools.jl

It is at it’s core just a wrapper around some existing packages and solutions, but from my tests it deals well with for example missing times, use DateTime to keep track of time instead of row numbers (which is what shifted arrays basically does).