Multilevel DataFrame

Hi all,

probably the question was asked before, but I didn’t find it here in the discourse.

Basically, I’m importing multiple tables (.csv) with DataFrames.jl to Julia. In my example, each table represents a robot with coordinates, timestamps, etc.
To keep my code simple and allow simple stack processing for all robots I’d like to add a level and rearrange the DataFrames in one DataFrame and call the robots by a hash key.

e.g. lab.robot1.x_pos

is this even possible with the DataFrames package? Is it the right tool?

Thanks in advance •ᴗ•

Here is an example what I understand you need:

julia> using DataFrames

julia> robots = [DataFrame(x=rand(2), y=rand(2)) for i in 1:3] # 3 robots
3-element Vector{DataFrame}:
 2×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.832353  0.395642
   2 │ 0.250032  0.403825
 2×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.308752  0.560986
   2 │ 0.420184  0.595701
 2×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼────────────────────
   1 │ 0.860213  0.633597
   2 │ 0.785208  0.829677

julia> vcat(robots..., source=:robot_id)
6×3 DataFrame
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1
   3 │ 0.308752  0.560986         2
   4 │ 0.420184  0.595701         2
   5 │ 0.860213  0.633597         3
   6 │ 0.785208  0.829677         3

julia> res = reduce(vcat, robots, source=:robot_id) # alternative syntax that might be faster if you have millions of robots
6×3 DataFrame
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1
   3 │ 0.308752  0.560986         2
   4 │ 0.420184  0.595701         2
   5 │ 0.860213  0.633597         3
   6 │ 0.785208  0.829677         3

Now if you want to pick robot by :robot_id do:

julia> gdf = groupby(res, :robot_id)
GroupedDataFrame with 3 groups based on key: robot_id
First Group (2 rows): robot_id = 1
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1
⋮
Last Group (2 rows): robot_id = 3
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.860213  0.633597         3
   2 │ 0.785208  0.829677         3

julia> gdf[(robot_id=1,)] # passing a NamedTuple with robot id
2×3 SubDataFrame
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1

julia> gdf[(1,)] # the same, but passing a Tuple with robot id
2×3 SubDataFrame
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1

In this case group number and robot id are the same so you could write:

julia> gdf[1]
2×3 SubDataFrame
 Row │ x         y         robot_id
     │ Float64   Float64   Int64
─────┼──────────────────────────────
   1 │ 0.832353  0.395642         1
   2 │ 0.250032  0.403825         1

to get the first group, but in general robot id could be any identifier, eg.:

julia> reduce(vcat, robots, source=:robot_random_id => rand(UInt64, length(robots)))
6×3 DataFrame
 Row │ x         y         robot_random_id
     │ Float64   Float64   UInt64
─────┼──────────────────────────────────────────
   1 │ 0.832353  0.395642   8303239618947412160
   2 │ 0.250032  0.403825   8303239618947412160
   3 │ 0.308752  0.560986   2175806809747680279
   4 │ 0.420184  0.595701   2175806809747680279
   5 │ 0.860213  0.633597  12898194048670657353
   6 │ 0.785208  0.829677  12898194048670657353

This is a solution if you want to store all robots in a single data frame (this has many benefits - e.g. later aggregation is much easier if you do so). But maybe you would prefer something else. If this is the case please comment.

On the other hand - if you want to use robot information for compute intensive operations that are not aggregations (e.g. in simulation, where you would access intividual elements of vectors) then most likely you will be better off by not using DataFrame to store robot information but rather e.g. Tables.columntable (i.e. NamedTuple of vectors) as it will be type stable and faster in such applications. DataFrame is designed for efficient working with whole columns.

2 Likes

Just to add to Bogumil’s answer, if you have the data in a bunch of csv files with the same format you can simply read them into a single DataFrame by doing:

CSV.read(vector_of_filepaths_of_csvs, DataFrame; source = vector_of_names)

here the source kwarg will add a column source to your final DataFrame identifying which csv file the row in the table came from.

3 Likes

Wow thanks Bogumil for the detailed answer! This has helped me a lot to understand a lot of things :slight_smile: But as you correctly guessed in the last section, I’m in the situation where I use robot information for computationally intensive operations … so I think I’ll stick to your suggestion using a NamedTuple and try to get that working :blush:

If you need to do compute intensive operations then do:

using CSV
using Tables
robots = [CSV.read(file, Tables.columntable) for file in vector_of_file_names]

and then each robot has a numeric id in this vector (1, 2, 3 etc.), so you can write e.g.

robots[1].x_pos[end]

To get a last x_pos of robot 1.

This is type stable and efficient pattern for your work. In particular it is important to number robots from with 1 and consecutive natural numbers as this will be fastest way to look them up.

2 Likes

How about this?

using DataFrames

lab = DataFrame(randn(4,10), sort(["robot$i.x" for i in 1:5] ∪ ["robot$i.y" for i in 1:5]))
julia> lab = DataFrame(randn(4,10), sort(["robot$i.x" for i in 1:5] ∪ ["robot$i.y" for i in 1:5]))
4×10 DataFrame
 Row │ robot1.x   robot1.y   robot2.x   robot2.y   robot3.x   robot3.y   robot4.x   robot4.y     robot5.x   robot5.y  
     │ Float64    Float64    Float64    Float64    Float64    Float64    Float64    Float64      Float64    Float64   
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1.18797   -0.986746  -2.13126    0.410994  -0.382896   1.57188    1.2566     0.23703     -1.63068    1.1288
   2 │ -0.5893    -0.812499  -0.172445  -0.136242  -1.44124   -0.427348   0.717058   2.1018       0.216043   0.102947
   3 │  1.49847   -0.342672  -0.377164  -0.7423     1.35134    0.39429    1.43345   -1.6108      -0.357507   0.915174
   4 │ -0.213267  -0.733188   0.257646  -1.04185   -0.41242   -1.27896   -0.964437  -0.00448036   0.533967  -0.207548

julia> lab."robot1.x"
4-element Vector{Float64}:
  1.1879717987881062
 -0.5893002692628563
  1.4984705071964186
 -0.2132672510579861

Maybe this is not a ‘natural implementation’. I know, it looks like kinda joke, but actually very flexible, easy to use, simple and fast in special cases. At least, you can handle this as one variable lab so many operations what you know could be applied.

1 Like

DataFrames.jl is designed to support this pattern without a problem (you can even have millions of robots). And you can write, e.g.:

robot_idx = 10
lab[!, "robot$(robot_idx).x"

programmatically.

The downside is that this code is not going to be type stable (and as you say - in some cases it is not a problem, but in the other it might be a problem and you would need to use function-barrier technique) so there is a trade-off here.

Thanks for all the informative answers! I guess I’ll stick to that one but store the columntables in a dictionary since the robots have actually names and I don’t want to call them by index, but just by their name. Not sure if this is nice from a programming point of view, but this way I have all the functions i was hoping for :smiling_face:

Thank you, folks!

Yes - this is a correct approach. The only thing is that it will be a bit slower.