Data structure for convenient access to tabular data

bbrunaud · February 12, 2023, 5:31pm

Hello community

I use Julia to build JuMP models. These models usually have indexed parameters, which I would like to store in DataFrames and access conveniently when writing the model. Think of a DataFrame where the first few columns act as the index and every other column is a property. What data structure do you recommend to keep things organized? So far I have had to construct a dictionary out of every column, which is not very clean.

This is my use case

julia> df
3×3 DataFrame
 Row │ Material  StdCost  LeadTime
     │ Symbol    Int64    Int64
─────┼─────────────────────────────
   1 │ A              10         1
   2 │ B              20         2
   3 │ C              30         3

In this example I want a convenient way of accessing the StdCost for material C. I figured out that the indexing can be handled by a GroupedDataFrame, but the access is still not pretty:

julia> gdf = groupby(df, :Material)
GroupedDataFrame with 3 groups based on key: Material
First Group (1 row): Material = :A
 Row │ Material  StdCost  LeadTime
     │ Symbol    Int64    Int64
─────┼─────────────────────────────
   1 │ A              10         1
⋮
Last Group (1 row): Material = :C
 Row │ Material  StdCost  LeadTime
     │ Symbol    Int64    Int64
─────┼─────────────────────────────
   1 │ C              30         3

# The notation I have to use (not pretty)
julia> gdf[(:C,)].StdCost[1]
30

# The notation I would like to have
julia> gdf[:C].StdCost
30

Any suggestions on how to solve this?

thanks!

rocco_sprmnt21 · February 12, 2023, 6:05pm

I don’t think what you are asking for is possible in the form you expect.
In general, each group in the result of groupby is a SubDataFrame that can contain multiple rows.
Therefore selecting a column returns a vector of values, which, as a special case, can have length 1.

If you know for sure that your groups are single-row, you can use the only() function. Otherwise, if you need just one value, for example the first one, you could use findfirst():

using DataFrames
m='A':'C'
sc=10:10:30
lt=1:3
df=DataFrame(;m,sc,lt)

only(df[m.=='C',:sc])

df[findfirst(==('C'),df.m),:sc]
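
To make the difference between the two approaches concrete, here is a minimal sketch that uses only Base Julia. The plain vectors `material` and `stdcost` are stand-ins for the DataFrame columns (the names are assumptions made for this illustration), and `only` requires Julia 1.4 or later.

```julia
# Plain vectors standing in for the Material and StdCost columns
# (hypothetical names, chosen for this illustration only).
material = [:A, :B, :C]
stdcost  = [10, 20, 30]

# `only` returns the single element of a collection and throws otherwise.
c_cost = only(stdcost[material .== :C])

# `findfirst` returns the index of the first match, or `nothing` if there is none.
idx = findfirst(==(:C), material)
c_cost_first = stdcost[idx]

# With a duplicated key, `only` raises an ArgumentError,
# while `findfirst` quietly picks the first matching row.
dup_material = [:A, :C, :C]
dup_cost     = [10, 30, 35]

only_failed = try
    only(dup_cost[dup_material .== :C])
    false
catch err
    err isa ArgumentError
end

first_cost = dup_cost[findfirst(==(:C), dup_material)]
```

So `only` doubles as a check that the key really is unique, whereas `findfirst` is the more permissive choice when duplicates are acceptable.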

bkamins · February 12, 2023, 6:20pm

I think in your case, if you are sure that each group has exactly one row, it is better to use a Dict of NamedTuples:

julia> d = Dict("a" => (x=1, y=2), "b" => (x=3, y=4))
Dict{String, NamedTuple{(:x, :y), Tuple{Int64, Int64}}} with 2 entries:
  "b" => (x = 3, y = 4)
  "a" => (x = 1, y = 2)

julia> d["a"].x
1

julia> d["b"].y
4
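
Applied to the table from the original question, the same idea might look like the sketch below. The variable name `params` is an assumption for illustration, and the values are typed in by hand from the example table rather than read from a DataFrame.

```julia
# Parameters for the three materials from the original table,
# keyed by material and stored as NamedTuples.
params = Dict(
    :A => (StdCost = 10, LeadTime = 1),
    :B => (StdCost = 20, LeadTime = 2),
    :C => (StdCost = 30, LeadTime = 3),
)

c_cost = params[:C].StdCost        # 30
c_lead = params[:C].LeadTime       # 3

# Looping over the index set, as one would when writing model expressions.
total_cost = sum(params[m].StdCost for m in keys(params))   # 60
```

This gives essentially the `x[:C].StdCost` notation asked for, at the cost of assuming one row per key.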

bbrunaud · February 13, 2023, 3:02am

Thank you! This is what I was looking for. Is there a way to construct that from a DataFrame without iterating over the rows?

bkamins · February 13, 2023, 8:11am

Is this what you want (assuming column :a has unique values)?

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> Dict(row.a => copy(row[Not(:a)]) for row in eachrow(df))
Dict{Int64, NamedTuple{(:b, :c), Tuple{Int64, Int64}}} with 3 entries:
  2 => (b = 5, c = 8)
  3 => (b = 6, c = 9)
  1 => (b = 4, c = 7)
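
As a variation, the lookup can also be built from the columns directly. The sketch below uses only Base Julia and assumes the columns are available as plain vectors (for a DataFrame these would be `df.Material`, `df.StdCost` and `df.LeadTime`). It still visits every row through `zip`, and the variable names are hypothetical.

```julia
# Column vectors standing in for the DataFrame columns of the original example.
material = [:A, :B, :C]
stdcost  = [10, 20, 30]
leadtime = [1, 2, 3]

# A Dict silently keeps the last value for a repeated key, so check uniqueness first.
allunique(material) || error("key column contains duplicates")

# Pair each key with a NamedTuple of its properties.
params = Dict(m => (StdCost = c, LeadTime = l) for (m, c, l) in zip(material, stdcost, leadtime))

c_cost = params[:C].StdCost
```

The explicit `allunique` check matters because a duplicated key would otherwise overwrite an earlier row without any warning.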

rocco_sprmnt21 · February 13, 2023, 1:26pm

Another possibility is to use the IndexedTables package:

using IndexedTables
m='A':'C'
sc=10:10:30
lt=1:3


t2 = ndsparse((;m),(;sc,lt))

t2['C'].sc
