Data structure for convenient access to tabular data

Hello community

I use Julia to build JuMP models. What you usually have is indexed parameters that I would like to store in DataFrames and have convenient access when writing models. Think of a DataFrame where you have the first few columns acting as index and each other column as the properties. What data structure do you recommend to keep things organized. So far I had to construct a dictionary out of every column, which is not very clean.

This is my use case

julia> df
3ร—3 DataFrame
 Row โ”‚ Material  StdCost  LeadTime
     โ”‚ Symbol    Int64    Int64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ A              10         1
   2 โ”‚ B              20         2
   3 โ”‚ C              30         3

In this example I want a convenient way of accessing the StdCost for material C. I figured out that the indexing can be handled by a GroupedDataFrame, but the access is still not pretty

julia> gdf = groupby(df, :Material)
GroupedDataFrame with 3 groups based on key: Material
First Group (1 row): Material = :A
 Row โ”‚ Material  StdCost  LeadTime
     โ”‚ Symbol    Int64    Int64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ A              10         1
โ‹ฎ
Last Group (1 row): Material = :C
 Row โ”‚ Material  StdCost  LeadTime
     โ”‚ Symbol    Int64    Int64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ C              30         3

# The Notation I have to make (not pretty)
julia> gdf[(:C,)].StdCost[1]
30

# The Notation I would like to have
julia> gdf[:C].StdCost
30

Any suggestions on how to solve this?

thanks!

I donโ€™t think what you ask for is possible, in the form you expect.
In general, the result of groupby is a subdataframe consisting of multiple rows.
Therefore selecting a column returns as a result a vector of values, which as a special case, can be of length 1.

If you know for sure that your groups are single row you can use the only() function. Otherwise, if you only need only one value, for example the first one, you could use findfirst()

using DataFrames
m='A':'C'
sc=10:10:30
lt=1:3
df=DataFrame(;m,sc,lt)

only(df[m.=='C',:sc])

df[findfirst(==('C'),df.m),:sc]
1 Like

I think in your case, if you are sure that each group has one row it is better to use Dict of NamedTuple:

julia> d = Dict("a" => (x=1, y=2), "b" => (x=3, y=4))
Dict{String, NamedTuple{(:x, :y), Tuple{Int64, Int64}}} with 2 entries:
  "b" => (x = 3, y = 4)
  "a" => (x = 1, y = 2)

julia> d["a"].x
1

julia> d["b"].y
4
3 Likes

Thank you!, this is what I was looking for. Is there a construct that from a DataFrame without iterating over the rows?

Is this what you want (assuming column :a has unique values)?

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3ร—3 DataFrame
 Row โ”‚ a      b      c
     โ”‚ Int64  Int64  Int64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1      4      7
   2 โ”‚     2      5      8
   3 โ”‚     3      6      9

julia> Dict(row.a => copy(row[Not(:a)]) for row in eachrow(df))
Dict{Int64, NamedTuple{(:b, :c), Tuple{Int64, Int64}}} with 3 entries:
  2 => (b = 5, c = 8)
  3 => (b = 6, c = 9)
  1 => (b = 4, c = 7)

another possibility is to use the IndexedTables package

using IndexedTables
m='A':'C'
sc=10:10:30
lt=1:3


t2 = ndsparse((;m),(;sc,lt))

t2['C'].sc