Is there a DictTables.jl?

I have a problem in which the data is tabular, but one of the columns holds the keys for the rest of each row. For example:

│ id     │ x2    │ y2    │
│ String │ Int64 │ Int64 │
├────────┼───────┼───────┤
│ a      │ 1     │ 4     │
│ b      │ 2     │ 5     │
│ k      │ 3     │ 6     │
│ w      │ 3     │ 5     │
│ c      │ 4     │ 5     │

This data can be represented best as a Dict of Dicts. For example, the first row would be:

"a" => Dict(:x2 => 1 , :y2 => 4)

However, since the data is tabular, supporting the Tables.jl interface would help in many situations. I was wondering whether the Tables.jl interface could be extended to support the concept of keys.
It should have an iterator similar to that of Dict:

for (key, rest_of_the_row) in dicttable
    # iterate over the keys and the rest of each row
end
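As a rough sketch of what such an iterator could look like, here is a hypothetical `DictTable` that subtypes `AbstractDict` and stores each row as a NamedTuple. Nothing here comes from Tables.jl or any existing package: the type, its constructor signature, and the field name `rows` are assumptions made for illustration only.

```julia
# Hypothetical type, not part of Tables.jl or any package.
struct DictTable{K, R<:NamedTuple} <: AbstractDict{K, R}
    rows::Dict{K, R}
end

# Build from a vector of keys plus the remaining columns as keyword arguments.
function DictTable(ids::AbstractVector; columns...)
    cols = values(columns)   # NamedTuple of column vectors
    rows = [NamedTuple{keys(cols)}(map(c -> c[i], values(cols))) for i in eachindex(ids)]
    return DictTable(Dict(zip(ids, rows)))
end

# Forward the basic dictionary operations to the wrapped Dict.
Base.length(t::DictTable) = length(t.rows)
Base.iterate(t::DictTable, state...) = iterate(t.rows, state...)
Base.get(t::DictTable, key, default) = get(t.rows, key, default)
Base.getindex(t::DictTable, key) = t.rows[key]

dicttable = DictTable(["a", "b", "k"]; x2 = [1, 2, 3], y2 = [4, 5, 6])

# Iteration order follows the underlying Dict, so it is not the insertion order.
for (key, rest_of_the_row) in dicttable
    println(key, " => ", rest_of_the_row)
end

dicttable["b"].y2   # 5
```

Because the rows live in a `Dict`, the original row order is lost; a real implementation would probably want to keep the columns and build the key index on the side.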

A problem this interface would let me solve naturally:

# `DictTable` is the proposed type; its first argument names the key column.

# Suppose I want to merge `data` into `old_data` such that:
# - only rows whose `id` already appears in `old_data` are kept,
# - the output table includes the columns `x1` and `x2`, and
# - it gets a new column `z` whose value in each row is `y2 + y1`.
data = DictTable(:id, :id => ["a", "b", "k", "w", "c"], :x2 => [1, 2, 3, 3, 4], :y2 => [4, 5, 6, 5, 5])

old_data = DictTable(:id, :id => ["c", "b", "a"], :x1 => [0, 1, 2], :y1 => [4, 5, 6])
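Since `DictTable` does not exist, here is a sketch of the same merge using plain `Dict`s of NamedTuples instead. It assumes "merge" means keeping only the ids present in both tables; the variable name `merged` is made up for the example.

```julia
# Each id maps to the rest of its row.
data = Dict(
    "a" => (x2 = 1, y2 = 4),
    "b" => (x2 = 2, y2 = 5),
    "k" => (x2 = 3, y2 = 6),
    "w" => (x2 = 3, y2 = 5),
    "c" => (x2 = 4, y2 = 5),
)
old_data = Dict(
    "c" => (x1 = 0, y1 = 4),
    "b" => (x1 = 1, y1 = 5),
    "a" => (x1 = 2, y1 = 6),
)

# Keep only ids found in both tables and compute z = y2 + y1 for each row.
merged = Dict(
    id => (x1 = old.x1, x2 = data[id].x2, z = data[id].y2 + old.y1)
    for (id, old) in old_data if haskey(data, id)
)

merged["a"]   # (x1 = 2, x2 = 1, z = 10)
```

Each lookup by `id` is a hash lookup, which is the property that makes the keyed representation attractive for this kind of merge.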

This is a valid, but somewhat special use case which may not justify changing the Tables API. I see two solutions:

  1. a manual conversion to a Dict,

  2. a DictTable type that supports both the Tables.jl interface and AbstractDict (I see no apparent conflict, but maybe I am missing something).
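The first option, a manual conversion, could be sketched roughly as follows. The input is assumed to be a NamedTuple of equal-length vectors, and `to_dict` is a hypothetical helper name, not a function from Tables.jl.

```julia
# A NamedTuple of equal-length vectors holding the example columns.
columns = (id = ["a", "b", "k", "w", "c"], x2 = [1, 2, 3, 3, 4], y2 = [4, 5, 6, 5, 5])

# Hypothetical helper: map each value of the key column to a Dict of the remaining columns.
function to_dict(columns::NamedTuple, keycol::Symbol)
    names = filter(!=(keycol), keys(columns))   # every column except the key column
    return Dict(
        columns[keycol][i] => Dict(nm => columns[nm][i] for nm in names)
        for i in eachindex(columns[keycol])
    )
end

table = to_dict(columns, :id)
table["a"]   # Dict(:x2 => 1, :y2 => 4)
```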


Yes. By extending Tables.jl, I meant extending it in a backward-compatible manner. Doing it that way would give every already-defined table type this new feature/iterator.

If adding this breaks the API in any way, then we should consider a new DictTables.jl package. If we want to define the type as a subtype of AbstractDict, then we will probably need DictTables.jl.

You can kinda already do that with

using DataFrames

df = DataFrame(a = ["a", "b", "c"], i = 1:3, j = 4:6)

dfg = groupby(df, :a)

for (key, group) in zip(keys(dfg), dfg)
    println(key)
    println(group)
end

The only thing missing is direct indexability. If you are just iterating through, the above is fine.


You can index a grouped data frame by key:

julia> dfg[tuple("a")]
1×3 SubDataFrame
│ Row │ a      │ i     │ j     │
│     │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ a      │ 1     │ 4     │

It returns a SubDataFrame rather than a DataFrameRow, but that is still easy to work with.
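If the key column is unique, so that each group holds exactly one row, the row itself can be pulled out of the group. A small sketch reusing the `df` from above; the variable names `sub` and `row` are just for illustration.

```julia
using DataFrames

df = DataFrame(a = ["a", "b", "c"], i = 1:3, j = 4:6)
dfg = groupby(df, :a)

sub = dfg[("a",)]          # SubDataFrame with every row whose key is "a"
row = only(eachrow(sub))   # the single DataFrameRow; errors if the group has more than one row

row.i   # 1
row.j   # 4
```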


Then there is really no need for DictTable. Everything works. But is the lookup backed by an index, i.e. is it fast, O(1) like a dictionary lookup?

Why is the best representation a Dict of Dicts? If the data is immutable, it could be a NamedTuple of NamedTuples. Or just a custom struct, something like:
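For instance, a NamedTuple of NamedTuples for part of the example table could look like the sketch below. Note that the keys become Symbols and are part of the type, so this probably only suits a small, fixed set of keys.

```julia
# Immutable nested representation: outer names are the keys, inner tuples are the rows.
table = (
    a = (x2 = 1, y2 = 4),
    b = (x2 = 2, y2 = 5),
    k = (x2 = 3, y2 = 6),
)

table.a.y2     # 4
table[:k].x2   # 3

# `pairs` gives Dict-like iteration, in definition order.
for (key, row) in pairs(table)
    println(key, " => ", row)
end
```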

struct KeyedRow{K, V} <: AbstractDict{K, V}
    key::K
    row::V
end

Something like this is maybe possible, but nothing has been formally proposed. We tend to try and keep the API surface area as simple and small as possible, but if there’s enough momentum, we could maybe figure something out.

Note that the StructTypes.jl package defines StructTypes.idproperty, which lets a custom struct declare which field is its “key”. This is used, for example, in the Strapping.jl package to identify unique rows when building custom structs from (Tables.jl-compatible) result sets.

So using the KeyedRow example from before, we’d tweak the definitions like:

using Tables, StructTypes

struct KeyedRow{V} <: Tables.AbstractRow
    key::Symbol   # the row's key
    row::V        # any valid Tables.jl row
end

# Tables.jl interface for row
Tables.columnnames(x::KeyedRow) = Tables.columnnames(getfield(x, :row))
Tables.getcolumn(x::KeyedRow, i::Int) = Tables.getcolumn(getfield(x, :row), i)
Tables.getcolumn(x::KeyedRow, nm::Symbol) = Tables.getcolumn(getfield(x, :row), nm)

# StructTypes.jl interface
StructTypes.StructType(::Type{<:KeyedRow}) = StructTypes.Struct()
StructTypes.idproperty(::Type{<:KeyedRow}) = :key

With this, you could “wrap” any valid Tables.jl row in the KeyedRow struct and provide the key property. You could then use the normal Tables.jl interface (Tables.rows, etc.), and a KeyedRow would act just like the row it wraps, except that you could also call StructTypes.idproperty on its type.
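A short usage sketch, repeating the definitions from above so the snippet runs on its own. It wraps a NamedTuple, which Tables.jl accepts as a row; the key `:a` and the column values are made up for the example.

```julia
using Tables, StructTypes

struct KeyedRow{V} <: Tables.AbstractRow
    key::Symbol
    row::V
end

# Forward the Tables.jl row interface to the wrapped row.
Tables.columnnames(x::KeyedRow) = Tables.columnnames(getfield(x, :row))
Tables.getcolumn(x::KeyedRow, i::Int) = Tables.getcolumn(getfield(x, :row), i)
Tables.getcolumn(x::KeyedRow, nm::Symbol) = Tables.getcolumn(getfield(x, :row), nm)

StructTypes.StructType(::Type{<:KeyedRow}) = StructTypes.Struct()
StructTypes.idproperty(::Type{<:KeyedRow}) = :key

row = KeyedRow(:a, (x2 = 1, y2 = 4))

Tables.columnnames(row)               # (:x2, :y2)
row.x2                                # 1, property access goes through Tables.getcolumn
getfield(row, :key)                   # :a, fields need getfield because AbstractRow overloads getproperty
StructTypes.idproperty(typeof(row))   # :key
```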


Yes, it is, since DataFrames.jl 0.21, thanks to @bkamins.
