Construct DataFrame From Uneven Named Tuples

quietlight · August 19, 2023, 1:59am

I am trying to construct a DataFrame from an array of named tuples where some key:value pairs may be missing from some tuples.

They come from querying Airtable using Airtable.jl. If you have missing values in your table, they are not present in the data returned by the Airtable API.

This is what an array may look like:

data = [(Number = 4, Name = “Abc Efgg”, Address = “48 Mont Rd”),
(Number = 6, Name = “Ruf Sly”, Address = “19A Keke ava”),
(Number = 10, Name = “Jack Bog”),
(Number = 5, Name = “Gid Hoo”, Address = “120 Mut Street”)]

I seem to be struggling to find a simple solution. Any ideas welcome.

Regards
David

Dan · August 19, 2023, 2:42am

One way would be:

cols = union(keys.(data)...)
df = DataFrame([c => get.(data,c, missing) for c in cols]...)

which gives:

4×3 DataFrame
 Row │ Number  Name      Address        
     │ Int64   String    String?        
─────┼──────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │     10  Jack Bog  missing        
   4 │      5  Gid Hoo   120 Mut Street
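
As a minimal sketch of why this works, using only Base Julia: get on a NamedTuple accepts a default value, so broadcasting it over the rows fills the gaps with missing. The variable name columns is introduced here purely for illustration.

```julia
# Rows with an uneven set of fields, as in the original post
data = [(Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd"),
        (Number = 6, Name = "Ruf Sly", Address = "19A Keke ava"),
        (Number = 10, Name = "Jack Bog"),
        (Number = 5, Name = "Gid Hoo", Address = "120 Mut Street")]

# Union of all field names, in order of first appearance
cols = union(keys.(data)...)

# One `name => vector` pair per column; absent fields become `missing`
columns = [c => get.(data, c, missing) for c in cols]
```

Each pair can then be splatted into the DataFrame constructor, as in the reply above.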

But the real question is how the annoying Unicode double quotes (“”) managed to get into the OP.

quietlight · August 19, 2023, 3:14am

Thank you for your kind reply. get! I forgot about that.

Regards
David

rocco_sprmnt21 · August 19, 2023, 6:52am

This scheme probably also works without the foreach function (that is, using only the named tuples and some splatting/broadcasting), but I haven’t found a way yet:

df=DataFrame()
foreach(d->push!(df,d,cols=:union),data)

I would have expected push! to work the same way in the following two cases:

push!([1,2],3,4)

push!(df,data...,cols=:union)

Oddly enough, it works this way:

push!.([df],data,cols=:union)

push!.([df],data,cols=:union)[1]
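
A possible explanation for the trailing [1], sketched with a plain Vector instead of a DataFrame so that it runs with Base Julia alone: broadcasting push! over a one-element container calls push! once per item and collects every return value, and each return value is the same mutated object.

```julia
v = [1, 2]

# Base push! accepts several items at once
push!(v, 3, 4)

# Broadcasting over the one-element container [v] calls push!(v, 5), then push!(v, 6)
results = push!.([v], [5, 6])

# Every element of `results` is the very same object as `v`,
# so indexing with [1] simply recovers it
same_object = all(r -> r === v, results)
```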

Trying append!(…, cols=:union), I would have expected it to handle the missing field:

julia> append!(df,data,cols=:union)
ERROR: type NamedTuple has no field Address

which Tables.dictrowtable handles instead:

append!(df,Tables.dictrowtable(data))

# or better

julia> DataFrame(Tables.dictrowtable(data))
4×3 DataFrame
 Row │ Number  Name      Address        
     │ Int64   String    String?
─────┼──────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │     10  Jack Bog  missing
   4 │      5  Gid Hoo   120 Mut Street

quietlight · August 19, 2023, 7:55am

Thanks for your reply. I will give it a go tomorrow. Julia is an interesting language for sure.

Regards
David

bkamins · August 19, 2023, 6:18pm

Yes, Tables.dictrowtable is the intended way to handle this:

julia> DataFrame(Tables.dictrowtable(data))
4×3 DataFrame
 Row │ Number  Name      Address
     │ Int64   String    String?
─────┼──────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │     10  Jack Bog  missing
   4 │      5  Gid Hoo   120 Mut Street

(The issue is that your problem is unrelated to DataFrames.jl; it is a consequence of the Tables.jl design. More functionality might be added to Tables.jl in the future to make working with such data easier.)

rocco_sprmnt21 · August 19, 2023, 7:45pm

Of my two unfulfilled expectations, I now have an idea of how one of them works: the one about append!(…, cols=:union).
From the following examples I understand that the kwarg combines internally “homogeneous” blocks (i.e. vectors of named tuples that all have the same fields), where one block may have fields that the other lacks.
I still can’t figure out why the push!() function can’t work on a list of named tuples, though.

julia> df1=DataFrame(data[Not(3)])
3×3 DataFrame
 Row │ Number  Name      Address        
     │ Int64   String    String
─────┼──────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │      5  Gid Hoo   120 Mut Street

julia> append!(df1,data[3:3],cols=:union)
4×3 DataFrame
 Row │ Number  Name      Address        
     │ Int64   String    String?
─────┼──────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │      5  Gid Hoo   120 Mut Street
   4 │     10  Jack Bog  missing
#--------------------
julia> df1=DataFrame(data[1:1])
1×3 DataFrame
 Row │ Number  Name      Address    
     │ Int64   String    String
─────┼──────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd

julia> append!(df1,data[2:4],cols=:union)
ERROR: type NamedTuple has no field Address
Stacktrace:
#-------------------
julia> df1=DataFrame(data[1:2])
2×3 DataFrame
 Row │ Number  Name      Address      
     │ Int64   String    String
─────┼────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava

julia> append!(df1,data[3:4],cols=:union)
4×3 DataFrame
 Row │ Number  Name      Address      
     │ Int64   String    String?
─────┼────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava
   3 │     10  Jack Bog  missing  #???????
   4 │      5  Gid Hoo   missing

The behavior also seems to depend on the order of the named tuples in the array being appended:


julia> df1=DataFrame(data[1:2])
2×3 DataFrame
 Row │ Number  Name      Address      
     │ Int64   String    String
─────┼────────────────────────────────
   1 │      4  Abc Efgg  48 Mont Rd
   2 │      6  Ruf Sly   19A Keke ava

julia> append!(df1,data[[4,3]],cols=:union)
ERROR: type NamedTuple has no field Address

bkamins · August 19, 2023, 8:16pm

Passing several rows to push! at once is not implemented, as it strains the compiler a lot. foreach should be used instead, just as you have proposed.

append!(df1,data[3:4],cols=:union)

As I have commented, the reason why this fails is unrelated to DataFrames.jl. This is an issue with Tables.jl. The problem is that data[3:4] is not a valid Tables.jl table. That is why Tables.dictrowtable is currently required. However, in “Initializing Dataframe from vector of named tuples: missing values” (Issue #3370, JuliaData/DataFrames.jl on GitHub) I proposed adding more support for non-homogeneous tables to Tables.jl.
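
An alternative sketch, not proposed in this thread: pad every row to the union of field names so that the vector becomes homogeneous before it reaches any table constructor. The helper name pad_row is hypothetical, and only Base Julia is used.

```julia
data = [(Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd"),
        (Number = 10, Name = "Jack Bog")]

# Union of all field names, in order of first appearance, as a tuple of Symbols
cols = Tuple(union(keys.(data)...))

# Hypothetical helper: rebuild a row with every name in `names`,
# using `missing` for fields the row does not have
pad_row(row, names) = NamedTuple{names}(map(n -> get(row, n, missing), names))

padded = [pad_row(row, cols) for row in data]
```

After padding, every row carries the same field names, so a schema inferred from the first row should describe all of them.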

aplavin · August 19, 2023, 10:29pm

How can one determine that data[3:4] is not a valid Tables.jl table? From reading the docs, it seems to fulfil the requirements for a row-based table.

rocco_sprmnt21 · August 20, 2023, 7:29am

Isn’t the istable function meant to check whether something is a valid Tables.jl table?

julia> data = [(Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd"),
       (Number = 6, Name = "Ruf Sly", Address = "19A Keke ava"),
       (Number = 10, Name = "Jack Bog"),
       (Number = 5, Name = "Gid Hoo", Address = "120 Mut Street")]
4-element Vector{NamedTuple}:
 (Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd")
 (Number = 6, Name = "Ruf Sly", Address = "19A Keke ava")
 (Number = 10, Name = "Jack Bog")
 (Number = 5, Name = "Gid Hoo", Address = "120 Mut Street")

julia> Tables.istable(data)
true

julia> Tables.istable(data[3:4])
true

julia> Tables.istable(data[[4,3]])
true

bkamins · August 20, 2023, 8:09am

Technically, you can check it by trying Tables.columns:

julia> Tables.columns(data)
ERROR: type NamedTuple has no field Address

In the documentation (of dictrowtable) you can read:

For “schema-less” input tables, dictrowtable employs a “column unioning” behavior, as opposed to inferring the schema from the first row like Tables.columns.

So, as you can read here, the columns of the first row of data are normally assumed to specify the columns of the table. That is why you get an error.

Also because of this, if you run the following operation:

julia> DataFrame(data[[3, 1, 2, 4]])
4×2 DataFrame
 Row │ Number  Name
     │ Int64   String
─────┼──────────────────
   1 │     10  Jack Bog
   2 │      4  Abc Efgg
   3 │      6  Ruf Sly
   4 │      5  Gid Hoo

it works and uses only 2 columns (those of the 3rd row of the original table).
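
Since the first row decides the schema, it can help to check up front whether all rows share the same field names. A minimal sketch in Base Julia; the helper name has_uniform_keys is hypothetical.

```julia
data = [(Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd"),
        (Number = 6, Name = "Ruf Sly", Address = "19A Keke ava"),
        (Number = 10, Name = "Jack Bog"),
        (Number = 5, Name = "Gid Hoo", Address = "120 Mut Street")]

# Hypothetical helper: true when every row has exactly the field names of the first row
has_uniform_keys(rows) = all(r -> keys(r) == keys(first(rows)), rows)

# The full vector is uneven because the third row lacks Address
uniform_all = has_uniform_keys(data)

# Leaving out the third row gives a homogeneous vector
uniform_sub = has_uniform_keys(data[[1, 2, 4]])
```

When the check fails, wrapping the data in Tables.dictrowtable, as recommended above, is the route suggested in this thread.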

bkamins · August 20, 2023, 8:12am

I would not recommend using Tables.istable in practice (unfortunately). See its docs:

Check if an object has specifically defined that it is a table. Note that not all valid tables will return true, since it’s possible to satisfy the Tables.jl interface at “run-time”

and

It is recommended that for users implementing MyType, they define only istable(::Type{MyType})

So, as you can see:

If istable returns false, it does not mean anything.
If istable returns true, this is typically determined at the TYPE level (not the instance level): the TYPE could have opted in to signal that it is a table, while the instance might violate some assumptions (this is the case for our data vector).

rocco_sprmnt21 · August 20, 2023, 8:18am

Would it be a bad idea (if it were possible) to transform (behind the scenes) the following expressions into the “good” one using dictrowtable?

df=DataFrame()

push!(df,data...,cols=:union)

append!(df,data,cols=:union)

bkamins · August 20, 2023, 8:23am

This is easily doable by adding the following definition (simplified, because we probably should not use recursion and we should handle all kwargs):

push!(df, d1, data...; cols) = push!(push!(df, d1, cols=cols), data..., cols=cols)

If you think it would be useful, can you please open an issue?
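
A non-recursive variant that forwards all keyword arguments could look roughly like the following. It is a sketch only: push_rows! is a hypothetical name, not part of DataFrames.jl, and it is exercised here on a plain Vector so that it runs with Base Julia alone.

```julia
# Hypothetical helper: push each item in turn, forwarding any keyword arguments
function push_rows!(dest, rows...; kwargs...)
    foreach(row -> push!(dest, row; kwargs...), rows)
    return dest
end

v = push_rows!([1, 2], 3, 4)

# With DataFrames.jl loaded, the intended call would be along the lines of
# push_rows!(df, data...; cols = :union)
```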

As for append!(df, data, cols=:union):

It cannot be fixed in DataFrames.jl. The reason is that it is Tables.jl that signals that data is not a valid table before even append! gets called. We would need something like append!(df,Tables.colunion(data)) and add Tables.colunion to Tables.jl (colunion name is tentative).

rocco_sprmnt21 · August 20, 2023, 8:38am

I think extending push! would be useful (if feasible without losing too much performance), as it is a direct extension of how push!() works for “normal” arrays.
I’ll open the issue right away. I would be grateful if you could point me to the right place to file it.

For the append! function, I read in the documentation that:

Add the rows of df2 to the end of df. If the second argument table is not an AbstractDataFrame then it is converted using DataFrame(table, copycols=false) before being appended.

Could you then do DataFrame(dictrowtable(table), copycols=false) before appending? (Am I making it too easy? I don’t want to sound presumptuous in making suggestions to you; I’m just trying to understand a bit more.)

bkamins · August 20, 2023, 8:56am

DataFrame( dictrowtable(table), copycols=false)

This is very inefficient (computationally expensive). Therefore, what we now propose is:

if you really need it, you can manually add the Tables.dictrowtable wrapper around data, and things work
to add another wrapper, which I tentatively called Tables.colunion, that would work like Tables.dictrowtable but would perform column unioning

I have opened the issue for the feature you asked for in Add support for multiple positional arguments in push!/pushfirst!/append!/prepend! · Issue #3371 · JuliaData/DataFrames.jl · GitHub.
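
The first option, wrapping the data manually, might look like the sketch below. It assumes DataFrames.jl and Tables.jl are installed; cols = :union is passed so that the Address column can be promoted to accept missing.

```julia
using DataFrames, Tables

data = [(Number = 4, Name = "Abc Efgg", Address = "48 Mont Rd"),
        (Number = 6, Name = "Ruf Sly", Address = "19A Keke ava"),
        (Number = 10, Name = "Jack Bog"),
        (Number = 5, Name = "Gid Hoo", Address = "120 Mut Street")]

# Start from the first row only, then append the remaining, uneven rows
df1 = DataFrame(data[1:1])

# dictrowtable performs column unioning, so the appended table has a full schema
append!(df1, Tables.dictrowtable(data[2:4]); cols = :union)
```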

aplavin · August 20, 2023, 11:06am

data seems to fulfill the requirements for a row table: rows(data) is an iterable of AbstractRow-like objects. So, if columns() doesn’t work with it, this is either a bug in columns, or some requirement is missing from the docs (on implementing the Tables interface).

From the dictrowtable documentation quoted above, it follows that “schema-less” tables are actually tables; their support is just not implemented in columns.

bkamins · August 20, 2023, 1:56pm

If you feel the behavior should be changed, can you please open an issue in Tables.jl? @quinnj should probably comment on this, since he maintains the package.

bkamins · August 20, 2023, 4:47pm
