Flatten dicts of dicts in DataFrame

stene · February 17, 2025, 1:59pm

I have a vector of dicts where some entries contains dicts which I want to process in a DafaFrame. I want to flatten the dicts inside the dicts to make it easier to work with the data. I cannot find the correct command.

DataFrame(a) works if a is a vector of dicts of dicts, but I cannot find out how to flatten a.

Example:
a1=Dict(“a”=>1, “b”=>Dict(“c”=>1, “d”=>2))
a=[a1]
DataFrame(a)

How to flatten the dict “b”

heliosdrm · February 17, 2025, 2:34pm

julia> flattenpair(p) = isa(p.second, Dict) ? p.second : Dict(p)
flattenpair (generic function with 1 method)

julia> reduce(merge, [flattenpair(p) for p in a1])
Dict{String, Any} with 3 entries:
  "c" => 1
  "a" => 1
  "d" => 2

julia> DataFrame(reduce(merge, [flattenpair(p) for p in a1]))
1×3 DataFrame
 Row │ a      c      d     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      2

Dan · February 17, 2025, 4:42pm

A tweak on the solution might be:

julia> flattenpair(p) = Dict(isa(p.second, Dict) ? "$(p.first)." .* keys(p.second) .=> values(p.second) : p)

julia> reduce(merge, [flattenpair(p) for p in a1])
Dict{String, Any} with 3 entries:
  "b.c" => 1
  "a"   => 1
  "b.d" => 2

julia> DataFrame(reduce(merge, [flattenpair(p) for p in a1]))
1×3 DataFrame
 Row │ a      b.c    b.d   
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      2

which adds the original key into the column names in the DataFrame.

drizk1 · February 17, 2025, 11:40pm

@stene some great solutions are above, i just wanted to throw out one more.

using TidierData
a1=Dict("a"=> 1, "b"=> Dict("c"=>1, "d"=>2))
df = DataFrame(a1)

@unnest_wider(df, b)
1×3 DataFrame
 Row │ a      c      d     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      2

stene · February 18, 2025, 3:24am

@unnest_wider is great, thanks for sharing it! Albeit it will not work on some JSON data, since keys might be missing. On a more complicated case I got:
KeyError: key “xxxx” not found. Will report a separate bug, since the intention of the feature is great. A ready made feature was what I was looking for.

After bugs are fixed @unnest_wider is definitely the solution for flattening dicts or unnesting dicts.

stene · February 18, 2025, 4:08am

transform(df, :b => AsTable) is another solution built into DataFrames. It will not work for JSON data where keys are missing.

heliosdrm · February 18, 2025, 8:42am

Then I think you could uncheck my answer and mark @drizk1 's as “solution”, to make your preferred solution easier to discover.

rocco_sprmnt21 · February 18, 2025, 9:30am

Maybe there is something here that (suitably adapted) could be useful for you. Can you provide some minimal but exhaustive examples of the cases that interest you?

stene · February 18, 2025, 10:08am

A more complicated example is this one and this is a toy example of what you might get after parsing a JSON file. It reproduces the exact same error as using a real world JSON file.

a1=Dict(“a”=>1, “b”=>Dict(“c”=>1, “d”=>2))
a2=Dict(“a”=>1, “b”=>Dict(“c”=>1))
a=[a1;a2]
df=DataFrame(a)

Will not work

bug=@unnest_wider(df, b)

Will not work either

bug2=transform(df, :b => AsTable)

Will test some other solutions, but currently the first solution (with some extra code) seems to work.

drizk1 · February 18, 2025, 10:16am

hey thanks for catching this. the fix was quite straightforward and is now in a PR. Please let us know if you run into other edge cases

using Pkg; Pkg.add(url = "https://github.com/TidierOrg/TidierData.jl", rev = "unnest_wider_bugfix")
a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>1, "b"=>Dict("c"=>1)); a=[a1;a2]; df=DataFrame(a);

@unnest_wider(df, b)
2×3 DataFrame
 Row │ a      c      d       
     │ Int64  Int64  Int64?  
─────┼───────────────────────
   1 │     1      1        2
   2 │     1      1  missing 

#fyi you can also unnest multiple columns at once
df2 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]]);

@unnest_wider(df2, b:c, names_sep = "_")
2×5 DataFrame
 Row │ a      b_1    b_2    c_1    c_2   
     │ Int64  Int64  Int64  Int64  Int64 
─────┼───────────────────────────────────
   1 │     1      1      2      5      6
   2 │     2      3      4      7      8

stene · February 18, 2025, 10:34am

Thanks, I am also posting a toy example in JSON which should work:

test = JSON.parse(“{"arr":[{"a":1,"b":2},{"a":2}]}”)
df=DataFrame(test)
noLongerABug=@unnest_wider(df, arr)

rocco_sprmnt21 · February 18, 2025, 12:03pm

Read nesting JSON and flat nested DataFrame

allflatnt(vd::Vector{Dict{Symbol, Any}})= all(allflatnt, vd)


allflatnt(d::Dict{Symbol, Any})=all(e->!(last(e) isa Vector{<:Dict} || last(e) isa Dict),d)

function nestdf(vd::Vector{Dict{Symbol, Any}})
  DataFrame(Tables.dictrowtable(nestdict.(vd)))
end

function nestdf(d::Dict{Symbol, Any})
    tdf=DataFrame(Tables.dictrowtable(nestdict(d)))
    if names(tdf)==["first","second"]
        rename!(tdf, [:first,:second].=>[:key,:value])
    end
    #tdf
end

function nestdict(d::Dict{Symbol, Any})
    for (k,v) in d
       if v isa Vector{<:Dict} || v isa Dict
            if allflatnt(v)
                tdf=DataFrame(Tables.dictrowtable(v))
                if names(tdf)==["first","second"]
                    rename!(tdf, [:first,:second].=>[:key,:value])
                end
                merge!(d, Dict(k=>tdf))
            else
                merge!(d, Dict(k=>nestdf(v)))
            end
        end
    end
    d 
end



nestdf(jsobj::JSON3.Object)=nestdf(copy(jsobj))
nestdf(jsobj::JSON3.Array)=nestdf(copy(jsobj))



using YFinance, JSON3, DataFrames

aapl_json=get_quoteSummary("AAPL")


ndf=nestdf(aapl_json)



function expand2(ndf, col)
    df2nt(sdf)=(;zip(Symbol.(string.(col,'.',names(sdf))),eachcol(sdf))...)
    rexp=findall(er-> er[col] isa DataFrame, eachrow(ndf))
    rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
    edf=hcat(ndf[rexp,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

dfexp_l1=expand2(ndf,:value)
dfexp_l2=expand2(dfexp_l1,Symbol("value.value"))

Some less toy-like outputs

julia> aapl_json=get_quoteSummary("AAPL")
JSON3.Object{Vector{UInt8}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}} with 31 entries:
  :assetProfile                      => {…
  :recommendationTrend               => {…
  :cashflowStatementHistory          => {…
  :indexTrend                        => {…
  :defaultKeyStatistics              => {…
  :industryTrend                     => {…
  :quoteType                         => {…
  :incomeStatementHistory            => {…
  :fundOwnership                     => {…
  :summaryDetail                     => {…
  :insiderHolders                    => {…
  :calendarEvents                    => {…
  :upgradeDowngradeHistory           => {…
  :price                             => {…
  :balanceSheetHistory               => {…
  :earningsTrend                     => {…
  :secFilings                        => {…
  :institutionOwnership              => {…
  :majorHoldersBreakdown             => {…
  :balanceSheetHistoryQuarterly      => {…
  :earningsHistory                   => {…
  :majorDirectHolders                => {…
  :esgScores                         => {…
  :summaryProfile                    => {…
  :netSharePurchaseActivity          => {…
  :insiderTransactions               => {…
  :sectorTrend                       => {…
  :incomeStatementHistoryQuarterly   => {…
  :cashflowStatementHistoryQuarterly => {…
  :earnings                          => {…
  :financialData                     => {…

julia> ndf=nestdf(aapl_json)
31×2 DataFrame
 Row │ key                                value          
     │ Symbol                             DataFrame
─────┼───────────────────────────────────────────────────
   1 │ cashflowStatementHistoryQuarterly  2×2 DataFrame
   2 │ industryTrend                      3×2 DataFrame
   3 │ earningsTrend                      2×2 DataFrame
   4 │ incomeStatementHistory             2×2 DataFrame
   5 │ price                              36×2 DataFrame
   6 │ upgradeDowngradeHistory            2×2 DataFrame
   7 │ institutionOwnership               2×2 DataFrame
   8 │ fundOwnership                      2×2 DataFrame
   9 │ summaryDetail                      42×2 DataFrame
  10 │ netSharePurchaseActivity           12×2 DataFrame
  11 │ earnings                           4×2 DataFrame
  12 │ insiderHolders                     2×2 DataFrame
  13 │ calendarEvents                     4×2 DataFrame
  14 │ summaryProfile                     19×2 DataFrame
  15 │ majorDirectHolders                 2×2 DataFrame
  16 │ assetProfile                       26×2 DataFrame
  17 │ recommendationTrend                2×2 DataFrame
  18 │ balanceSheetHistory                2×2 DataFrame
  19 │ financialData                      30×2 DataFrame
  20 │ indexTrend                         3×2 DataFrame
  21 │ balanceSheetHistoryQuarterly       2×2 DataFrame
  22 │ insiderTransactions                2×2 DataFrame
  23 │ cashflowStatementHistory           2×2 DataFrame
  24 │ earningsHistory                    2×2 DataFrame
  25 │ quoteType                          13×2 DataFrame
  26 │ sectorTrend                        3×2 DataFrame
  27 │ esgScores                          36×2 DataFrame
  28 │ secFilings                         2×2 DataFrame
  29 │ defaultKeyStatistics               40×2 DataFrame
  30 │ majorHoldersBreakdown              5×2 DataFrame
  31 │ incomeStatementHistoryQuarterly    2×2 DataFrame



julia> expand2(ndf,:value)
308×3 DataFrame
 Row │ key                                value.key                     value.value ⋯
     │ Symbol                             Symbol?                       Any         ⋯
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ cashflowStatementHistoryQuarterly  cashflowStatements            4×3 DataFra ⋯
   2 │ cashflowStatementHistoryQuarterly  maxAge                        86400
   3 │ industryTrend                      symbol
   4 │ industryTrend                      estimates                     Any[]
   5 │ industryTrend                      maxAge                        1           ⋯
   6 │ earningsTrend                      trend                         4×8 DataFra
   7 │ earningsTrend                      maxAge                        1
   8 │ incomeStatementHistory             incomeStatementHistory        4×24 DataFr
   9 │ incomeStatementHistory             maxAge                        86400       ⋯
  10 │ price                              regularMarketChangePercent    0.0127107
  11 │ price                              currency                      USD
  12 │ price                              priceHint                     2
  13 │ price                              postMarketSource              FREE_REALTI ⋯
  14 │ price                              regularMarketTime             1739566800
  15 │ price                              regularMarketSource           FREE_REALTI
  16 │ price                              regularMarketPrice            244.6
  ⋮  │                 ⋮                               ⋮                      ⋮     ⋱
 293 │ defaultKeyStatistics               beta                          1.2         ⋯
 294 │ defaultKeyStatistics               sharesShortPreviousMonthDate  1735603200
 295 │ defaultKeyStatistics               forwardEps                    8.31
 296 │ defaultKeyStatistics               lastFiscalYearEnd             1727481600
 297 │ defaultKeyStatistics               sharesOutstanding             15022100480 ⋯
 298 │ defaultKeyStatistics               enterpriseValue               37174231695
 299 │ defaultKeyStatistics               shortRatio                    2.12
 300 │ defaultKeyStatistics               nextFiscalYearEnd             1759017600
 301 │ defaultKeyStatistics               mostRecentQuarter             1735344000  ⋯
 302 │ majorHoldersBreakdown              institutionsPercentHeld       0.62682
 303 │ majorHoldersBreakdown              institutionsFloatPercentHeld  0.64148
 304 │ majorHoldersBreakdown              insidersPercentHeld           0.02286
 305 │ majorHoldersBreakdown              institutionsCount             6848        ⋯
 306 │ majorHoldersBreakdown              maxAge                        1
 307 │ incomeStatementHistoryQuarterly    incomeStatementHistory        4×24 DataFr
 308 │ incomeStatementHistoryQuarterly    maxAge                        86400
                                                        1 column and 276 rows omitted


julia> dfexp_l2=expand2(dfexp_l1,Symbol("value.value"))
1569×84 DataFrame
  Row │ key                                value.key                   value.value.netIncome ⋯
      │ Symbol                             Symbol?                     Any                   ⋯
──────┼───────────────────────────────────────────────────────────────────────────────────────
    1 │ cashflowStatementHistoryQuarterly  maxAge                      missing               ⋯
    2 │ industryTrend                      symbol                      missing
    3 │ industryTrend                      estimates                   missing
    4 │ industryTrend                      maxAge                      missing
    5 │ earningsTrend                      maxAge                      missing               ⋯
    6 │ incomeStatementHistory             maxAge                      missing
    7 │ price                              regularMarketChangePercent  missing
    8 │ price                              currency                    missing
    9 │ price                              priceHint                   missing               ⋯
   10 │ price                              postMarketSource            missing
   11 │ price                              regularMarketTime           missing
   12 │ price                              regularMarketSource         missing
   13 │ price                              regularMarketPrice          missing               ⋯
   14 │ price                              marketCap                   missing
   15 │ price                              longName                    missing
   16 │ price                              toCurrency                  missing
  ⋮   │                 ⋮                              ⋮                         ⋮           ⋱
 1554 │ secFilings                         filings                     missing               ⋯
 1555 │ secFilings                         filings                     missing
 1556 │ secFilings                         filings                     missing
 1557 │ secFilings                         filings                     missing
 1558 │ secFilings                         filings                     missing               ⋯
 1559 │ secFilings                         filings                     missing
 1560 │ secFilings                         filings                     missing
 1561 │ secFilings                         filings                     missing
 1562 │ secFilings                         filings                     missing               ⋯
 1563 │ secFilings                         filings                     missing
 1564 │ secFilings                         filings                     missing
 1565 │ secFilings                         filings                     missing
 1566 │ incomeStatementHistoryQuarterly    incomeStatementHistory      3×2 DataFrame         ⋯
 1567 │ incomeStatementHistoryQuarterly    incomeStatementHistory      3×2 DataFrame
 1568 │ incomeStatementHistoryQuarterly    incomeStatementHistory      3×2 DataFrame
 1569 │ incomeStatementHistoryQuarterly    incomeStatementHistory      3×2 DataFrame
                                                              81 columns and 1537 rows omitted

julia> names(dfexp_l2)
84-element Vector{String}:
 "key"
 "value.key"
 "value.value.netIncome"
 "value.value.endDate"
 "value.value.maxAge"
 "value.value.growth"
 "value.value.earningsEstimate"
 "value.value.period"
 "value.value.epsTrend"
 "value.value.epsRevisions"
 "value.value.revenueEstimate"
 "value.value.totalRevenue"
 "value.value.totalOperatingExpenses"
 "value.value.grossProfit"
 "value.value.nonRecurring"
 "value.value.otherOperatingExpenses"
 "value.value.minorityInterest"
 "value.value.sellingGeneralAdministrative"
 ⋮
 "value.value.transactionText"
 "value.value.shares"
 "value.value.startDate"
 "value.value.filerUrl"
 "value.value.filerRelation"
 "value.value.moneyText"
 "value.value.ownership"
 "value.value.epsActual"
 "value.value.currency"
 "value.value.epsDifference"
 "value.value.surprisePercent"
 "value.value.epsEstimate"
 "value.value.quarter"
 "value.value.type"
 "value.value.date"
 "value.value.edgarUrl"
 "value.value.epochDate"
 "value.value.exhibits"


# The tree of nested dataframes is, albeit crudely, "navigable".

julia> ndf.value[16].value[26].totalPay[1:3]
3-element Vector{Union{Missing, DataFrame}}:
 3×2 DataFrame
 Row │ key      value      
     │ Symbol   Union…
─────┼─────────────────────
   1 │ fmt      16.24M
   2 │ longFmt  16,239,562
   3 │ raw      16239562
 3×2 DataFrame
 Row │ key      value     
     │ Symbol   Union…
─────┼────────────────────
   1 │ fmt      4.64M
   2 │ longFmt  4,637,585
   3 │ raw      4637585
 3×2 DataFrame
 Row │ key      value     
     │ Symbol   Union…
─────┼────────────────────
   1 │ fmt      4.62M
   2 │ longFmt  4,618,064
   3 │ raw      4618064

Perhaps with some modification these functions could suit your needs.

stene · February 18, 2025, 12:47pm

Thanks, but with the latest TidierData.jl, their code works with real world large scale files.

rocco_sprmnt21 · February 18, 2025, 4:01pm

It’s not meant to replace the package TidierData at all, but if one don’t want any dependencies other than DataFrames, here’s a quick adaptation of the expand2() function, , which does something similar

julia> function expanDict_df(ndf, col)
           df2nt(sdf)=(;zip(Symbol.(string.(col,'.',keys(sdf))),values(sdf))...)
           rexp=findall(er-> er[col] isa Dict, eachrow(ndf))
           rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
          hcat(ndf[rexp,Not(col)],DataFrame(rt))
       end
expanDict_df (generic function with 1 method)

julia>        expand2(df, :b)
2×3 DataFrame
 Row │ a      b.c    b.d     
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1        2
   2 │     1      1  missing

rocco_sprmnt21 · February 18, 2025, 4:08pm

a variant to choose how to connect suffixes

julia> function expanDict_df(ndf, col; suff='_')
           df2nt(sdf)=(;zip(Symbol.(string.(col,suff,keys(sdf))),values(sdf))...)
           rexp=findall(er-> er[col] isa Dict, eachrow(ndf))
           rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
          hcat(ndf[rexp,Not(col)],DataFrame(rt))
       end
expanDict_df (generic function with 2 methods)

julia> expanDict_df(df, :b)
2×3 DataFrame
 Row │ a      b_c    b_d     
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1        2
   2 │     1      1  missing

rocco_sprmnt21 · February 19, 2025, 8:49am

I don’t know if this is a real case, but, in theory, we could have a row of data like a0=Dict("a"=>0, "b"=>0)

In this case, I wonder how it should be treated.

how to unnest no-nest?

using DataFrames, TidierData

function expanDict_df(ndf, col; suff='_')
    df2nt(sdf)=(;zip(Symbol.(string.(col,suff,keys(sdf))),values(sdf))...)
    rexp=findall(er-> er[col] isa Dict, eachrow(ndf))
    rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp, col])...))   
    edf=hcat(ndf[rexp,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Dict ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , Dict), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

julia> a0=Dict("a"=>0, "b"=>0)
Dict{String, Int64} with 2 entries:
  "b" => 0
  "a" => 0

julia> a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2))
Dict{String, Any} with 2 entries:
  "b" => Dict("c"=>2)
  "a" => 2

julia> a3=Dict("a"=>3, "b"=>Dict("c"=>3))
Dict{String, Any} with 2 entries:
  "b" => Dict("c"=>3)
  "a" => 3

julia> a=[a0;a1;a2;a3]
4-element Vector{Dict{String}}:
 Dict("b" => 0, "a" => 0)
 Dict{String, Any}("b" => Dict("c" => 1, "d" => 2), "a" => 1)
 Dict{String, Any}("b" => Dict("c" => 2), "a" => 2)
 Dict{String, Any}("b" => Dict("c" => 3), "a" => 3)

julia> df=DataFrame(a)
4×2 DataFrame
 Row │ b                     a     
     │ Any                   Int64
─────┼─────────────────────────────
   1 │ 0                         0
   2 │ Dict("c"=>1, "d"=>2)      1
   3 │ Dict("c"=>2)              2
   4 │ Dict("c"=>3)              3

julia> expanDict_df(df, :b)
4×3 DataFrame
 Row │ a      b_c      b_d     
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     0  missing  missing
   2 │     1        1        2
   3 │     2        2  missing
   4 │     3        3  missing

julia> udfb= @unnest_wider(df, b)
ERROR: Column b contains neither dictionaries nor arrays nor DataFrames
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:35
 [2] unnest_wider(df::DataFrame, cols::Vector{Symbol}; names_sep::Nothing)    
   @ TidierData C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:66
 [3] top-level scope
   @ C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:95

julia> a=[a1;a0;a2;a3]
4-element Vector{Dict{String}}:
 Dict{String, Any}("b" => Dict("c" => 1, "d" => 2), "a" => 1)
 Dict("b" => 0, "a" => 0)
 Dict{String, Any}("b" => Dict("c" => 2), "a" => 2)
 Dict{String, Any}("b" => Dict("c" => 3), "a" => 3)

julia> df=DataFrame(a)
4×2 DataFrame
 Row │ b                     a     
     │ Any                   Int64
─────┼─────────────────────────────
   1 │ Dict("c"=>1, "d"=>2)      1
   2 │ 0                         0
   3 │ Dict("c"=>2)              2
   4 │ Dict("c"=>3)              3

julia> expanDict_df(df, :b)
4×3 DataFrame
 Row │ a      b_c      b_d     
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     0  missing  missing
   2 │     1        1        2
   3 │     2        2  missing
   4 │     3        3  missing

julia> udfb= @unnest_wider(df, b)
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type String

Closest candidates are:
  convert(::Type{String}, ::StringManipulation.Decoration)
   @ StringManipulation C:\Users\sprmn\.julia\packages\StringManipulation\bMZ2A\src\decorations.jl:365
  convert(::Type{String}, ::Base.JuliaSyntax.Kind)
   @ Base C:\workdir\base\JuliaSyntax\src\kinds.jl:975
  convert(::Type{String}, ::String)
   @ Base essentials.jl:321
  ...

Stacktrace:
 [1] setindex!(h::Dict{String, Nothing}, v0::Nothing, key0::Int64)
   @ Base .\dict.jl:367
 [2] push!(s::Set{String}, x::Int64)
   @ Base .\set.jl:103
 [3] union!(s::Set{String}, itr::Base.OneTo{Int64})
   @ Base .\abstractset.jl:106
 [4] unnest_wider(df::DataFrame, cols::Vector{Symbol}; names_sep::Nothing)    
   @ TidierData C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:50
 [5] top-level scope
   @ C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:95

maybe this is more realistic.
How should the case of a second (or even more) level of nesting be treated?

julia> a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2))
Dict{String, Any} with 2 entries:
  "b" => Dict("c"=>2)
  "a" => 2

julia> a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2))
Dict{String, Any} with 2 entries:
  "b" => Dict("c"=>1, "d"=>2)
  "a" => 1

julia> a2=Dict("a"=>2, "b"=>Dict("c"=>2))
Dict{String, Any} with 2 entries:
  "b" => Dict("c"=>2)
  "a" => 2

julia> a3=Dict("a"=>3, "b"=>Dict("d"=>3))
Dict{String, Any} with 2 entries:
  "b" => Dict("d"=>3)
  "a" => 3

julia> a4=Dict("a"=>4, "b"=>Dict("c"=>41, "d"=>Dict("da"=>411, "db"=>Dict("dbc"=>4111))));

julia> a=[a1;a2;a3;a4]
4-element Vector{Dict{String, Any}}:
 Dict("b" => Dict("c" => 1, "d" => 2), "a" => 1)
 Dict("b" => Dict("c" => 2), "a" => 2)
 Dict("b" => Dict("d" => 3), "a" => 3)
 Dict("b" => Dict{String, Any}("c" => 41, "d" => Dict{String, Any}("da" => 411, "db" => Dict("dbc" => 4111))), "a" => 4)

julia> df=DataFrame(a)
4×2 DataFrame
 Row │ b                                  a     
     │ Dict…                              Int64
─────┼──────────────────────────────────────────
   1 │ Dict("c"=>1, "d"=>2)                   1
   2 │ Dict("c"=>2)                           2
   3 │ Dict("d"=>3)                           3
   4 │ Dict{String, Any}("c"=>41, "d"=>…      4

julia> udfb= @unnest_wider(df, b)
4×3 DataFrame
 Row │ a      c        d                                 
     │ Int64  Int64?   Any
─────┼───────────────────────────────────────────────────
   1 │     1        1  2
   2 │     2        2  missing
   3 │     3  missing  3
   4 │     4       41  Dict{String, Any}("da"=>411, "db…

julia> udfbd= @unnest_wider(udfb, d)
ERROR: Column d contains neither dictionaries nor arrays nor DataFrames
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:35
 [2] unnest_wider(df::DataFrame, cols::Vector{Symbol}; names_sep::Nothing)    
   @ TidierData C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:66
 [3] top-level scope
   @ C:\Users\sprmn\.julia\packages\TidierData\A0yDz\src\nests.jl:95

drizk1 · February 19, 2025, 11:15am

so the first one is now addressed.

julia> a0=Dict("a"=>0, "b"=>0);  a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2)); a3=Dict("a"=>3, "b"=>Dict("c"=>3));

julia> a=[a0;a1;a2;a3];

julia> df3=DataFrame(a)
4×2 DataFrame
 Row │ b                     a     
     │ Any                   Int64 
─────┼─────────────────────────────
   1 │ 0                         0
   2 │ Dict("c"=>1, "d"=>2)      1
   3 │ Dict("c"=>2)              2
   4 │ Dict("c"=>3)              3

julia> @unnest_wider(df3, b)
4×3 DataFrame
 Row │ a      c        d       
     │ Int64  Int64?   Int64?  
─────┼─────────────────────────
   1 │     0  missing  missing 
   2 │     1        1        2
   3 │     2        2  missing 
   4 │     3        3  missing

yea this last one is definitely quite an edge case, but the way unnesting behaves right now, its probably best to unnest longer than wider. the case presented above is particularly prickly because the range of different names. as it stands right this would be the outcome

julia> a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2)); a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2)); a3=Dict("a"=>3, "b"=>Dict("d"=>3));  a4=Dict("a"=>4, "b"=>Dict("c"=>41, "d"=>Dict("da"=>411, "db"=>Dict("dbc"=>4111))));

julia> a=[a1;a2;a3;a4];

julia> df=DataFrame(a)
4×2 DataFrame
 Row │ b                                  a     
     │ Dict…                              Int64 
─────┼──────────────────────────────────────────
   1 │ Dict("c"=>1, "d"=>2)                   1
   2 │ Dict("c"=>2)                           2
   3 │ Dict("d"=>3)                           3
   4 │ Dict{String, Any}("c"=>41, "d"=>…      4

julia> @chain df begin 
           @unnest_wider b
           @unnest_longer d
           @unnest_wider d
       end
4×4 DataFrame
 Row │ a      c        da       db                
     │ Int64  Int64?   Int64?   Dict…?            
─────┼────────────────────────────────────────────
   1 │     1        1  missing  missing           
   2 │     3  missing  missing  missing           
   3 │     4       41      411  missing           
   4 │     4       41  missing  Dict("dbc"=>4111)

Topic		Replies	Views
Flattening YFinance.jl JSON result into a DataFrame New to Julia question	22	1301	April 11, 2023
Dash datatable using a nested dict anyone done it? New to Julia dataframes , dash	7	1306	June 15, 2022
Unnesting columns of a data frame containing arrays General Usage question , dataframes , data_structures	1	509	December 9, 2020
Nested access of DataFrame New to Julia dataframes	8	672	June 25, 2021
[Pre-ANN/RFC] ExpandNestedData.jl (Previously Normalize.jl) Package Announcements	21	1483	December 7, 2022

Flatten dicts of dicts in DataFrame

Will not work

Will not work either

Related topics