Flattening YFinance.jl JSON result into a DataFrame

I am sorry. Yes, that’s what I meant.

I am not sure if this is the most elegant way, but you could combine the dictionaries first, then add b the combined dictionary, and then use the final dictionary to initialize the dataframe.

julia> d1 = Dict([:aa=>1, :bb=>2])
Dict{Symbol, Int64} with 2 entries:
  :aa => 1
  :bb => 2

julia> d2 = Dict([:aa=>3, :bb=>4])
Dict{Symbol, Int64} with 2 entries:
  :aa => 3
  :bb => 4

julia> dicts = [d1, d2]
2-element Vector{Dict{Symbol, Int64}}:
 Dict(:aa => 1, :bb => 2)
 Dict(:aa => 3, :bb => 4)

# The type `Dict{Symbol, Any}` is required, 
# otherwise you won't be able to insert the column `b` with string values
julia> merged = Dict{Symbol, Any}(k => getindex.([d1, d2], k) for k in keys(d1))
Dict{Symbol, Any} with 2 entries:
  :aa => [1, 3]
  :bb => [2, 4]

julia> merged[:b] = ["one", "two"]
2-element Vector{String}:
 "one"
 "two"

julia> DataFrame(merged)
2×3 DataFrame
 Row │ aa     b       bb
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  one         2
   2 │     3  two         4

A slightly shorter version:

julia> [DataFrame(df4.a) df4[:, [:b]]]
2×3 DataFrame
 Row │ aa     bb     b
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1      2  one
   2 │     3      4  two

DataFrame already has a vector-of-dicts constructor, so I’m just using that and then hcat on the other column.

4 Likes

barucden,

Your solution seems to work, but it is a little more complex of what I expected

Thank you

nilshg,

your solution is simple and elegant, thank you

my actual DataFrame comes from a JSON file. So, in its nature, it is recursive, that is, their columns may contain Dictionaries that contain other Dictionaries inside them.

Can your solution be applied recursively, that is, without konwing in advance the number of levels of dictionaries it contains?

Not in general, no. If you want to read JSON data into DataFrames maybe use the JSONTables package which has this as its intended use case?

2 Likes

Are you sure you’ll always be able to coerce the data into a tabular format? If so, the JSONTables.jl suggestion should be fine. Otherwise, you might consider just using JSON.jl or JSON3.jl to read & manipulate the data. If the structure of the data is always the same, and cannot be nicely represented in tabular format, you might also consider reading it into a custom struct. Sometimes, this makes it easier to write functions that do the subsequent transformations/manipulations/etc.

1 Like
julia> transform(df4,:a=>AsTable)
2×4 DataFrame
 Row │ a                     b       aa     bb    
     │ Dict…                 String  Int64  Int64
─────┼────────────────────────────────────────────
   1 │ Dict(:aa=>1, :bb=>2)  one         1      2
   2 │ Dict(:aa=>3, :bb=>4)  two         3      4
2 Likes

The data is stock market data, from Yahoo.

I’m using the YFinance package.

Data isread in JSON format. Sometimes, it is easy to convert it to DataFrame, some other times it is not.

Since data comes from Yahho, theoretically it always follows the same format.

However, sometimes the available information may be incomple. That is why it cannot be assured that the structure of the data will always be the same. I’ll depent on what’s available for each ticker.

The YFinance.jl docs have lots of examples of fetching results and converting to DataFrame.

https://eohne.github.io/YFinance.jl/dev/

Also, thanks for informing me about this package. I only knew of a much older and unmaintained package that did this. Glad someone updated it.

3 Likes

If the JSON struct is nested, you can use

mrufsvold/ExpandNestedData.jl (github.com)

I haven’t registered it yet because I’m still honing a couple pieces of it, but it should be good to go for most applications.

1 Like

Are you referring to the really old Yahoo Finance package or MarketData?

Another recent alternative is

1 Like

Yea that one, although it looks like it has seen some recent activity as well.

YFinance has lots of information, including free fundamental data coming from Yahoo.
MarketData has only historical prices. Therefore, it might be a good tool to use.

2 Likes

Sorry this is a bit off topic.
YFinance guy here. Which data are you trying to get into what format? Could be a general improvement to the package…

If you want to do some tests, I submit a script that recursively reads a json file that contains multi-level nested tables.
The result is a dataframe containing in some cells a dataframe as value.
I add a function for selective unpacking of the columns of interest.
PS. I didn’t test on files other than the attached one that I found on a web page.

julia> using JSON3

julia> using DataFrames

the json string
julia> jdata = """[
           {
               "name":"bob",
               "salary":13000,
               "friends":[
                   {
                       "name": "sarah",
                       "salary":10000
                   },
                   {
                       "name": "bill",
                       "salary":5000
                   }
               ]
           },
           {
               "name":"marge",
               "salary":10000,
               "friends":[
                   {
                       "name": "rhonda",
                       "salary":10000
                   },
                   {
                       "name": "mike",
                       "salary":5000,
                       "hobbies":[
                           {
                               "name":"surfing",
                               "frequency":10
                           },
                           {
                               "name":"surfing",
                               "frequency":15
                           }
                       ]
                   }
               ]
           },
           {
               "name":"joe",
               "salary":10000,
               "friends":[
                   {
                       "name": "harry",
                       "salary":10000
                   },
                   {
                       "name": "sally",
                       "salary":5000
                   }
               ]
           }
       ]"""
"[\n    {\n        \"name\":\"bob\",\n        \"salary\":13000,\n        \"friends\":[\n            {\n                \"name\": \"sarah\",\n                \"salary\":10000\n            },\n            {\n                \"name\": \"bill\",\n  " ⋯ 691 bytes ⋯ " \"friends\":[\n            {\n                \"name\": \"harry\",\n                \"salary\":10000\n            },\

julia> jsobj =  JSON3.read(jdata);
the function to read nested json files to dataframe
julia> function allflatnt(vnt)
           for i in eachindex(vnt)
               for k in keys(vnt[i])
                   v=vnt[i][k]
                   if v isa Vector{<:Dict}
                       return false
                   end
               end
           end
           true
       end
allflatnt (generic function with 1 method)

julia> function nestdf(ant)
           ANT=copy(ant)
           for i in eachindex(ANT)
               for (k,v) in ANT[i]
                   if v isa Vector{<:Dict}
                       if allflatnt(v)
                           ANT[i]=merge(ANT[i], Dict(k=>DataFrame(Tables.dictrowtable(v))))    
                       else
                           ANT[i]=merge(ANT[i], Dict(k=>nestdf(v)))
                       end
                   end
               end
           end
           DataFrame(Tables.dictrowtable(ANT))
       end
nestdf (generic function with 1 method)

julia> ndf=nestdf(jsobj)
3×3 DataFrame
 Row │ name    salary  friends       
     │ String  Int64   DataFrame
─────┼───────────────────────────────
   1 │ bob      13000  2×2 DataFrame
   2 │ marge    10000  2×3 DataFrame
   3 │ joe      10000  2×2 DataFrame

the expansion function

function expand(ndf, col)
    df2nt(df)=(;zip(Symbol.(names(df)),eachcol(df))...)
    rexp=filter(er-> er[col] isa DataFrame, ndf)
    foreach(subdf->rename!(n->string(col,'.')*n  ,subdf), rexp[:,col])
    rt=Tables.dictrowtable(vcat(df2nt.(rexp[:,col])...))
    edf=hcat(rexp[:,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

julia> endf=expand(ndf,:friends)
6×5 DataFrame
 Row │ name    salary  friends.name  friends.salary  friends.hobbies 
     │ String  Int64   String?       Int64?          Any
─────┼───────────────────────────────────────────────────────────────
   1 │ bob      13000  sarah                  10000  missing
   2 │ bob      13000  bill                    5000  missing
   3 │ marge    10000  rhonda                 10000  missing
   4 │ marge    10000  mike                    5000  2×2 DataFrame
   5 │ joe      10000  harry                  10000  missing
   6 │ joe      10000  sally                   5000  missing



julia> expand(endf, Symbol("friends.hobbies"))
7×6 DataFrame
 Row │ name    salary  friends.name  friends.salary  friends.hobbies.frequency  friends.hobbies.name         
     │ String  Int64   String?       Int64?          Int64?                     String?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────        
   1 │ bob      13000  sarah                  10000                    missing  missing
   2 │ bob      13000  bill                    5000                    missing  missing
   3 │ marge    10000  rhonda                 10000                    missing  missing
   4 │ joe      10000  harry                  10000                    missing  missing
   5 │ joe      10000  sally                   5000                    missing  missing
   6 │ marge    10000  mike                    5000                         10  surfing
   7 │ marge    10000  mike                    5000                         15  surfing
1 Like

I don’t want to veer too much into self-promotion, but since we didn’t have a working example previously, I did want to clarify what ExpandNestedData can do:

# same jdata string as rocco_sprmnt21
julia> using ExpandNesteData
julia> jsobj =  JSON3.read(jdata)
julia> ExpandNestedData.expand(jsobj) |> DataFrame
7×6 DataFrame
 Row │ friends_hobbies_name  friends_salary  name    salary  friends_hobbies_frequency  friends_name 
     │ String?               Int64           String  Int64   Union{Missing, Int64}      String       
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ missing                        10000  bob      13000                    missing  sarah        
   2 │ missing                         5000  bob      13000                    missing  bill
   3 │ missing                        10000  marge    10000                    missing  rhonda       
   4 │ surfing                         5000  marge    10000                         10  mike
   5 │ surfing                         5000  marge    10000                         15  mike
   6 │ missing                        10000  joe      10000                    missing  harry
   7 │ missing                         5000  joe      10000                    missing  sally

Or we can specify what paths we want:

julia> column_set = [
           ExpandNestedData.ColumnDefinition([:name]),
           ExpandNestedData.ColumnDefinition([:salary]),
       ]
       Main.ExpandNestedData.expand(jsobj,column_set) |> DataFrame
3×2 DataFrame
 Row │ name    salary 
     │ String  Int64  
─────┼────────────────
   1 │ bob      13000
   2 │ marge    10000
   3 │ joe      10000

Having done no profiling, it is almost 100% certain that @rocco_sprmnt21’s solution will outperform ExpandNestedData, but for adhoc scripts or things where the JSON isn’t too big, it’s much easier that writing a custom function each time. Plus it works for all sorts of nested things: structs, XMLDict, etc. So if you are unfamiliar with the object topology, it makes it very easy to get it into a table :slight_smile:

Edit: just realized that @rocco_sprmnt21 's generic too. I guess we’ll have to profile them now :stuck_out_tongue_winking_eye:

I couldn’t install the package so I decided to try to write something I’ve been thinking about for some time.
I tested it very little. I don’t know how general this is, but it should work for all “table” forms

Edit: this performs better by avoiding some allocations

function expand1(ndf, col)
    df2nt(df)=(;zip(Symbol.(names(df)),eachcol(df))...)
    rexp=findall(er-> er[col] isa DataFrame, eachrow(ndf))
    foreach(subdf->rename!(n->string(col,'.')*n  ,subdf), ndf[rexp,col])
    rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
    edf=hcat(ndf[rexp,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

this should avoid mutations to the original dataframe (sub-dataframe names)

function expand2(ndf, col)
    df2nt(sdf)=(;zip(Symbol.(string.(col,'.',names(sdf))),eachcol(sdf))...)
    rexp=findall(er-> er[col] isa DataFrame, eachrow(ndf))
    rt=Tables.dictrowtable(vcat(df2nt.(ndf[rexp,col])...))
    edf=hcat(ndf[rexp,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

nilshg,

I’m refering to YFinance, which is a relatively new package

function's methods to read json file in nested DataFrame

 function allflatnt(vnt::Vector{Dict{Symbol, Any}})
           for i in eachindex(vnt)
               for k in keys(vnt[i])
                   v=vnt[i][k]
                   if v isa Vector{<:Dict} || v isa Dict
                       return false
                   end
               end
           end
           true
       end


       function allflatnt(d::Dict{Symbol, Any})
            for k in keys(d)
                if d[k] isa Vector{<:Dict} || d[k] isa Dict
                    return false
                end
            end

        true
    end




function nestdf(vd::Vector{Dict{Symbol, Any}})
 for i in eachindex(vd)
     for (k,v) in vd[i]
         if v isa Vector{<:Dict} || v isa Dict
             if allflatnt(v)
                tdf=DataFrame(Tables.dictrowtable(v))
                if names(tdf)==["first","second"]
                    rename!(tdf, [:first,:second].=>[:key,:value])
                end
                 vd[i]=merge(vd[i], Dict(k=>tdf))
             else
                 vd[i]=merge(vd[i], Dict(k=>nestdf(v)))
             end
         end
     end
 end
 DataFrame(Tables.dictrowtable(vd))
end


function nestdf(d::Dict{Symbol, Any})
    for (k,v) in d
        if v isa Vector{<:Dict} || v isa Dict
            if allflatnt(v)
                tdf=DataFrame(Tables.dictrowtable(v))
                if names(tdf)==["first","second"]
                    rename!(tdf, [:first,:second].=>[:key,:value])
                end
                d=merge(d, Dict(k=>tdf))
            else
                d=merge(d, Dict(k=>nestdf(v)))
            end
        end
    end
    rename!(DataFrame(Tables.dictrowtable(d)), [:first,:second].=>[:key,:value])
end




nestdf(jsobj::JSON3.Object)=nestdf(copy(jsobj))
nestdf(jsobj::JSON3.Array)=nestdf(copy(jsobj))


I adapted the previous scripts to be able to handle nested structures of JSON.Array and JSON.Object.
I only did some tests on the data corresponding to the “AAPL” ticker

using YFinance, JSON3, DataFrames

aapl_json=get_quoteSummary("AAPL")

Below are some possible “views” with the structure obtained by applying the nestdf() function

julia> ndf=nestdf(aapl_json)
31×2 DataFrame
 Row │ key                                value
     │ Symbol                             DataFrame      
─────┼───────────────────────────────────────────────────
   1 │ cashflowStatementHistoryQuarterly  2×2 DataFrame  
   2 │ industryTrend                      3×2 DataFrame  
   3 │ earningsTrend                      2×2 DataFrame  
   4 │ incomeStatementHistory             2×2 DataFrame  
   5 │ price                              31×2 DataFrame 
   6 │ upgradeDowngradeHistory            2×2 DataFrame  
   7 │ institutionOwnership               2×2 DataFrame  
   8 │ fundOwnership                      2×2 DataFrame
   9 │ summaryDetail                      42×2 DataFrame
  10 │ netSharePurchaseActivity           12×2 DataFrame
  11 │ earnings                           4×2 DataFrame
  12 │ insiderHolders                     2×2 DataFrame
  13 │ calendarEvents                     4×2 DataFrame
  14 │ summaryProfile                     13×2 DataFrame
  15 │ majorDirectHolders                 2×2 DataFrame
  16 │ assetProfile                       20×2 DataFrame
  17 │ recommendationTrend                2×2 DataFrame
  18 │ balanceSheetHistory                2×2 DataFrame
  19 │ financialData                      30×2 DataFrame
  20 │ indexTrend                         5×2 DataFrame
  21 │ balanceSheetHistoryQuarterly       2×2 DataFrame
  22 │ insiderTransactions                2×2 DataFrame
  23 │ cashflowStatementHistory           2×2 DataFrame
  24 │ earningsHistory                    2×2 DataFrame
  25 │ quoteType                          13×2 DataFrame
  26 │ sectorTrend                        3×2 DataFrame
  27 │ esgScores                          36×2 DataFrame
  28 │ secFilings                         2×2 DataFrame
  29 │ defaultKeyStatistics               39×2 DataFrame
  30 │ majorHoldersBreakdown              5×2 DataFrame
  31 │ incomeStatementHistoryQuarterly    2×2 DataFrame

julia> ndf.value[16]
20×2 DataFrame
 Row │ key                        value
     │ Symbol                     Any
─────┼──────────────────────────────────────────────────────────────
   1 │ auditRisk                  4
   2 │ country                    United States
   3 │ overallRisk                1
   4 │ state                      CA
   5 │ phone                      408 996 1010
   6 │ longBusinessSummary        Apple Inc. designs, manufactures…
   7 │ city                       Cupertino
   8 │ zip                        95014
   9 │ industry                   Consumer Electronics
  10 │ address1                   One Apple Park Way
  11 │ shareHolderRightsRisk      1
  12 │ governanceEpochDate        1680307200
  13 │ maxAge                     86400
  14 │ compensationRisk           5
  15 │ compensationAsOfEpochDate  1672444800
  16 │ boardRisk                  1
  17 │ sector                     Technology
  18 │ fullTimeEmployees          164000
  19 │ website                    https://www.apple.com
  20 │ companyOfficers            10×9 DataFrame

julia> ndf.value[16].value[20]
10×9 DataFrame
 Row │ fiscalYear  unexercisedValue  age      name                     title               ⋯
     │ Int64?      DataFrame         Int64?   String                   String              ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │       2022  3×2 DataFrame          61  Mr. Timothy D. Cook      CEO & Director      ⋯
   2 │       2022  3×2 DataFrame          59  Mr. Luca  Maestri        CFO & Sr. VP
   3 │       2022  3×2 DataFrame          58  Mr. Jeffrey E. Williams  Chief Operating Off  
   4 │       2022  3×2 DataFrame          58  Ms. Katherine L. Adams   Sr. VP, Gen. Counse  
   5 │       2022  3×2 DataFrame          55  Ms. Deirdre  O'Brien     Sr. VP of Retail    ⋯
   6 │    missing  3×2 DataFrame     missing  Mr. Chris  Kondo         Sr. Director of Cor  
   7 │    missing  3×2 DataFrame     missing  Mr. James  Wilson        Chief Technology Of  
   8 │    missing  3×2 DataFrame     missing  Ms. Mary  Demby          Chief Information O  
   9 │    missing  3×2 DataFrame     missing  Ms. Nancy  Paxton        Sr. Director of Inv ⋯
  10 │    missing  3×2 DataFrame     missing  Mr. Greg  Joswiak        Sr. VP of Worldwide  
                                                                           5 columns omitted


julia> ndf.value[16].value[20].totalPay[1:3]
3-element Vector{Union{Missing, DataFrame}}:
 3×2 DataFrame
 Row │ key      value      
     │ Symbol   Union…     
─────┼─────────────────────
   1 │ fmt      16.43M
   2 │ longFmt  16,425,933
   3 │ raw      16425933
 3×2 DataFrame
 Row │ key      value     
     │ Symbol   Union…    
─────┼────────────────────
   1 │ fmt      5.02M
   2 │ longFmt  5,019,783
   3 │ raw      5019783
 3×2 DataFrame
 Row │ key      value     
     │ Symbol   Union…    
─────┼────────────────────
   1 │ fmt      5.02M
   2 │ longFmt  5,018,337
   3 │ raw      5018337



some refactoring
allflatnt(vd::Vector{Dict{Symbol, Any}})= all(allflatnt, vd)


allflatnt(d::Dict{Symbol, Any})=all(e->!(last(e) isa Vector{<:Dict} || last(e) isa Dict),d)


function nestdf(vd::Vector{Dict{Symbol, Any}})
  DataFrame(Tables.dictrowtable(nestdict.(vd)))
end

function nestdf(d::Dict{Symbol, Any})
    tdf=DataFrame(Tables.dictrowtable(nestdict(d)))
    if names(tdf)==["first","second"]
        rename!(tdf, [:first,:second].=>[:key,:value])
    end
    #tdf
end

function nestdict(d::Dict{Symbol, Any})
    for (k,v) in d
       if v isa Vector{<:Dict} || v isa Dict
            if allflatnt(v)
                tdf=DataFrame(Tables.dictrowtable(v))
                if names(tdf)==["first","second"]
                    rename!(tdf, [:first,:second].=>[:key,:value])
                end
                d=merge(d, Dict(k=>tdf))
            else
                d=merge(d, Dict(k=>nestdf(v)))
            end
        end
    end
    d 
end



nestdf(jsobj::JSON3.Object)=nestdf(copy(jsobj))
nestdf(jsobj::JSON3.Array)=nestdf(copy(jsobj))



using YFinance, JSON3, DataFrames

aapl_json=get_quoteSummary("AAPL")

ndf=nestdf(aapl_json)
1 Like