Flattening YFinance.jl JSON result into a DataFrame

I have a DataFrame similar to this one:

df4 = DataFrame(a=[Dict([:aa=>1, :bb=>2]), Dict([:aa=>3, :bb=>4])], b=["one", "two"])

… and I want the Dictionaries to be flattened wide, that is, I want that the DataFramehas two new columns (:aa and :bb) and eventually I want to remove column :a.

I am lost. I have checked that the flatten function helps making the DataFrame longer (that is, higher), but I want it wider.

I suppose that I can use the map function, along with a for loop to solve the problem.

But I am interested in a more elegant solution.

Any ideas?

Just to clarify: is it correct that you want the following table as the output?

aa | bb |   b
---+----+-------
 1 |  2 | "one"
 2 |  3 | "two"

Not exactly, the desired format is …

aa | bb |   b
---+----+-------
 1 |  2 | "one"
 3 |  4 | "two"

I am sorry. Yes, that’s what I meant.

I am not sure if this is the most elegant way, but you could combine the dictionaries first, then add b the combined dictionary, and then use the final dictionary to initialize the dataframe.

julia> d1 = Dict([:aa=>1, :bb=>2])
Dict{Symbol, Int64} with 2 entries:
  :aa => 1
  :bb => 2

julia> d2 = Dict([:aa=>3, :bb=>4])
Dict{Symbol, Int64} with 2 entries:
  :aa => 3
  :bb => 4

julia> dicts = [d1, d2]
2-element Vector{Dict{Symbol, Int64}}:
 Dict(:aa => 1, :bb => 2)
 Dict(:aa => 3, :bb => 4)

# The type `Dict{Symbol, Any}` is required, 
# otherwise you won't be able to insert the column `b` with string values
julia> merged = Dict{Symbol, Any}(k => getindex.([d1, d2], k) for k in keys(d1))
Dict{Symbol, Any} with 2 entries:
  :aa => [1, 3]
  :bb => [2, 4]

julia> merged[:b] = ["one", "two"]
2-element Vector{String}:
 "one"
 "two"

julia> DataFrame(merged)
2Γ—3 DataFrame
 Row β”‚ aa     b       bb
     β”‚ Int64  String  Int64
─────┼──────────────────────
   1 β”‚     1  one         2
   2 β”‚     3  two         4

A slightly shorter version:

julia> [DataFrame(df4.a) df4[:, [:b]]]
2Γ—3 DataFrame
 Row β”‚ aa     bb     b
     β”‚ Int64  Int64  String
─────┼──────────────────────
   1 β”‚     1      2  one
   2 β”‚     3      4  two

DataFrame already has a vector-of-dicts constructor, so I’m just using that and then hcat on the other column.

4 Likes

barucden,

Your solution seems to work, but it is a little more complex of what I expected

Thank you

nilshg,

your solution is simple and elegant, thank you

my actual DataFrame comes from a JSON file. So, in its nature, it is recursive, that is, their columns may contain Dictionaries that contain other Dictionaries inside them.

Can your solution be applied recursively, that is, without konwing in advance the number of levels of dictionaries it contains?

Not in general, no. If you want to read JSON data into DataFrames maybe use the JSONTables package which has this as its intended use case?

2 Likes

Are you sure you’ll always be able to coerce the data into a tabular format? If so, the JSONTables.jl suggestion should be fine. Otherwise, you might consider just using JSON.jl or JSON3.jl to read & manipulate the data. If the structure of the data is always the same, and cannot be nicely represented in tabular format, you might also consider reading it into a custom struct. Sometimes, this makes it easier to write functions that do the subsequent transformations/manipulations/etc.

1 Like
julia> transform(df4,:a=>AsTable)
2Γ—4 DataFrame
 Row β”‚ a                     b       aa     bb    
     β”‚ Dict…                 String  Int64  Int64
─────┼────────────────────────────────────────────
   1 β”‚ Dict(:aa=>1, :bb=>2)  one         1      2
   2 β”‚ Dict(:aa=>3, :bb=>4)  two         3      4
2 Likes

The data is stock market data, from Yahoo.

I’m using the YFinance package.

Data isread in JSON format. Sometimes, it is easy to convert it to DataFrame, some other times it is not.

Since data comes from Yahho, theoretically it always follows the same format.

However, sometimes the available information may be incomple. That is why it cannot be assured that the structure of the data will always be the same. I’ll depent on what’s available for each ticker.

The YFinance.jl docs have lots of examples of fetching results and converting to DataFrame.

https://eohne.github.io/YFinance.jl/dev/

Also, thanks for informing me about this package. I only knew of a much older and unmaintained package that did this. Glad someone updated it.

3 Likes

If the JSON struct is nested, you can use

mrufsvold/ExpandNestedData.jl (github.com)

I haven’t registered it yet because I’m still honing a couple pieces of it, but it should be good to go for most applications.

1 Like

Are you referring to the really old Yahoo Finance package or MarketData?

Another recent alternative is

1 Like

Yea that one, although it looks like it has seen some recent activity as well.

YFinance has lots of information, including free fundamental data coming from Yahoo.
MarketData has only historical prices. Therefore, it might be a good tool to use.

2 Likes

Sorry this is a bit off topic.
YFinance guy here. Which data are you trying to get into what format? Could be a general improvement to the package…

If you want to do some tests, I submit a script that recursively reads a json file that contains multi-level nested tables.
The result is a dataframe containing in some cells a dataframe as value.
I add a function for selective unpacking of the columns of interest.
PS. I didn’t test on files other than the attached one that I found on a web page.

julia> using JSON3

julia> using DataFrames

the json string
julia> jdata = """[
           {
               "name":"bob",
               "salary":13000,
               "friends":[
                   {
                       "name": "sarah",
                       "salary":10000
                   },
                   {
                       "name": "bill",
                       "salary":5000
                   }
               ]
           },
           {
               "name":"marge",
               "salary":10000,
               "friends":[
                   {
                       "name": "rhonda",
                       "salary":10000
                   },
                   {
                       "name": "mike",
                       "salary":5000,
                       "hobbies":[
                           {
                               "name":"surfing",
                               "frequency":10
                           },
                           {
                               "name":"surfing",
                               "frequency":15
                           }
                       ]
                   }
               ]
           },
           {
               "name":"joe",
               "salary":10000,
               "friends":[
                   {
                       "name": "harry",
                       "salary":10000
                   },
                   {
                       "name": "sally",
                       "salary":5000
                   }
               ]
           }
       ]"""
"[\n    {\n        \"name\":\"bob\",\n        \"salary\":13000,\n        \"friends\":[\n            {\n                \"name\": \"sarah\",\n                \"salary\":10000\n            },\n            {\n                \"name\": \"bill\",\n  " β‹― 691 bytes β‹― " \"friends\":[\n            {\n                \"name\": \"harry\",\n                \"salary\":10000\n            },\

julia> jsobj =  JSON3.read(jdata);
the function to read nested json files to dataframe
julia> function allflatnt(vnt)
           for i in eachindex(vnt)
               for k in keys(vnt[i])
                   v=vnt[i][k]
                   if v isa Vector{<:Dict}
                       return false
                   end
               end
           end
           true
       end
allflatnt (generic function with 1 method)

julia> function nestdf(ant)
           ANT=copy(ant)
           for i in eachindex(ANT)
               for (k,v) in ANT[i]
                   if v isa Vector{<:Dict}
                       if allflatnt(v)
                           ANT[i]=merge(ANT[i], Dict(k=>DataFrame(Tables.dictrowtable(v))))    
                       else
                           ANT[i]=merge(ANT[i], Dict(k=>nestdf(v)))
                       end
                   end
               end
           end
           DataFrame(Tables.dictrowtable(ANT))
       end
nestdf (generic function with 1 method)

julia> ndf=nestdf(jsobj)
3Γ—3 DataFrame
 Row β”‚ name    salary  friends       
     β”‚ String  Int64   DataFrame
─────┼───────────────────────────────
   1 β”‚ bob      13000  2Γ—2 DataFrame
   2 β”‚ marge    10000  2Γ—3 DataFrame
   3 β”‚ joe      10000  2Γ—2 DataFrame

the expansion function

function expand(ndf, col)
    df2nt(df)=(;zip(Symbol.(names(df)),eachcol(df))...)
    rexp=filter(er-> er[col] isa DataFrame, ndf)
    foreach(subdf->rename!(n->string(col,'.')*n  ,subdf), rexp[:,col])
    rt=Tables.dictrowtable(vcat(df2nt.(rexp[:,col])...))
    edf=hcat(rexp[:,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

julia> endf=expand(ndf,:friends)
6Γ—5 DataFrame
 Row β”‚ name    salary  friends.name  friends.salary  friends.hobbies 
     β”‚ String  Int64   String?       Int64?          Any
─────┼───────────────────────────────────────────────────────────────
   1 β”‚ bob      13000  sarah                  10000  missing
   2 β”‚ bob      13000  bill                    5000  missing
   3 β”‚ marge    10000  rhonda                 10000  missing
   4 β”‚ marge    10000  mike                    5000  2Γ—2 DataFrame
   5 β”‚ joe      10000  harry                  10000  missing
   6 β”‚ joe      10000  sally                   5000  missing



julia> expand(endf, Symbol("friends.hobbies"))
7Γ—6 DataFrame
 Row β”‚ name    salary  friends.name  friends.salary  friends.hobbies.frequency  friends.hobbies.name         
     β”‚ String  Int64   String?       Int64?          Int64?                     String?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────        
   1 β”‚ bob      13000  sarah                  10000                    missing  missing
   2 β”‚ bob      13000  bill                    5000                    missing  missing
   3 β”‚ marge    10000  rhonda                 10000                    missing  missing
   4 β”‚ joe      10000  harry                  10000                    missing  missing
   5 β”‚ joe      10000  sally                   5000                    missing  missing
   6 β”‚ marge    10000  mike                    5000                         10  surfing
   7 β”‚ marge    10000  mike                    5000                         15  surfing
1 Like

I don’t want to veer too much into self-promotion, but since we didn’t have a working example previously, I did want to clarify what ExpandNestedData can do:

# same jdata string as rocco_sprmnt21
julia> using ExpandNesteData
julia> jsobj =  JSON3.read(jdata)
julia> ExpandNestedData.expand(jsobj) |> DataFrame
7Γ—6 DataFrame
 Row β”‚ friends_hobbies_name  friends_salary  name    salary  friends_hobbies_frequency  friends_name 
     β”‚ String?               Int64           String  Int64   Union{Missing, Int64}      String       
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚ missing                        10000  bob      13000                    missing  sarah        
   2 β”‚ missing                         5000  bob      13000                    missing  bill
   3 β”‚ missing                        10000  marge    10000                    missing  rhonda       
   4 β”‚ surfing                         5000  marge    10000                         10  mike
   5 β”‚ surfing                         5000  marge    10000                         15  mike
   6 β”‚ missing                        10000  joe      10000                    missing  harry
   7 β”‚ missing                         5000  joe      10000                    missing  sally

Or we can specify what paths we want:

julia> column_set = [
           ExpandNestedData.ColumnDefinition([:name]),
           ExpandNestedData.ColumnDefinition([:salary]),
       ]
       Main.ExpandNestedData.expand(jsobj,column_set) |> DataFrame
3Γ—2 DataFrame
 Row β”‚ name    salary 
     β”‚ String  Int64  
─────┼────────────────
   1 β”‚ bob      13000
   2 β”‚ marge    10000
   3 β”‚ joe      10000

Having done no profiling, it is almost 100% certain that @rocco_sprmnt21’s solution will outperform ExpandNestedData, but for adhoc scripts or things where the JSON isn’t too big, it’s much easier that writing a custom function each time. Plus it works for all sorts of nested things: structs, XMLDict, etc. So if you are unfamiliar with the object topology, it makes it very easy to get it into a table :slight_smile:

Edit: just realized that @rocco_sprmnt21 's generic too. I guess we’ll have to profile them now :stuck_out_tongue_winking_eye: