Flattening YFinance.jl JSON result into a DataFrame

vsoler · March 20, 2023, 9:06am

I have a DataFrame similar to this one:

df4 = DataFrame(a=[Dict([:aa=>1, :bb=>2]), Dict([:aa=>3, :bb=>4])], b=["one", "two"])

… and I want the Dictionaries to be flattened wide, that is, I want that the DataFramehas two new columns (:aa and :bb) and eventually I want to remove column :a.

I am lost. I have checked that the flatten function helps making the DataFrame longer (that is, higher), but I want it wider.

I suppose that I can use the map function, along with a for loop to solve the problem.

But I am interested in a more elegant solution.

Any ideas?

barucden · March 20, 2023, 9:16am

Just to clarify: is it correct that you want the following table as the output?

aa | bb |   b
---+----+-------
 1 |  2 | "one"
 2 |  3 | "two"

vsoler · March 20, 2023, 9:26am

Not exactly, the desired format is …

aa | bb |   b
---+----+-------
 1 |  2 | "one"
 3 |  4 | "two"

barucden · March 20, 2023, 9:31am

I am sorry. Yes, that’s what I meant.

I am not sure if this is the most elegant way, but you could combine the dictionaries first, then add b the combined dictionary, and then use the final dictionary to initialize the dataframe.

julia> d1 = Dict([:aa=>1, :bb=>2])
Dict{Symbol, Int64} with 2 entries:
  :aa => 1
  :bb => 2

julia> d2 = Dict([:aa=>3, :bb=>4])
Dict{Symbol, Int64} with 2 entries:
  :aa => 3
  :bb => 4

julia> dicts = [d1, d2]
2-element Vector{Dict{Symbol, Int64}}:
 Dict(:aa => 1, :bb => 2)
 Dict(:aa => 3, :bb => 4)

# The type `Dict{Symbol, Any}` is required, 
# otherwise you won't be able to insert the column `b` with string values
julia> merged = Dict{Symbol, Any}(k => getindex.([d1, d2], k) for k in keys(d1))
Dict{Symbol, Any} with 2 entries:
  :aa => [1, 3]
  :bb => [2, 4]

julia> merged[:b] = ["one", "two"]
2-element Vector{String}:
 "one"
 "two"

julia> DataFrame(merged)
2×3 DataFrame
 Row │ aa     b       bb
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  one         2
   2 │     3  two         4

nilshg · March 20, 2023, 9:50am

A slightly shorter version:

julia> [DataFrame(df4.a) df4[:, [:b]]]
2×3 DataFrame
 Row │ aa     bb     b
     │ Int64  Int64  String
─────┼──────────────────────
   1 │     1      2  one
   2 │     3      4  two

DataFrame already has a vector-of-dicts constructor, so I’m just using that and then hcat on the other column.

vsoler · March 20, 2023, 10:12am

barucden,

Your solution seems to work, but it is a little more complex of what I expected

Thank you

vsoler · March 20, 2023, 10:15am

nilshg,

your solution is simple and elegant, thank you

vsoler · March 20, 2023, 10:35am

my actual DataFrame comes from a JSON file. So, in its nature, it is recursive, that is, their columns may contain Dictionaries that contain other Dictionaries inside them.

Can your solution be applied recursively, that is, without konwing in advance the number of levels of dictionaries it contains?

nilshg · March 20, 2023, 11:03am

Not in general, no. If you want to read JSON data into DataFrames maybe use the JSONTables package which has this as its intended use case?

mthelm85 · March 20, 2023, 12:47pm

Are you sure you’ll always be able to coerce the data into a tabular format? If so, the JSONTables.jl suggestion should be fine. Otherwise, you might consider just using JSON.jl or JSON3.jl to read & manipulate the data. If the structure of the data is always the same, and cannot be nicely represented in tabular format, you might also consider reading it into a custom struct. Sometimes, this makes it easier to write functions that do the subsequent transformations/manipulations/etc.

rocco_sprmnt21 · March 20, 2023, 2:14pm

julia> transform(df4,:a=>AsTable)
2×4 DataFrame
 Row │ a                     b       aa     bb    
     │ Dict…                 String  Int64  Int64
─────┼────────────────────────────────────────────
   1 │ Dict(:aa=>1, :bb=>2)  one         1      2
   2 │ Dict(:aa=>3, :bb=>4)  two         3      4

vsoler · March 20, 2023, 3:27pm

The data is stock market data, from Yahoo.

I’m using the YFinance package.

Data isread in JSON format. Sometimes, it is easy to convert it to DataFrame, some other times it is not.

Since data comes from Yahho, theoretically it always follows the same format.

However, sometimes the available information may be incomple. That is why it cannot be assured that the structure of the data will always be the same. I’ll depent on what’s available for each ticker.

tbeason · March 20, 2023, 3:30pm

The YFinance.jl docs have lots of examples of fetching results and converting to DataFrame.

https://eohne.github.io/YFinance.jl/dev/

Also, thanks for informing me about this package. I only knew of a much older and unmaintained package that did this. Glad someone updated it.

mrufsvold · March 20, 2023, 3:41pm

If the JSON struct is nested, you can use

mrufsvold/ExpandNestedData.jl (github.com)

I haven’t registered it yet because I’m still honing a couple pieces of it, but it should be good to go for most applications.

nilshg · March 20, 2023, 3:44pm

Are you referring to the really old Yahoo Finance package or MarketData?

Another recent alternative is

tbeason · March 20, 2023, 3:47pm

Yea that one, although it looks like it has seen some recent activity as well.

vsoler · March 20, 2023, 9:14pm

YFinance has lots of information, including free fundamental data coming from Yahoo.
MarketData has only historical prices. Therefore, it might be a good tool to use.

EOhneberg · March 21, 2023, 11:46pm

Sorry this is a bit off topic.
YFinance guy here. Which data are you trying to get into what format? Could be a general improvement to the package…

rocco_sprmnt21 · March 25, 2023, 11:19am

If you want to do some tests, I submit a script that recursively reads a json file that contains multi-level nested tables.
The result is a dataframe containing in some cells a dataframe as value.
I add a function for selective unpacking of the columns of interest.
PS. I didn’t test on files other than the attached one that I found on a web page.

julia> using JSON3

julia> using DataFrames

the json string

julia> jdata = """[
           {
               "name":"bob",
               "salary":13000,
               "friends":[
                   {
                       "name": "sarah",
                       "salary":10000
                   },
                   {
                       "name": "bill",
                       "salary":5000
                   }
               ]
           },
           {
               "name":"marge",
               "salary":10000,
               "friends":[
                   {
                       "name": "rhonda",
                       "salary":10000
                   },
                   {
                       "name": "mike",
                       "salary":5000,
                       "hobbies":[
                           {
                               "name":"surfing",
                               "frequency":10
                           },
                           {
                               "name":"surfing",
                               "frequency":15
                           }
                       ]
                   }
               ]
           },
           {
               "name":"joe",
               "salary":10000,
               "friends":[
                   {
                       "name": "harry",
                       "salary":10000
                   },
                   {
                       "name": "sally",
                       "salary":5000
                   }
               ]
           }
       ]"""
"[\n    {\n        \"name\":\"bob\",\n        \"salary\":13000,\n        \"friends\":[\n            {\n                \"name\": \"sarah\",\n                \"salary\":10000\n            },\n            {\n                \"name\": \"bill\",\n  " ⋯ 691 bytes ⋯ " \"friends\":[\n            {\n                \"name\": \"harry\",\n                \"salary\":10000\n            },\

julia> jsobj =  JSON3.read(jdata);

the function to read nested json files to dataframe

julia> function allflatnt(vnt)
           for i in eachindex(vnt)
               for k in keys(vnt[i])
                   v=vnt[i][k]
                   if v isa Vector{<:Dict}
                       return false
                   end
               end
           end
           true
       end
allflatnt (generic function with 1 method)

julia> function nestdf(ant)
           ANT=copy(ant)
           for i in eachindex(ANT)
               for (k,v) in ANT[i]
                   if v isa Vector{<:Dict}
                       if allflatnt(v)
                           ANT[i]=merge(ANT[i], Dict(k=>DataFrame(Tables.dictrowtable(v))))    
                       else
                           ANT[i]=merge(ANT[i], Dict(k=>nestdf(v)))
                       end
                   end
               end
           end
           DataFrame(Tables.dictrowtable(ANT))
       end
nestdf (generic function with 1 method)


julia> ndf=nestdf(jsobj)
3×3 DataFrame
 Row │ name    salary  friends       
     │ String  Int64   DataFrame
─────┼───────────────────────────────
   1 │ bob      13000  2×2 DataFrame
   2 │ marge    10000  2×3 DataFrame
   3 │ joe      10000  2×2 DataFrame

the expansion function


function expand(ndf, col)
    df2nt(df)=(;zip(Symbol.(names(df)),eachcol(df))...)
    rexp=filter(er-> er[col] isa DataFrame, ndf)
    foreach(subdf->rename!(n->string(col,'.')*n  ,subdf), rexp[:,col])
    rt=Tables.dictrowtable(vcat(df2nt.(rexp[:,col])...))
    edf=hcat(rexp[:,Not(col)],DataFrame(rt))
    iexp=findall.(>(1),map(r-> [er isa Vector ? length(er) : 1 for er in r],values.(eachrow(edf))))
    edfp=mapreduce(r->flatten(edf[r:r,:],iexp[r]), vcat,1: nrow(edf))
    nedfp=select(filter(er-> !isa(er[col] , DataFrame), ndf),Not(col))
    vcat(nedfp,edfp,cols=:union)
end

julia> endf=expand(ndf,:friends)
6×5 DataFrame
 Row │ name    salary  friends.name  friends.salary  friends.hobbies 
     │ String  Int64   String?       Int64?          Any
─────┼───────────────────────────────────────────────────────────────
   1 │ bob      13000  sarah                  10000  missing
   2 │ bob      13000  bill                    5000  missing
   3 │ marge    10000  rhonda                 10000  missing
   4 │ marge    10000  mike                    5000  2×2 DataFrame
   5 │ joe      10000  harry                  10000  missing
   6 │ joe      10000  sally                   5000  missing



julia> expand(endf, Symbol("friends.hobbies"))
7×6 DataFrame
 Row │ name    salary  friends.name  friends.salary  friends.hobbies.frequency  friends.hobbies.name         
     │ String  Int64   String?       Int64?          Int64?                     String?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────        
   1 │ bob      13000  sarah                  10000                    missing  missing
   2 │ bob      13000  bill                    5000                    missing  missing
   3 │ marge    10000  rhonda                 10000                    missing  missing
   4 │ joe      10000  harry                  10000                    missing  missing
   5 │ joe      10000  sally                   5000                    missing  missing
   6 │ marge    10000  mike                    5000                         10  surfing
   7 │ marge    10000  mike                    5000                         15  surfing

mrufsvold · March 26, 2023, 12:10am

I don’t want to veer too much into self-promotion, but since we didn’t have a working example previously, I did want to clarify what ExpandNestedData can do:

# same jdata string as rocco_sprmnt21
julia> using ExpandNesteData
julia> jsobj =  JSON3.read(jdata)
julia> ExpandNestedData.expand(jsobj) |> DataFrame
7×6 DataFrame
 Row │ friends_hobbies_name  friends_salary  name    salary  friends_hobbies_frequency  friends_name 
     │ String?               Int64           String  Int64   Union{Missing, Int64}      String       
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ missing                        10000  bob      13000                    missing  sarah        
   2 │ missing                         5000  bob      13000                    missing  bill
   3 │ missing                        10000  marge    10000                    missing  rhonda       
   4 │ surfing                         5000  marge    10000                         10  mike
   5 │ surfing                         5000  marge    10000                         15  mike
   6 │ missing                        10000  joe      10000                    missing  harry
   7 │ missing                         5000  joe      10000                    missing  sally

Or we can specify what paths we want:

julia> column_set = [
           ExpandNestedData.ColumnDefinition([:name]),
           ExpandNestedData.ColumnDefinition([:salary]),
       ]
       Main.ExpandNestedData.expand(jsobj,column_set) |> DataFrame
3×2 DataFrame
 Row │ name    salary 
     │ String  Int64  
─────┼────────────────
   1 │ bob      13000
   2 │ marge    10000
   3 │ joe      10000

Having done no profiling, it is almost 100% certain that @rocco_sprmnt21’s solution will outperform ExpandNestedData, but for adhoc scripts or things where the JSON isn’t too big, it’s much easier that writing a custom function each time. Plus it works for all sorts of nested things: structs, XMLDict, etc. So if you are unfamiliar with the object topology, it makes it very easy to get it into a table

Edit: just realized that @rocco_sprmnt21 's generic too. I guess we’ll have to profile them now

Topic		Replies	Views
Flatten dicts of dicts in DataFrame General Usage dataframes	16	236	February 19, 2025
Dash datatable using a nested dict anyone done it? New to Julia dataframes , dash	7	1306	June 15, 2022
[Pre-ANN/RFC] ExpandNestedData.jl (Previously Normalize.jl) Package Announcements	21	1483	December 7, 2022
Efficiently Read JSON and Create DataFrame Performance json , dataframes	23	7769	April 3, 2025
How to read Panda's DataFrames from json file? New to Julia dataframes	27	978	February 1, 2023

Flattening YFinance.jl JSON result into a DataFrame

Related topics