About DataFrame(array_of_dict)

using AlphaVantage, DataFrames
apikey_AV = AlphaVantage.global_key!("5J16OULI0KY026M4");
tickerList = ["AAPL","MSFT"];
StockHistoryRawJson = AlphaVantage.time_series_daily_adjusted.(tickerList, outputsize="compact", datatype="json")


DataFrame([StockHistoryRawJson[i]["Meta Data"] for i in 1:2]) # does not give the expected result
names(DataFrame([StockHistoryRawJson[i]["Meta Data"] for i in 1:2])) # extra columns appear

A way to get around the “obstacle”:

vcat(DataFrame.([StockHistoryRawJson[i]["Meta Data"] for i in 1:length(tickerList)])...)

Alternatively, transforming the dictionary keys into symbols gives the expected data frame:

dsym=[Dict([Symbol(k)=>v for (k,v) in StockHistoryRawJson[i]["Meta Data"]]) for i in 1:length(tickerList)]
DataFrame(dsym)

Can anyone tell me what is going on behind the scenes?
Why does DataFrame(…) “not work” when the dictionaries are Dict{String, Any}, and where do the extra columns come from?

The reason is that you are passing a Vector of “something” to a DataFrame constructor.

As explained in the docstring of DataFrame (I am quoting only the relevant part):

If a single positional argument is passed to a DataFrame constructor then it is assumed to be of type that implements the Tables.jl interface using which the returned DataFrame is materialized.

So you get exactly the same as what you would get from Tables.jl directly:

julia> Tables.columns([StockHistoryRawJson[i]["Meta Data"] for i in 1:2])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 2 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

And the rule in Tables.jl is that if you pass it a vector of “something” then fields of this “something” are interpreted as columns of a table.
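As a small illustration of that rule, here is a sketch with a made-up struct (the type `Listing` and its field names are hypothetical, not part of AlphaVantage.jl). It assumes only that Tables.jl is installed; each field of the element type should show up as one column:

```julia
using Tables

# Hypothetical row type, defined only for this illustration.
struct Listing
    name::String
    zone::String
end

rows = [Listing("AAPL", "US/Eastern"), Listing("MSFT", "US/Eastern")]

# A plain Vector of structs falls back to "one column per field".
cols = Tables.columns(rows)

colnames = Tuple(Tables.columnnames(cols))   # names taken from the struct fields
namecol  = Tables.getcolumn(cols, :name)     # values of the `name` field, row by row
```

A `Dict{String, Any}` goes through the same fallback in the output above, which is why its internal fields, rather than its keys, become the columns.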

One special exception is dictionaries with Symbol keys, which get handled as you have observed. This special treatment is defined here: https://github.com/JuliaData/Tables.jl/blob/1f2395a68e02906134f7ffb7020944894c85a91c/src/Tables.jl#L138.

If you would like to have AbstractString be handled in the same way as Symbol please consider opening an issue in Tables.jl.
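In the meantime, the key conversion from the question can be wrapped in a small helper. This is only a sketch: `symbolize` is a name invented here, and the dictionaries below use hypothetical keys rather than the real AlphaVantage metadata keys.

```julia
using DataFrames

# Hypothetical metadata dictionaries; the key names are made up for illustration.
meta = [
    Dict{String, Any}("symbol" => "AAPL", "timezone" => "US/Eastern"),
    Dict{String, Any}("symbol" => "MSFT", "timezone" => "US/Eastern"),
]

# Convert every key to a Symbol, leaving the values untouched.
symbolize(d::AbstractDict) = Dict{Symbol, Any}(Symbol(k) => v for (k, v) in d)

# A Vector of Symbol-keyed dictionaries is read as one row per dictionary.
df = DataFrame(symbolize.(meta))
```

The column order follows the iteration order of the first dictionary, so it is safer to refer to columns by name than by position.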


Thank you very much for the references you have given me: I have so much to “study”.
In the meantime I did some (almost) random experiments which do not necessarily have to do with the dataframe and tables packages …

I saw that:

Tables.columns(df.meta[1])
ERROR: to treat Dict{String, Any} as a table, it must have a key type of `Symbol`, and a value type `<: AbstractVector`

while:

Tables.columns([df.meta[1]])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 1 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})  
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})      
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

gives the result you showed before.
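The error message for the single dictionary hints at the other reading Tables.jl has for a dictionary: a bare `Dict` is treated as a whole table of columns, not as a row. A minimal sketch of an input that satisfies the condition in the message (Symbol keys, vector values), using made-up column names:

```julia
using Tables

# A dictionary used as a column table: Symbol keys, one vector per column.
# The column names are hypothetical.
coltable = Dict(
    :ticker => ["AAPL", "MSFT"],
    :zone   => ["US/Eastern", "US/Eastern"],
)

cols = Tables.columns(coltable)

tickers = Tables.getcolumn(cols, :ticker)   # the vector stored under :ticker
```

So `Tables.columns(d)` asks whether `d` is a table of columns, while `Tables.columns([d])` treats `d` as a single row of a one-row table.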

I tried the propertynames and getfield functions on this dictionary, df.meta[1], and got this:

julia> propertynames(df.meta[1])
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

getfield(df.meta[1],7) 
getfield(df.meta[1],2)
getfield(df.meta[1],3)

or

julia> df.meta[1].vals
16-element Vector{Any}:
    "Daily Time Series with Splits and Dividend Events"
    "2021-07-23"
 #undef
 #undef
 #undef
 #undef
    "AAPL"
 #undef
 #undef
    "Compact"
    "US/Eastern"
 #undef
 #undef
 #undef
 #undef
 #undef

Not knowing how things work behind the scenes, I thought these extra fields might depend on how the AlphaVantage.time_series_daily_adjusted function builds its dictionaries.
To dispel the doubt I ran a test with some dictionaries I defined myself, and got a similar result:

julia> vd=Dict{String, Any}("uno" => "1")
Dict{String, Any} with 1 entry:
  "uno" => "1"
julia> propertynames(vd)
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

julia> Tables.columns([vd])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 1 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

The same happens with Symbol keys:

julia> vd=Dict{Symbol, Any}(:uno => "1")
Dict{Symbol, Any} with 1 entry:
  :uno => "1"

julia> propertynames(vd)
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

Does this mean that, “internally” (some of you will understand better than I do what this term means in this case :-), dictionaries have a structure that uses meta-information in addition to keys and values?

Yes, dictionaries are highly optimized and contain some internal fields to help performance.
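To see the difference between those internals and the public interface without any packages, something like the following sketch can be run in a plain Julia session. The exact set of internal fields is an implementation detail of Base and may differ between Julia versions:

```julia
d = Dict{String, Any}("uno" => "1")

# Internal storage fields of the Dict type (implementation detail of Base).
internal = fieldnames(typeof(d))

# Public interface: only the entries that were actually inserted.
public_keys   = collect(keys(d))
public_values = collect(values(d))
```

The internal `keys` and `vals` arrays are preallocated hash-table storage, which is why `df.meta[1].vals` above shows `#undef` slots next to the real values.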

It’s definitely a bit confusing that sometimes calling DataFrame on a vector of objects will expose a new user to all these details. But consistency with Tables.jl is a real strong point.


Actually, I think the best thing to do is to open an issue in AlphaVantage.jl and ask whether they could provide the data in a format that is compatible with a Tables.jl table.
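One such format is a vector of NamedTuples, which Tables.jl accepts as a row table. The sketch below is only an illustration of the idea, not what AlphaVantage.jl does: `to_row` is a helper invented here and the dictionary keys are hypothetical. Keys are sorted so that every row ends up with the same field order.

```julia
using Tables

# Hypothetical metadata dictionaries; the key names are made up for illustration.
meta = [
    Dict{String, Any}("symbol" => "AAPL", "timezone" => "US/Eastern"),
    Dict{String, Any}("symbol" => "MSFT", "timezone" => "US/Eastern"),
]

# Turn one string-keyed dictionary into a NamedTuple with sorted field names.
function to_row(d::AbstractDict)
    ks = sort!(collect(keys(d)))
    return NamedTuple{Tuple(Symbol.(ks))}(Tuple(d[k] for k in ks))
end

rows = [to_row(d) for d in meta]

# A Vector of NamedTuples is a valid Tables.jl table, so it can be
# materialized column by column or passed to a sink such as DataFrame.
ct = Tables.columntable(rows)
```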

I already did this last week!
