About DataFrame(array_of_dict)

using AlphaVantage, DataFrames
apikey_AV = AlphaVantage.global_key!("5J16OULI0KY026M4");
tickerList = ["AAPL","MSFT"];
StockHistoryRawJson = AlphaVantage.time_series_daily_adjusted.(tickerList, outputsize="compact", datatype="json")


DataFrame([StockHistoryRawJson[i]["Meta Data"] for i in 1:2]) # does not give the expected result
names(DataFrame([StockHistoryRawJson[i]["Meta Data"] for i in 1:2])) # extra columns appear

A way to get around the “obstacle”:

vcat(DataFrame.([StockHistoryRawJson[i]["Meta Data"] for i in 1:length(tickerList)])...)
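As far as I can tell, this works because DataFrame applied to a single dictionary uses a different constructor, one that reads the keys as column names, so each dictionary becomes a one-row data frame. Here is a minimal sketch of the same idea without the API call. The dictionaries are hand-written stand-ins (the keys `symbol` and `time_zone` and their values are made up for illustration), and reduce(vcat, ...) replaces the splatting:

```julia
using DataFrames

# Hypothetical stand-ins for the "Meta Data" dictionaries returned by the API.
meta = [
    Dict{String, Any}("symbol" => "AAPL", "time_zone" => "US/Eastern"),
    Dict{String, Any}("symbol" => "MSFT", "time_zone" => "US/Eastern"),
]

# Build a one-row DataFrame per dictionary, then stack them vertically.
# reduce(vcat, ...) avoids splatting a potentially long vector into vcat.
df = reduce(vcat, DataFrame.(meta))
```

With many tickers, reduce(vcat, ...) is usually preferable to vcat(...) with splatting, since it does not expand the whole vector into function arguments.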

Alternatively, transforming the dictionary keys into symbols gives the expected data frame directly:

dsym=[Dict([Symbol(k)=>v for (k,v) in StockHistoryRawJson[i]["Meta Data"]]) for i in 1:length(tickerList)]
DataFrame(dsym)
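For anyone who wants to reproduce this without an API key, here is a small sketch of the same Symbol-key conversion on hand-written dictionaries. The keys `symbol` and `time_zone` and their values are assumptions for illustration, not the real API response:

```julia
using DataFrames

# Hypothetical stand-ins for the "Meta Data" dictionaries returned by the API.
meta = [
    Dict{String, Any}("symbol" => "AAPL", "time_zone" => "US/Eastern"),
    Dict{String, Any}("symbol" => "MSFT", "time_zone" => "US/Eastern"),
]

# Rebuild each dictionary with Symbol keys so that every dictionary is read as one row.
symrows = [Dict(Symbol(k) => v for (k, v) in d) for d in meta]
df = DataFrame(symrows)
```

The column order follows the iteration order of the first dictionary, so it is safer to select columns by name than by position.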

Can anyone tell me what is going on behind the scenes?
Why does DataFrame(…) “not work” when the dictionaries are of type Dict{String, Any}, and where do the extra columns come from?

The reason is that you are passing a Vector of “something” to a DataFrame constructor.

As explained in the docstring of DataFrame (I am quoting only the relevant part):

If a single positional argument is passed to a DataFrame constructor then it is assumed to be of type that implements the Tables.jl (https://github.com/JuliaData/Tables.jl) interface using which the returned DataFrame is materialized.

So you get exactly the same as what you would get from Tables.jl directly:

julia> Tables.columns([StockHistoryRawJson[i]["Meta Data"] for i in 1:2])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 2 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

And the rule in Tables.jl is that if you pass it a vector of “something” then fields of this “something” are interpreted as columns of a table.
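To see that rule in isolation, here is a minimal sketch with a user-defined struct in place of the dictionary. The struct `MetaRow` and its values are hypothetical and only serve the illustration:

```julia
using Tables

# Hypothetical row type; its fields play the role of table columns.
struct MetaRow
    symbol::String
    time_zone::String
end

rows = [MetaRow("AAPL", "US/Eastern"), MetaRow("MSFT", "US/Eastern")]

# Tables.jl falls back to the properties of each element as column names.
cols = Tables.columns(rows)
colnames = collect(Tables.columnnames(cols))
symbols = Tables.getcolumn(cols, :symbol)
```

A Dict{String, Any} goes through the same fallback in the output above, so its own fields (slots, keys, vals, ...) end up as the columns instead of its entries.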

One of the special exceptions is dictionaries with Symbol keys, which are handled as you have observed. This special treatment is defined here: https://github.com/JuliaData/Tables.jl/blob/1f2395a68e02906134f7ffb7020944894c85a91c/src/Tables.jl#L138.

If you would like AbstractString keys to be handled in the same way as Symbol keys, please consider opening an issue in Tables.jl.

Thank you very much for the references you have given me: I have so much to “study”.
In the meantime I did some (almost) random experiments which do not necessarily have to do with the dataframe and tables packages …

I saw that:

Tables.columns(df.meta[1])
ERROR: to treat Dict{String, Any} as a table, it must have a key type of `Symbol`, and a value type `<: AbstractVector`

while:

Tables.columns([df.meta[1]])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 1 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})  
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})      
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

gives the result you showed before.

I tried using the getfield and propertynames functions on this dictionary df.meta[1], getting this:

julia> propertynames(df.meta[1])
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

getfield(df.meta[1],7) 
getfield(df.meta[1],2)
getfield(df.meta[1],3)

or

julia> df.meta[1].vals
16-element Vector{Any}:
    "Daily Time Series with Splits and Dividend Events"
    "2021-07-23"
 #undef
 #undef
 #undef
 #undef
    "AAPL"
 #undef
 #undef
    "Compact"
    "US/Eastern"
 #undef
 #undef
 #undef
 #undef
 #undef

Not knowing how things work behind the scenes, I thought these extra fields might depend on how the AlphaVantage.time_series_daily_adjusted function builds dictionaries.
To dispel the doubt I ran a test with some dictionaries I defined myself, and got a similar result:

julia> vd=Dict{String, Any}("uno" => "1")
Dict{String, Any} with 1 entry:
  "uno" => "1"
julia> propertynames(vd)
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

julia> Tables.columns([vd])
Tables.CopiedColumns{NamedTuple{(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe), Tuple{Vector{Vector{UInt8}}, Vector{Vector{String}}, Vector{Vector{Any}}, Vector{Int64}, Vector{Int64}, Vector{UInt64}, Vector{Int64}, Vector{Int64}}}} with 1 rows, 8 columns, and schema:
 :slots     Vector{UInt8} (alias for Array{UInt8, 1})
 :keys      Vector{String} (alias for Array{String, 1})
 :vals      Vector{Any} (alias for Array{Any, 1})
 :ndel      Int64
 :count     Int64
 :age       UInt64
 :idxfloor  Int64
 :maxprobe  Int64

The same happens with Symbol keys:

julia> vd=Dict{Symbol, Any}(:uno => "1")
Dict{Symbol, Any} with 1 entry:
  :uno => "1"

julia> propertynames(vd)
(:slots, :keys, :vals, :ndel, :count, :age, :idxfloor, :maxprobe)

Does this mean that “internally” (some of you will understand better than I do what this term means in this case :-) dictionaries have a structure that uses meta-information in addition to keys and values?

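Yes, dictionaries are highly optimized and contain some internal fields to help performance. Dict is a hash table: slots, keys and vals are its storage arrays, which is why vals above shows #undef for unused slots, and the remaining fields are bookkeeping.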
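A short sketch of the difference between those internal fields and the public interface, using only Base Julia. The field names are an implementation detail and may differ between Julia versions, so code should rely on keys, values and length instead:

```julia
d = Dict{String, Any}("uno" => "1")

# Internal layout: an implementation detail, shown only for inspection.
# propertynames falls back to these field names for ordinary structs.
internal = fieldnames(typeof(d))

# Supported public interface, independent of the internal layout.
ks = collect(keys(d))     # ["uno"]
vs = collect(values(d))   # Any["1"]
n = length(d)             # 1
```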

It’s definitely a bit confusing that sometimes calling DataFrame on a vector of objects will expose a new user to all these details. But consistency with Tables.jl is a real strong point.

Actually I think the best thing to do is to open an issue in AlphaVantage.jl and ask if they could provide the data in a format that is compatible with the Tables.jl interface.

I already did this last week!