Combine multiple DataFrames with some columns missing

Hi all,

I work with an API that sometimes returns incomplete JSON: some properties are missing. This can be reproduced with the following code:

using Query
using CSV
using JSON
using DataFrames
testjson = """
[{"A":5, "B":6}, {"A":7, "D":8}]
"""
js = JSON.parse(testjson)
df = DataFrame.(js)
vcat(df...)

Note that the first object contains properties A and B, while the second contains A and D. As a result, I'd like to have a DataFrame with 2 rows and columns A, B, D.

vcat throws an error, which is not surprising.

julia> vcat(df...)
ERROR: ArgumentError: column(s) D are missing from argument(s) 1, and column(s) B are missing from argument(s) 2
Stacktrace:
 [1] _vcat(::Array{DataFrame,1}; cols::Symbol) at C:\Users\u\.julia\packages\DataFrames\3ZmR2\src\abstractdataframe\abstractdataframe.jl:1421

My "solution" looks quite ugly. Is there another way to do this?

using Query
using CSV
using JSON
using DataFrames
testjson = """
[{"A":5, "B":6}, {"A":7, "D":8}]
"""
js = JSON.parse(testjson)
df = DataFrame.(js)

dfs = DataFrame.(df);
# get all column names through all JSON objects (= A, B, D)
dfcols = collect.(keys.(js)) |> 
              Iterators.flatten |> 
              @groupby(_) |> 
              @map(Name=key(_)) |> 
              collect
# for each DataFrame find missing columns and place some default value
for df = dfs
  missingcolumns = setdiff(dfcols, names(df))
  for col = missingcolumns
    df[!, col] .= ""
  end
end
# finally possible to concat the dataframes
dffinal = vcat(dfs...)

The result looks like this:

julia> dffinal
2×3 DataFrame
│ Row │ A     │ B   │ D   │
│     │ Int64 │ Any │ Any │
├─────┼───────┼─────┼─────┤
│ 1   │ 5     │ 6   │     │
│ 2   │ 7     │     │ 8   │

cols = :union in vcat should work, right?
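
For anyone landing here later, a minimal sketch of that suggestion. It builds the two single-row frames by hand instead of parsing JSON (the names df1, df2, dfs and combined are only illustrative), and it assumes a DataFrames.jl version where names returns strings:

```julia
using DataFrames

# Two single-row frames with partly different columns, mirroring the JSON objects in the question
df1 = DataFrame(A = [5], B = [6])
df2 = DataFrame(A = [7], D = [8])
dfs = [df1, df2]

# cols = :union keeps every column that appears in any input;
# cells that have no value in a given frame are filled with missing
combined = vcat(dfs...; cols = :union)

# Equivalent call that avoids splatting a long vector of frames
combined2 = reduce(vcat, dfs; cols = :union)
```

Unlike the workaround above, the gaps are filled with missing rather than an empty string, so B and D keep an integer element type (Union{Missing, Int64}) instead of becoming Any.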


If you don't already know, you can type ? vcat in the REPL to pull up the documentation for that function. Read through it to find the section on data frames, and you will see that there is a keyword argument for what you want.

Pretty! Thanks a lot.

It didn't occur to me that vcat itself might solve that. Honestly, I only found vcat on StackOverflow when trying to combine multiple DataFrames.

Please get into the habit of looking at ? fun whenever you are confused about a function. It will probably have the answer.
