[ANN] ExpandNestedData.jl

ExpandNestedData.jl is in the process of being registered!

Do you wish you could see your JSON data as a table? How about XML? Or a struct of structs? If so, ExpandNestedData is the tool for you. Just pass your object to expand and it will unpack any nested data into a columnar table.

TL;DR

using ExpandNestedData 
using JSON3
using DataFrames

message = JSON3.read("""
    {
        "a" : [
            {"b" : 1, "c" : 2},
            {"b" : 2},
            {"b" : [3, 4], "c" : 1},
            {"b" : []}
        ],
        "d" : 4
    }
    """
)

expand(message) |> DataFrame
# returns
5×3 DataFrame
 Row │ d      a_b      a_c
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     4  missing  missing
   2 │     4        3        1
   3 │     4        4        1
   4 │     4        2  missing
   5 │     4        1        2

expand has many useful kwargs that let you tweak how the columns are collected, how the column names are constructed, and whether the table should be flat or the columns/returned rows should be nested to match the structure of the source data. You can even designate which paths to include, ignoring any branches of the input data that are not listed. See the docs for detailed descriptions of all options.
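
As a rough illustration of where flat column names such as a_b and a_c come from, here is a minimal sketch in plain Julia. It is not the package's implementation: leaf_columns is a hypothetical helper, it only walks nested Dicts (no arrays, no missing-value handling), and the "_" separator is assumed from the example output above.

```julia
# Hypothetical helper, NOT part of ExpandNestedData.jl: walk a nested Dict
# and name each leaf value by joining the keys on its path with "_".
function leaf_columns(data::AbstractDict, prefix::Vector{String}=String[])
    columns = Dict{String,Any}()
    for (key, value) in data
        path = [prefix; string(key)]
        if value isa AbstractDict
            # Recurse into nested objects, carrying the path so far.
            merge!(columns, leaf_columns(value, path))
        else
            # A leaf: its column name is the joined path.
            columns[join(path, "_")] = value
        end
    end
    return columns
end

record = Dict(:a => 1, :b => "2", :c => Dict(:e => Symbol(3), :f => 4))
columns = leaf_columns(record)
```

Here columns ends up keyed by "a", "b", "c_e" and "c_f". The real expand does much more than this (arrays, missing values, nested output), so treat the sketch only as a mental model for the naming.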

I’ve tested this package with XMLDict.jl and JSON3.jl extensively, and it handles a number of edge cases well, but if you find any bugs, please let me know!

Outstanding Goals

  • Support for AbstractTrees.jl input (this would enable composability with Gumbo.jl and XML.jl)
  • Use custom Table as input for compressing tabular data to nested data
  • Widen arrays so column names match XPath expressions
  • Parse XPath to ColumnDefinitions
  • Dispatch on user-defined get_keys and get_values functions to traverse arbitrary custom types

Contributing

I’d love help with any of the outstanding goals or other features you think would be useful. Further, I’m sure there is still performance left on the table. Unpacking completely generic Dicts and Arrays has been challenging to type-stabilize, and I’m definitely open to feedback. Core.jl contains the central logic of unpacking the input, if you want to take a crack at it.


Version 1.1.0 has been released!

There are no new features, but I did a major overhaul of the internals and brought down the allocations by ~200x. This is in no small part thanks to @Mason’s SumTypes.jl which, as I’ve said before, is totally amazing!

For just a small example of the improvement:

julia> small_dict = Dict(
           :a => 1,
           :b => "2",
           :c => Dict(:e => Symbol(3), :f => 4)
       );

julia> many_records = [small_dict for _ in 1:10_000];

# lazy_columns so we don't measure the time to collect the values
# nested so we aren't measuring reorganizing columns into a flat table
# This ensures we are only running the code I've refactored
julia> @btime ExpandNestedData.expand($many_records; lazy_columns=true, column_style=:nested);
  191.322 ms (2310855 allocations: 92.78 MiB)

Previously, this benchmark completely froze my REPL, so I can’t compare its exact performance improvement. But, last I checked, 191ms < Inf ms :smile: If I remove the kwargs above, it only increases run time to 202ms, so we’ve seen major gains in the worst parts of the algorithm!

All this to say, if you need to consume lots and lots of nested data (looking at you, XML) and turn it into a table, this package’s performance shouldn’t hold you back anymore!
