[Pre-ANN/RFC] ExpandNestedData.jl (Previously Normalize.jl)

Edit: Name has changed to ExpandNestedData.jl based on the feedback in this thread.

I often have to “flatten out” nested JSON data into tables, so I decided to have a go at a package that can handle the task more flexibly than just writing a bunch of nested for loops. And so Normalize.jl was born.

using Normalize
using JSON3
using DataFrames

message = JSON3.read("""
    {
        "a" : [
            {"b" : 1, "c" : 2},
            {"b" : 2},
            {"b" : [3, 4], "c" : 1},
            {"b" : []}
        ],
        "d" : 4
    }
    """
)

normalize(message) |> DataFrame

Returns:
4×3 DataFrame

Row d a_b a_c
Int64 Union… Int64?
1 4 1 2
2 4 2 missing
3 4 [3, 4] 1
4 4 [] missing

It has options for

  • Using PooledArrays (helpful with large JSON/XML files with repetative data)
  • Flattening array values (would spread [3, 4] in the previous example to two rows)
  • Setting default for missing values
  • Replacing automatic names with something more meaningful
  • Specifying all these settings on a per-column level

Since this is my first attempt at a package, I’d really love some advice/guidance on a few things:

  1. How the docs can be improved. (Side note here – Documenter.jl is populating the Dev version, but I don’t know how to get the Stable version of the docs to publish)
  2. The API: Thoughts on the custom ColumnDefinition struct for accepting user parameters? Other options that should be supported?
  3. Code organization: I’ve notice that a lot of repos keep their whole source in one, very long file. I split mine up into a bunch of purpose-specific files normally, but I want to respect the community’s preferences if this is going to be a public package. Is there a best practice?
  4. I invented the NestedIterator struct and the various repeating/stacking functions that can be applied to them because nesting Iterators.repeated(cycle(somegenerator, 4))... was slowing things way down, I think because of Inference. However, it trades off with having to do a large number of composed steps for each index when collecting into a Vector at the end. Is there a better approach for deeply nesting iterators like this?
  5. The name. Googling normalize does generally yield “normal data structure” in the context of database tables, but I know that it’s also used for “normalizing data” in the context of ML and statistics. I don’t want to take up prime real estate in the namespace if there is a better name for this package.

Current Loose Ends that I’m tracking:

  1. Currently, I’m checking across all column definitions for unique names and whether I’m at a “leaf node” at every step. Really, this should be parsed into a graph first, and then passed into the processing steps.
  2. I want to rename stack iterators to vcat since that is really what’s happening.
  3. Stable Docs don’t work
  4. Write tests for functions below the public api (especially NestedIterators)
  5. I’m currently testing with JSON3.jl (both its JSON3.Object and StructType.jl outputs), but this should work with YAML.jl, XMLDict.jl, etc. I’d like to add tests for these.

Edit: corrected the output now that code is working correctly. Thank you, @Dan!

4 Likes

3rd column is missing and not [] - is this correct?

UPDATE: Fixed in OP. So good now.

1 Like

Ah, yikes! Nice catch. It would be missing if expand_array was true. Must have the wrong logic for the case of expand_array is false and array length is 0. I’ll get a PR in tonight.

Very cool stuff! Yeah, I get that Normalize.jl feels a bit generic; maybe DataNormalize.jl? or NormalizeObject.jl? Honestly Normalize.jl doesn’t seem wrong at all, so if others agree, I think that’s a great name.

I actually have a slightly similarly purposed package called Strapping.jl, that aims to make a bridge between StructTypes.jl objects and the Tables.jl interface. It works pretty well, but I haven’t quite had the time to really stress test it in a production setting. It doesn’t quite work out of the box for generic Dicts, since it targets any Tables.jl-compatible input, but I think it’s pretty neat and has some similar functionality as Normalize.jl (flattening nested objects to a 2d structure).

With that context, I wanted to ask how you’re outputting to a DataFrame? Is it via the Tables.jl interface? If so, that’s awesome, because it means we could flatten inputs to any kind of tables sink instead of just DataFrames.

For stable docs, I believe they get deployed automatically via TagBot once you have an official tagged release, so it makes sense you just have Dev for now.

Just wanted to say, very cool stuff! Hopefully I can find some time to play around with this in the near future.

3 Likes

“Normalization” has many important technical meanings, in statistics, databases, unicode, natural-language processing, …

DataNormalization, Flattening, seem fine to me

12 Likes

my 2 cents. normalisation doesn’t give me an instant knowledge. perhaps normalisation is the right technical term in data domain, I’ve heard about normalisation relationship tables. Maybe something like Unnest feels like it’s more intuitive as a name. Otherwise I 2nd @quinnj

6 Likes

I had never heard of normalize in the data structure sense, just the statistical sense. IMO a more descriptive name would be better.

1 Like

Yep! I collect into a NamedTuple of (optionally Pooled)Arrays and return that. You can direct that into whatever sink you’d like!

Oh, cool! I hadn’t seen this yet. I’ll look at it and probably borrow liberally! :wink:

Also, I’m not married to this being its own package. I was originally going to put it in as a PR to JSONTables.jl since my initial impetuous was trying to extend the functionality of parsing JSON files, but since it could help with XMLDict.jl and others, separating it out seemed to make more sense. If you think it fits into the toolbelt of Strapping.jl now or down the road once it’s had some time to mature, I’m down merge it in.

Totally agree with all the feedback on the name. I think Unnest has promise, since from a quick google search, it looks like Tidyverse has a function of the same name that unpacks nested tables (not the same function, but same spirit). My gut says Unnest is pretty uninformative too on its own though. So maybe UnnestData or FlattenNestedData and then rename the function to unnest?

See also unnest · Issue #3116 · JuliaData/DataFrames.jl · GitHub

cc @bkamins

3 Likes

After sleeping on it, I disagree with myself. The fact that (un)nest already refers to tables of tables in R might make it confusing. Doubly so given @jar1’s link to the PR adding it to DataFrames.jl.

Given that pandas calls this .json_normalize, I think NormalizeNestedData is probably best. Nested hopefully disambiguates from normalizing data in statistics and such. It’s a little wordy, but it’s the most succinct I can come up with

Unless there are any better ideas today, I’ll go through switching it up this evening!

2 Likes

Maybe this can be called expand or json_expand as it seems to be the reverse operation of a compress on the table. Anyway, expand is also the opposite of simplification (simplify) or factorization (factor), and the JSON seems to be interpreted as a big cross-product expression.
Having a function which goes the reverse way on a DataFrame, and produces a JSON could be nice (there is the issue of choosing a column ordering - which might take exp-time to optimize, but heuristics exist, and user can supply ordering / template too. Perhaps column ordering can be inferred from column names or original ordering or both).

1 Like

I think this is a much better name than Normalize.jl since to the entire field of data science it means something completely different. Really cool work by the way and very useful. :muscle:t2:

3 Likes

Flattening nested structures in this way looks pretty convenient indeed!
For futher ease of programmatic postprocessing, I’d also suggest an alternative output format like

[(d=4, (a=(b=1, c=2),), ...]

instead or in addition to the current

[(d=4, a_b=1, a_c=2), ...]

It would still flatten arrays, but keep the nesting structure intact.

1 Like

Currently, I’m actually outputting

(
    d = [4,4,4,4],
    a_b = [1, 2, [3,4], []],
    a_c = [2, missing, 1, missing]
)

because

  1. It goes straight to the columnar format that many Tables.jl sinks use
  2. Getting all the cross products from large data structures makes the row count explode, so preferring columns lets me use PooledArrays.jl when it benefits the memory footprint.

That being said, adding an option to do eachrows over that output which creates a view formatted as a nested tuples like you suggest is a great idea!

Sorry for the confusion, I didn’t suggest switching to row-based storage!

Data structures, incl tables, that have row-based interface are often more convenient to work with - but this doesn’t say anything about their storage. For example,

tbl = (
    d = [4,4,4,4],
    a_b = [1, 2, [3,4], []],
    a_c = [2, missing, 1, missing]
) |> StructArray

is an array of named tuples (; d, a_b, a_c), but with columnar storage. Of course, stuff like PooledArrays works here just fine.
This is how the situation in Normalize.jl is right now: the result is effectively a collection of named tuples of the form (; d, a_b, a_c), they are just stored in a columnar way.

I don’t suggest changing the storage: columnar is typically more performant indeed, even without pooling. I just asked for (; d, a=(; b, c)) as elements instead of (; d, a_b, a_c).
For example,

julia> tbl = StructArray(
           d = [4,4,4,4],
           a = StructArray(
               b=[1, 2, [3,4], []],
               c=[2, missing, 1, missing]
           )
       )

julia> tbl[1]
(d = 4, a = NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}((1, 2)))

julia> tbl.d
4-element Vector{Int64}:
 4
 4
 4
 4

julia> tbl.a
4-element StructArray(::Vector{Any}, ::Vector{Union{Missing, Int64}}) with eltype NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}:
 NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}((1, 2))
 NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}((2, missing))
 NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}(([3, 4], 1))
 NamedTuple{(:b, :c), Tuple{Any, Union{Missing, Int64}}}((Any[], missing))

julia> tbl.a.b
4-element Vector{Any}:
 1
 2
  [3, 4]
  Any[]

Ah, I gotchya! Sorry I assumed incorrectly there.

I love the idea. Is it sufficient to return

(
     d = [4,4,4,4],
     a = (
         b=[1, 2, [3,4], []],
         c=[2, missing, 1, missing]
     )
 )

And let a user format that as StructArrays or whatever they’d like. Or is it better to add StructArrays as a dependency and do that formatting for automatically?

I really like this idea. I think it addresses other concerns about the fact that normalize doesn’t give instant knowledge for everyone, especially if you haven’t worked with different database paradigms. I’m inclined to use ExpandNestedData.jl, but I’ll wait another minute for reactions.

This is another killer idea. Two initial thoughts I have are

  1. I could take inspiration from the StructType templating that @quinnj uses in JSON3.jl. I think it would be relatively straight forward to do the correct groupingbys on a table to build the correct structs.
  2. I struggle to think of good inference heuristic because
# Given a table
> tbl = (
    a = [1, 1, 2, 2]
    b = [1, 2, 3, 4]
    c = [5, 6, 7, 8]
)
# Results of 
> compress(tbl)
# could be
Dict(
    :a => [1, 2]
    :lvl2 => [
        Dict(:b => [1, 2], :c => [5, 6]),
        Dict(:b => [3, 4], :c => [7, 8])
    ]
)
# Or
Dict(
    :a => [1,1,2,2],
    :lvl2 => [
        Dict(:b => 1, :c => 5),
        Dict(:b => 2, :c => 6),
        Dict(:b => 3, :c => 7),
        Dict(:b => 4, :c => 8)
    ]
)
# ... or several other stackings of arrays

I guess you could prioritize organizations that limit the number of leaf nodes that are arrays (option 2)? But I can also see the argument that it should prioritize having less repeated data (option 1). I’d appreciate any insights on what you’d expect the behavior of such a function to be.

This representation looks clearer (to me):

but I think the compression would be:

> compress(tbl)
# could be
[
  Dict(:a = 1,
       :t1 = [ { :b = 1, :c = 5 },
               { :b = 2, :c = 6 }]
  ),
  Dict(:a = 2,
       :t2 = [ { :b = 3, :c = 7 },
               { :b = 4, :c = 8 }]
  )
]

The important property would be some sort of identify: expand(compress(df)) == df (might not hold in this exact way, but close).

If you JSONize the structure I quoted from your post, I think something other than the table in your post would be output from expand/normalize.

JSONization of the compression suggested here:

message = JSON3.read("""[ 
  { "a" : 1, "t1" : [ { "b" : 1, "c" : 5 }, { "b" : 2, "c" : 6 } ] },
  { "a" : 2, "t2" : [ { "b" : 3, "c" : 7 }, { "b" : 4, "c" : 8 } ] } ]
  """
)
1 Like

Nesting, in both of the suggested ways, can be done rather naturally using DataFrames and Tables:

julia> using DataFrames, Tables, Chain, JSON3

julia> tbl = DataFrame(a = [1,1,2,2], b = [1,2,3,4], c = [5,6,7,8])
4×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      5
   2 │     1      2      6
   3 │     2      3      7
   4 │     2      4      8

julia> variant1 = @chain begin
           tbl
           groupby(_, :a)
           combine(_, AsTable(Not(:a)) => Ref ∘ Tables.columntable => :lvl2)
           Tables.columntable(_)
       end
(a = [1, 2], lvl2 = NamedTuple{(:b, :c), Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false}, SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false}}}[(b = [1, 2], c = [5, 6]), (b = [3, 4], c = [7, 8])])

julia> print(JSON3.write(variant1))
{"a":[1,2],"lvl2":[{"b":[1,2],"c":[5,6]},{"b":[3,4],"c":[7,8]}]}

julia> variant2 = @chain begin
           tbl
           groupby(_, :a)
           combine(_, AsTable(Not(:a)) => Ref ∘ Tables.rowtable => :lvl2)
           Tables.rowtable(_)
       end
2-element Vector{NamedTuple{(:a, :lvl2), Tuple{Int64, Vector{NamedTuple{(:b, :c), Tuple{Int64, Int64}}}}}}:
 (a = 1, lvl2 = [(b = 1, c = 5), (b = 2, c = 6)])
 (a = 2, lvl2 = [(b = 3, c = 7), (b = 4, c = 8)])

julia> print(JSON3.write(variant2))
[{"a":1,"lvl2":[{"b":1,"c":5},{"b":2,"c":6}]},{"a":2,"lvl2":[{"b":3,"c":7},{"b":4,"c":8}]}]
1 Like