Edit: Name has changed to ExpandNestedData.jl based on the feedback in this thread.
I often have to “flatten out” nested JSON data into tables, so I decided to have a go at a package that can handle the task more flexibly than just writing a bunch of nested for loops. And so Normalize.jl was born.
using Normalize
using JSON3
using DataFrames
message = JSON3.read("""
{
"a" : [
{"b" : 1, "c" : 2},
{"b" : 2},
{"b" : [3, 4], "c" : 1},
{"b" : []}
],
"d" : 4
}
"""
)
normalize(message) |> DataFrame
Returns:
4×3 DataFrame
Row | d | a_b | a_c |
---|---|---|---|
 | Int64 | Union… | Int64? |
1 | 4 | 1 | 2 |
2 | 4 | 2 | missing |
3 | 4 | [3, 4] | 1 |
4 | 4 | [] | missing |
It has options for
- Using PooledArrays (helpful with large JSON/XML files with repetitive data)
- Flattening array values (would spread `[3, 4]` in the previous example to two rows)
- Setting default for missing values
- Replacing automatic names with something more meaningful
- Specifying all these settings on a per-column level
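To make the flattening option concrete, here is a minimal plain-Julia sketch of what spreading an array value across rows means. It does not use the package's API: `flatten_rows` is a hypothetical helper, the rows are hand-written dictionaries that mirror the example output above, and the handling of the empty array (one row holding `missing`) is an assumption rather than documented package behavior.

```julia
# Hypothetical helper, not part of the package: emit one output row per
# element of the array stored under `key`; scalar values pass through.
function flatten_rows(rows::Vector{<:AbstractDict}, key)
    out = Dict{String,Any}[]
    for row in rows
        value = get(row, key, missing)
        # Assumption: an empty array still produces one row, holding `missing`.
        elements = value isa AbstractVector ? (isempty(value) ? [missing] : value) : [value]
        for element in elements
            new_row = Dict{String,Any}(row)  # copy so the input row is untouched
            new_row[key] = element
            push!(out, new_row)
        end
    end
    return out
end

# Rows mirroring the unflattened example output.
rows = [
    Dict{String,Any}("d" => 4, "a_b" => 1, "a_c" => 2),
    Dict{String,Any}("d" => 4, "a_b" => 2),
    Dict{String,Any}("d" => 4, "a_b" => [3, 4], "a_c" => 1),
    Dict{String,Any}("d" => 4, "a_b" => Int[]),
]

flat = flatten_rows(rows, "a_b")  # the [3, 4] row becomes two rows
```

With this input, the four rows become five, and the other columns (`d`, `a_c`) are repeated for each expanded element.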
Since this is my first attempt at a package, I’d really love some advice/guidance on a few things:
- How the docs can be improved. (Side note: Documenter.jl is populating the `Dev` version, but I don’t know how to get the `Stable` version of the docs to publish.)
- The API: Thoughts on the custom `ColumnDefinition` struct for accepting user parameters? Other options that should be supported?
- Code organization: I’ve noticed that a lot of repos keep their whole source in one very long file. I normally split mine up into a bunch of purpose-specific files, but I want to respect the community’s preferences if this is going to be a public package. Is there a best practice?
- I invented the `NestedIterator` struct and the various repeating/stacking functions that can be applied to it because nesting `Iterators.repeated(cycle(somegenerator, 4))...` was slowing things way down, I think because of type inference. However, the trade-off is having to do a large number of composed steps for each index when collecting into a `Vector` at the end. Is there a better approach for deeply nesting iterators like this?
- The name. Googling `normalize` does generally yield “normal data structure” in the context of database tables, but I know that it’s also used for “normalizing data” in the context of ML and statistics. I don’t want to take up prime real estate in the namespace if there is a better name for this package.
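For anyone weighing in on the iterator question, the general idea of composing index functions instead of nesting iterators can be sketched as below. This is an illustration of the technique only, not the package's `NestedIterator`: the names `LazyColumn`, `repeat_each`, `cycle_all`, `concat`, and `materialize` are all made up for this example.

```julia
# Hypothetical sketch: a lazy column is an index -> value function plus a length.
struct LazyColumn{F}
    getindex_fn::F  # maps an index in 1:len to a value
    len::Int
end

# Wrap a concrete vector as the innermost layer.
LazyColumn(values::AbstractVector) = LazyColumn(i -> values[i], length(values))

# Repeat every element `n` times in place: [a, b] -> [a, a, b, b].
repeat_each(c::LazyColumn, n::Int) =
    LazyColumn(i -> c.getindex_fn(cld(i, n)), c.len * n)

# Repeat the whole sequence `n` times: [a, b] -> [a, b, a, b].
cycle_all(c::LazyColumn, n::Int) =
    LazyColumn(i -> c.getindex_fn(mod1(i, c.len)), c.len * n)

# Concatenate two columns end to end (the "stack" step).
concat(a::LazyColumn, b::LazyColumn) =
    LazyColumn(i -> i <= a.len ? a.getindex_fn(i) : b.getindex_fn(i - a.len), a.len + b.len)

# Materialize by calling the composed function once per index.
materialize(c::LazyColumn) = [c.getindex_fn(i) for i in 1:c.len]

col = concat(repeat_each(LazyColumn([1, 2]), 2), cycle_all(LazyColumn([3, 4]), 2))
result = materialize(col)  # [1, 1, 2, 2, 3, 4, 3, 4]
```

Every wrapper adds one more function call per index at materialization time, which is the cost described in the question above: no intermediate arrays are allocated, but a deeply nested column pays for every layer on every element.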
Current Loose Ends that I’m tracking:
- Currently, I’m checking across all column definitions for unique names and whether I’m at a “leaf node” at every step. Really, this should be parsed into a graph first, and then passed into the processing steps.
- I want to rename `stack` iterators to `vcat` since that is really what’s happening.
- Stable Docs don’t work
- Write tests for functions below the public API (especially NestedIterators)
- I’m currently testing with `JSON3.jl` (both its `JSON3.Object` and `StructTypes.jl` outputs), but this should work with `YAML.jl`, `XMLDict.jl`, etc. I’d like to add tests for these.
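On the first loose end above (parsing column definitions into a graph before processing), a rough sketch of the idea might look like the following. Everything here is hypothetical: `PathNode`, `insert_path!`, `is_leaf`, and `leaf_names` are illustrative names, and the package's actual column definitions may carry more information than a bare key path.

```julia
# Hypothetical path tree: each node is one key, children are the nested keys.
struct PathNode
    name::String
    children::Dict{String,PathNode}
end
PathNode(name::AbstractString) = PathNode(String(name), Dict{String,PathNode}())

# Walk the path from the root, creating any nodes that do not exist yet.
function insert_path!(root::PathNode, path)
    node = root
    for key in path
        node = get!(() -> PathNode(key), node.children, key)
    end
    return root
end

# A leaf is a node with no children; known once, after the tree is built.
is_leaf(node::PathNode) = isempty(node.children)

# Collect underscore-joined column names for every leaf, in sorted key order.
function leaf_names(node::PathNode, prefix=String[])
    is_leaf(node) && return [join(prefix, "_")]
    names = String[]
    for key in sort!(collect(keys(node.children)))
        append!(names, leaf_names(node.children[key], [prefix; key]))
    end
    return names
end

root = PathNode("root")
for path in (["a", "b"], ["a", "c"], ["d"])
    insert_path!(root, path)
end
names = leaf_names(root)  # ["a_b", "a_c", "d"]
```

Building the tree up front means the leaf check and the name-uniqueness check each run once per column rather than at every processing step.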
Edit: corrected the output now that code is working correctly. Thank you, @Dan!