Edit: Name has changed to ExpandNestedData.jl based on the feedback in this thread.
I often have to “flatten out” nested JSON data into tables, so I decided to have a go at a package that can handle the task more flexibly than just writing a bunch of nested for loops. And so Normalize.jl was born.
using Normalize
using JSON3
using DataFrames
message = JSON3.read("""
{
"a" : [
{"b" : 1, "c" : 2},
{"b" : 2},
{"b" : [3, 4], "c" : 1},
{"b" : []}
],
"d" : 4
}
"""
)
normalize(message) |> DataFrame
Returns:
4×3 DataFrame
| Row | d | a_b | a_c |
|---|---|---|---|
| | Int64 | Union… | Int64? |
| 1 | 4 | 1 | 2 |
| 2 | 4 | 2 | missing |
| 3 | 4 | [3, 4] | 1 |
| 4 | 4 | [] | missing |
It has options for:
- Using PooledArrays (helpful with large JSON/XML files with repetitive data)
- Flattening array values (would spread `[3, 4]` in the previous example to two rows)
- Setting defaults for missing values
- Replacing automatic names with something more meaningful
- Specifying all these settings on a per-column level
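To make the array-flattening option concrete, here is a minimal sketch in plain Julia that does not touch the package's API. `flatten_rows` is a hypothetical helper written only for this illustration, and keeping an empty array as a single `missing` row is an assumption, not necessarily what the package does.

```julia
# Hypothetical helper (not part of the package): expand one column whose
# values may be vectors into one row per element.
function flatten_rows(rows, col::Symbol)
    out = NamedTuple[]
    for row in rows
        value = row[col]
        if value isa AbstractVector
            if isempty(value)
                # Assumption: an empty array keeps its row, with a missing value.
                push!(out, merge(row, NamedTuple{(col,)}((missing,))))
            else
                for element in value
                    # Copy the row once per element, replacing the vector.
                    push!(out, merge(row, NamedTuple{(col,)}((element,))))
                end
            end
        else
            # Scalar values pass through unchanged.
            push!(out, row)
        end
    end
    return out
end

# Rows mirroring the table above.
rows = [
    (d = 4, a_b = 1, a_c = 2),
    (d = 4, a_b = 2, a_c = missing),
    (d = 4, a_b = [3, 4], a_c = 1),
    (d = 4, a_b = Int[], a_c = missing),
]

flat = flatten_rows(rows, :a_b)
```

With this input, the row holding `[3, 4]` becomes two rows that share the same `d` and `a_c` values, so the result has five rows instead of four.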
Since this is my first attempt at a package, I’d really love some advice/guidance on a few things:
- How the docs can be improved. (Side note here: Documenter.jl is populating the `Dev` version, but I don’t know how to get the `Stable` version of the docs to publish.)
- The API: Thoughts on the custom `ColumnDefinition` struct for accepting user parameters? Other options that should be supported?
- Code organization: I’ve noticed that a lot of repos keep their whole source in one very long file. I normally split mine up into a bunch of purpose-specific files, but I want to respect the community’s preferences if this is going to be a public package. Is there a best practice?
- I invented the `NestedIterator` struct and the various repeating/stacking functions that can be applied to them because nesting `Iterators.repeated(cycle(somegenerator, 4))...` was slowing things way down, I think because of inference. However, it trades off with having to do a large number of composed steps for each index when collecting into a `Vector` at the end. Is there a better approach for deeply nesting iterators like this?
- The name. Googling `normalize` does generally yield “normal data structure” in the context of database tables, but I know that it’s also used for “normalizing data” in the context of ML and statistics. I don’t want to take up prime real estate in the namespace if there is a better name for this package.
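For anyone who wants to picture the iterator trade-off described above, here is a rough sketch of the general idea: a column is stored as an index function plus a length, and each repeat/cycle/stack step wraps that function in another layer. This is not the package's actual `NestedIterator`; `LazyColumn`, `repeat_each`, `cycle_n`, `stack_columns`, and `materialize` are all hypothetical names made up for this example.

```julia
# Hypothetical sketch, not the package's NestedIterator implementation.
struct LazyColumn{F}
    getter::F   # maps an index in 1:len to a value
    len::Int
end

# Wrap an existing vector as the innermost layer.
LazyColumn(v::AbstractVector) = LazyColumn(i -> v[i], length(v))

# Repeat each element n times: [a, b] -> [a, a, b, b]
repeat_each(c::LazyColumn, n::Int) =
    LazyColumn(i -> c.getter(fld1(i, n)), c.len * n)

# Cycle the whole column n times: [a, b] -> [a, b, a, b]
cycle_n(c::LazyColumn, n::Int) =
    LazyColumn(i -> c.getter(mod1(i, c.len)), c.len * n)

# Place one column after another (vcat semantics).
stack_columns(a::LazyColumn, b::LazyColumn) =
    LazyColumn(i -> i <= a.len ? a.getter(i) : b.getter(i - a.len),
               a.len + b.len)

# Collect into a Vector: every index walks the full chain of wrappers.
materialize(c::LazyColumn) = [c.getter(i) for i in 1:c.len]

b = LazyColumn([3, 4])
d = LazyColumn([4])

repeated = materialize(repeat_each(b, 2))
cycled   = materialize(cycle_n(b, 2))
stacked  = materialize(stack_columns(repeat_each(b, 2), d))
```

Nothing is copied until `materialize` runs, but each extra layer adds one more function call per index, which is the cost mentioned in the bullet above.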
Current Loose Ends that I’m tracking:
- Currently, I’m checking across all column definitions for unique names and whether I’m at a “leaf node” at every step. Really, this should be parsed into a graph first, and then passed into the processing steps.
- I want to rename `stack` iterators to `vcat` since that is really what’s happening.
- Stable docs don’t work.
- Write tests for functions below the public API (especially `NestedIterator`s).
- I’m currently testing with `JSON3.jl` (both its `JSON3.Object` and `StructTypes.jl` outputs), but this should work with `YAML.jl`, `XMLDict.jl`, etc. I’d like to add tests for these.
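As a rough idea of what "parse into a graph first" from the first bullet could look like, the sketch below builds a simple tree from column paths up front, so that duplicate checks and leaf detection happen once rather than at every processing step. It is a tree rather than a general graph, and `PathNode`, `isleaf`, and `build_path_tree` are hypothetical names; representing each column as a vector of `Symbol` keys is also an assumption.

```julia
# Hypothetical sketch: one node per key along a column path.
struct PathNode
    name::Symbol
    children::Dict{Symbol,PathNode}
end

# Convenience constructor for a node with no children yet.
PathNode(name::Symbol) = PathNode(name, Dict{Symbol,PathNode}())

# A node with no children corresponds to a column of values.
isleaf(node::PathNode) = isempty(node.children)

function build_path_tree(paths)
    root = PathNode(:root)
    seen = Set{Vector{Symbol}}()
    for path in paths
        # Reject duplicate paths once, before any data is processed.
        path in seen && throw(ArgumentError("duplicate column path: $path"))
        push!(seen, path)
        node = root
        for key in path
            # Reuse the child for this key if it exists, otherwise create it.
            node = get!(() -> PathNode(key), node.children, key)
        end
    end
    return root
end

# Paths matching the example message: a.b, a.c and d.
tree = build_path_tree([[:a, :b], [:a, :c], [:d]])
```

Here `tree.children[:d]` is a leaf, while `tree.children[:a]` has the two children `:b` and `:c`, so the processing step could ask `isleaf` instead of re-scanning all column definitions.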
Edit: corrected the output now that code is working correctly. Thank you, @Dan!