[ANN] Cleaner.jl: A toolbox of simple solutions for common data cleaning problems

Finally I feel confident enough in the current state of my package to announce it here.

Cleaner.jl is a pure Julia package for data cleaning inspired by janitor from the R ecosystem, designed to be compatible with any Tables.jl implementation while also implementing the Tables.jl interface.

Key features currently are:

  • Format column names to make them unique and fit snake_case or camelCase style.
  • Remove rows and columns with different kinds of empty values, e.g. missing, "", "NA", "None".
  • Delete columns filled with just a constant value.
  • Use a row as the names of the columns.
  • Minimize the number of element types in each column without making the column of type Any.
  • Automatically use multiple threads if your data is big enough (and you are running Julia with more than 1 thread).
  • Rematerialize your original Tables.jl source type, provided it defines a constructor for other Tables.jl implementations, since our main type CleanTable implements the Tables.jl interface too.

Examples:

julia> using Cleaner

julia> using DataFrames: DataFrame

julia> df = DataFrame(" Some bad Name" => [missing, missing, missing], "Another_weird name " => [1, "x", 3])
3×2 DataFrame
 Row │  Some bad Name  Another_weird name
     │ Missing         Any
─────┼─────────────────────────────────────
   1 │        missing  1
   2 │        missing  x
   3 │        missing  3

julia> ct = polish_names(df)
┌───────────────┬────────────────────┐
│ some_bad_name │ another_weird_name │
│       Missing │                Any │
├───────────────┼────────────────────┤
│       missing │                  1 │
│       missing │                  x │
│       missing │                  3 │
└───────────────┴────────────────────┘


julia> ct |> compact_columns! |> reinfer_schema!
┌────────────────────┐
│ another_weird_name │
│   U{Int64, String} │
├────────────────────┤
│                  1 │
│                  x │
│                  3 │
└────────────────────┘


julia> df2 = ct |> DataFrame
3×1 DataFrame
 Row │ another_weird_name
     │ Union…
─────┼────────────────────
   1 │ 1
   2 │ x
   3 │ 3

The example above could also be rewritten as a one-liner using Julia’s functional pipes:

julia> df2 = df |> polish_names |> compact_columns! |> reinfer_schema! |> DataFrame
3×1 DataFrame
 Row │ another_weird_name
     │ Union…
─────┼────────────────────
   1 │ 1
   2 │ x
   3 │ 3
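
To give a rough idea of what narrowing a column from Any to a small Union involves, here is a minimal sketch in plain Julia. The helper name `narrow_eltype` is hypothetical and this is not necessarily how reinfer_schema! is implemented; it only illustrates the general idea of collecting the concrete types that actually occur in a column.

```julia
# Hypothetical helper, not part of Cleaner.jl: narrow a column's element type
# to the union of the concrete types that actually occur in it.
function narrow_eltype(col::AbstractVector)
    T = Union{unique(map(typeof, col))...}  # e.g. Union{Int64, String}
    return convert(Vector{T}, col)          # copy into a tighter-typed vector
end

col = Any[1, "x", 3]           # same values as the example column above
narrowed = narrow_eltype(col)
eltype(narrowed)               # a small Union instead of Any
```

A small Union like this keeps the values unchanged while giving the compiler more type information than Any does.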

For a more in-depth guide to the current features, with working examples, feel free to refer to the latest stable documentation.

29 Likes

It seems great. Options like reinfer_schema! or polish_names! are very useful. Sometimes I had to do it ‘manually’, so I’m writing it down to use the package next time :slight_smile:

2 Likes

Nice one. DataConvenience.jl has a cleannames! function too.

3 Likes

Great to have a package like this working with arbitrary Tables, not tightly coupled with any one specific implementation!

Would it be possible to have a set of functions returning the same table type as input? Like, f(::Vector{NamedTuple}) -> Vector{NamedTuple}, f(::DataFrame) -> DataFrame, …

1 Like

We announced two very similar packages on the same day :sweat_smile:

I wonder if we should join efforts in TableTransforms.jl @TheRoniOne because these cleaning transformations are a small subset of the kinds of transforms people need to do with tabular data. We have worked out a simple API for lazy pipelines and exploit Transducers.jl for parallelism.

Let me know if you are interested in following the same API. I’d be happy to help migrate some of the cleaning operations over there. We will end up adding cleaning operations as well at some point, which is why I am asking here before duplicating the work.

5 Likes

I’m definitely interested in upstreaming my work on Cleaner.jl as a dependency of TableTransforms.jl, while keeping Cleaner.jl as a minimal dependency too (currently Cleaner.jl only depends on Tables.jl v"1" and PrettyTables v"1", so there shouldn’t be any compatibility problems).

Thanks to semver, all minor Cleaner.jl releases should be directly compatible with, and available through, the latest version of TableTransforms.jl. For major Cleaner.jl releases, I will personally upstream all new features to follow your API and update the Cleaner entry in TableTransforms.jl’s Project.toml, if you are fine with that.

I will also develop tests for all new features upstreamed from Cleaner.jl to TableTransforms.jl to ensure nothing breaks.

It would also be great to coordinate major Cleaner.jl and TableTransforms.jl releases, so that the latest Cleaner.jl features are always available in the latest version of TableTransforms.jl.

If you agree, please let me know so I can start preparing pull requests for the TableTransforms.jl repo.

3 Likes

It should be possible. I will try to put together a minimal working example by tomorrow so we can discuss it further.

I have a question: why shouldn’t we join efforts in a single repository? Cleaning transforms are widely used in the beginning of more sophisticated pipelines. If users could import a single package to have it all, that would be a great achievement in my opinion. Also, notice that the dependencies of TableTransforms.jl are pretty minimal at this point. Transducers.jl is the major one, but I think it is a crucial one in order to combine distributed with multi-threaded transforms.

In the worst-case scenario, where we keep working in separate repositories, we can still direct users to each other’s efforts. I see value in concentrating energy in a single place, especially when the Julia community is so small and tends to get busy with their own jobs sometimes. Having two people reviewing issues and PRs is already a major achievement.

Let me know if that makes sense. I am looking forward to improving the situation with table transforms in Julia either way! Thanks for the contributions!

1 Like

@xiaodai, nice, I did not know about that package!

I am not an author of any of these packages, but I am trying to move my dataframe-based work to Julia, and sometimes I do not know how to do some simple preprocessing step in Julia (actually, I could do it by hand, but I guess there is a package that implements it that I do not know about). However, I think that @juliohm is right. At the very least, it would be nice to have a reference document listing these different packages so that users can find them easily (it took me weeks to find a package for handling missing values, Impute.jl, for example).

1 Like

Managed to do a simple example.

julia> import Tables

julia> function polish_names_same_input_type(table; style::Symbol=:snake_case)
           # Polish the names, then convert the result back to the input's table type
           return polish_names(table; style=style) |> Tables.materializer(table)
       end
polish_names_same_input_type (generic function with 1 method)

julia> df = DataFrame(" Some bad Name" => [missing, missing, missing], "Another_weird name " => [1, "x", 3])
3×2 DataFrame
 Row │  Some bad Name  Another_weird name
     │ Missing         Any
─────┼─────────────────────────────────────
   1 │        missing  1
   2 │        missing  x
   3 │        missing  3

julia> polish_names_same_input_type(df)
3×2 DataFrame
 Row │ some_bad_name  another_weird_name
     │ Missing        Any
─────┼───────────────────────────────────
   1 │       missing  1
   2 │       missing  x
   3 │       missing  3

The rest of Cleaner’s functions could probably get the same treatment in the next major release, but I would like a better name for the family than “function_same_input_type”.

Any recommendations for the naming of this function family are welcome.

I would truly like to join efforts too, but I do think upstreaming Cleaner to TableTransforms would be the best way to do it in order to let users import a single package with all the functionalities and work on more sophisticated pipelines.

My reasons for this are mainly to maintain modularity and its benefits: being able to specialize both packages and giving more freedom of choice to users and maintainers.

There are also important performance reasons, as Cleaner functions rely heavily on knowing the underlying data structure (CleanTable) for further optimization.

For example:

  • Being column-major lets us optimize functions that work column-wise, because the compiler can use SIMD.
  • Knowing the table will always be column-based lets us safely multithread operations over each column while still benefiting from the SIMD optimization.
  • Knowing that the data structure can be mutated (not all Tables.jl implementations are mutable, e.g. NamedTuples) lets us use mutating functions and avoid building a new instance of the data type every time a function transforms the data in it.
  • Offering rematerialization of the original Tables.jl implementation as an option, rather than as the default behavior of every function, avoids converting back and forth between types and copying the underlying data, which would end up triggering the garbage collector too often.
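
As a minimal sketch of the first two points, assuming a plain NamedTuple of vectors as a stand-in for a column-major table (this is not CleanTable, and `column_sums` is a hypothetical helper, not a Cleaner.jl function), per-column threading with a SIMD-friendly inner loop could look like this:

```julia
using Base.Threads: @threads

# Hypothetical stand-in for a column-major table: a NamedTuple of vectors.
cols = (a = collect(1.0:1000.0), b = fill(2.0, 1000), c = rand(1000))

# Each column is a contiguous vector, so the inner loop is SIMD-friendly,
# and columns are independent, so each one can go to a different thread.
function column_sums(table)
    vectors = collect(values(table))              # one entry per column
    sums = Vector{Float64}(undef, length(vectors))
    @threads for j in eachindex(vectors)
        col = vectors[j]
        acc = 0.0
        @inbounds @simd for i in eachindex(col)
            acc += col[i]
        end
        sums[j] = acc                             # each thread writes its own slot
    end
    return sums
end

sums = column_sums(cols)
```

With a single thread the outer loop simply runs serially, so the same code works whether or not Julia was started with more than one thread.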

The main disadvantage I see in upstreaming Cleaner’s functionality is the work needed to make it satisfy the TableTransforms API, but I’m willing to take responsibility for it.

I personally think that bringing the advantages listed above to TableTransforms, while maintaining Cleaner as mainly a data cleaning package and TableTransforms as a data transformation superset that groups functionality under a common API, would be the most beneficial for the community.

For example, MLJ.jl does that really well for data science packages providing different models and data science functionalities from different, smaller and specialized packages while grouping them under a common API.

I hope we can reach an agreement and both help improve Julia’s data ecosystem. Thanks for your contributions as well!

PS: I have added TableTransforms.jl to the new “Related Packages” section of Cleaner’s README.

2 Likes

On the other hand, I can argue that premature modularization is a real issue in Julia. We have lots of tiny packages that die after a couple years because the only maintainer is gone.

I think you mean multi-threading in general, not specifically SIMD, right? Notice that transforms in TableTransforms.jl are mostly column-oriented. Each transform can implement whichever pattern is most efficient; we don’t fix that in the design.

I respectfully disagree with this observation. MLJ.jl is a really nice project, but it is not a good example of code reuse and integration with other ecosystems. There is a whole stack of MLJ*.jl packages that only work with MLJ.jl and nothing else. Also, notice that MLJ.jl itself provides pipelines.

I will do the same on the TableTransforms.jl side hoping that this divide is gone in the future :pray:t4:

2 Likes