[ANN] Cleaner.jl: A toolbox of simple solutions for common data cleaning problems v1.0

TheRoniOne · February 6, 2022, 4:57pm

After months of development, I finally consider Cleaner.jl feature-complete and stable enough to release version 1.0.

Current features are:

Format column names to make them unique and fit snake_case or camelCase style.
Remove rows and columns filled with different kinds of empty values, e.g. missing, "", "NA", "None".
Delete columns filled with just a constant value.
Delete rows with at least one missing value.
Use a row as the names of the columns.
Minimize the number of element types for each column without making the column of type Any.
Add a row index to your table.
Automatically use multiple threads if your data is big enough (and you are running Julia with more than 1 thread).
Rematerialize your original source Tables.jl type, as CleanTable implements the Tables.jl interface too.
Apply Cleaner transformations on your original table implementation and have the resulting table be of the same type as the original.
Get all repeated values or value combinations that are supposed to be unique.
Get the percentage distribution of the different categories that make up your table.
Compare tables to help solve join or merge problems caused by having different schemas.
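
To give a rough idea of what the name formatting feature involves, here is a minimal sketch in plain Julia. It is not Cleaner.jl's implementation: `to_snake_case` and `make_unique` are hypothetical helpers written only for this illustration, and the numeric suffix used for duplicates is an assumption.

```julia
# Hypothetical helpers for illustration only; not part of Cleaner.jl.

# Turn an arbitrary column name into snake_case.
function to_snake_case(name::AbstractString)
    s = strip(name)
    # Put an underscore at camelCase boundaries: "someName" -> "some_Name"
    s = replace(s, r"([a-z0-9])([A-Z])" => s"\1_\2")
    # Collapse every run of non-alphanumeric characters into one underscore
    s = replace(s, r"[^A-Za-z0-9]+" => "_")
    # Drop leading/trailing underscores and lowercase the result
    return lowercase(strip(s, '_'))
end

# Make names unique by appending _1, _2, ... to repeats (assumed scheme).
function make_unique(names::Vector{String})
    seen = Dict{String,Int}()
    out = String[]
    for n in names
        k = get(seen, n, 0)
        seen[n] = k + 1
        push!(out, k == 0 ? n : "$(n)_$(k)")
    end
    return out
end

raw = [" Some bad Name", "Another_weird name ", "another weird name"]
println(make_unique(to_snake_case.(raw)))
```

The two steps are kept separate so that uniqueness is enforced after formatting, since two different raw names can collapse into the same formatted name.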

Examples:

julia> using DataFrames: DataFrame

julia> using Cleaner

julia> df = DataFrame(" Some bad Name" => [missing, missing, missing], "Another_weird name " => [1, "x", 3])
3×2 DataFrame
 Row │  Some bad Name  Another_weird name
     │ Missing         Any
─────┼─────────────────────────────────────
   1 │        missing  1
   2 │        missing  x
   3 │        missing  3

julia> df2 = df |> polish_names |> compact_columns! |> reinfer_schema! |> DataFrame
3×1 DataFrame
 Row │ another_weird_name
     │ Union…
─────┼────────────────────
   1 │ 1
   2 │ x
   3 │ 3

julia> df3 = add_index(df)
┌───────────┬────────────────┬─────────────────────┐
│ row_index │  Some bad Name │ Another_weird name  │
│     Int64 │        Missing │                 Any │
├───────────┼────────────────┼─────────────────────┤
│         1 │        missing │                   1 │
│         2 │        missing │                   x │
│         3 │        missing │                   3 │
└───────────┴────────────────┴─────────────────────┘


julia> compare_table_columns(df, df2, df3)
┌─────────────────────┬─────────┬──────────────────────┬─────────┐
│         column_name │    tbl1 │                 tbl2 │    tbl3 │
│              Symbol │    Type │                 Type │    Type │
├─────────────────────┼─────────┼──────────────────────┼─────────┤
│       Some bad Name │ Missing │              Nothing │ Missing │
│ Another_weird name  │     Any │              Nothing │     Any │
│  another_weird_name │ Nothing │ Union{Int64, String} │ Nothing │
│           row_index │ Nothing │              Nothing │   Int64 │
└─────────────────────┴─────────┴──────────────────────┴─────────┘
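
The `Union{Int64, String}` type in the comparison above comes from narrowing a column that started out as `Any`. The idea can be sketched in plain Julia as below. This is only an illustration under my own assumptions: `narrow_eltype` and its `max_types` keyword are hypothetical names, not Cleaner.jl's code.

```julia
# Hypothetical sketch; `narrow_eltype` and `max_types` are not taken from Cleaner.jl.
function narrow_eltype(col::AbstractVector; max_types::Int = 3)
    # Collect the concrete types that actually occur in the column
    types = unique(typeof.(col))
    # Too many distinct types to be useful: leave the column as it is
    length(types) > max_types && return col
    # Build the smallest Union covering every observed type and convert
    T = Union{types...}
    return convert(Vector{T}, col)
end

col = Any[1, "x", 3]
narrowed = narrow_eltype(col)
println(eltype(col), " => ", eltype(narrowed))
```

A small `Union` keeps the values unchanged while giving the compiler far more type information than `Any` does.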

If you just want to use a few Cleaner transformations and keep the original table type, we also offer the ROT function variants.

julia> add_index_ROT(df)
3×3 DataFrame
 Row │ row_index   Some bad Name  Another_weird name
     │ Int64      Missing         Any
─────┼────────────────────────────────────────────────
   1 │         1         missing  1
   2 │         2         missing  x
   3 │         3         missing  3

For more examples and a comprehensive guide about using Cleaner.jl, feel free to refer to the current stable documentation.

xiaodai · February 6, 2022, 11:54pm

Nice one. DataConvenience.jl has some complementary functions too. It also has a cleannames! function inspired by janitor.
