[ANN] Cleaner.jl: A toolbox of simple solutions for common data cleaning problems v1.0

After months of development, I finally consider Cleaner.jl to be feature complete and stable enough to release version 1.0

Current features are:

  • Format column names to make them unique and fit snake_case or camelCase style.
  • Remove rows and columns filled with different kinds of empty values. e.g: missing , "" , "NA" , "None"
  • Delete columns filled with just a constant value.
  • Delete rows with at least one missing value.
  • Use a row as the names of the columns.
  • Minimize the amount of element types for each column without making the column of type Any .
  • Add a row index to your table.
  • Automatically use multiple threads if your data is big enough (and you are running Julia with more than 1 thread).
  • Rematerialize your original source Tables.jl type, as CleanTable implements the Tables.jl interface too.
  • Apply Cleaner transformations on your original table implementation and have the resulting table be of the same type as the original.
  • Get all repeated values or value combinations that are supposed to be unique.
  • Get the percentage distribution of the different categories that make up your table.
  • Compare tables to help solve join or merge problems caused by having different schemas.

Examples:

julia> using DataFrames: DataFrame

julia> using Cleaner

julia> df = DataFrame(" Some bad Name" => [missing, missing, missing], "Another_weird name " => [1, "x", 3])
3Γ—2 DataFrame
 Row β”‚  Some bad Name  Another_weird name
     β”‚ Missing         Any
─────┼─────────────────────────────────────
   1 β”‚        missing  1
   2 β”‚        missing  x
   3 β”‚        missing  3

julia> df2 = df |> polish_names |> compact_columns! |> reinfer_schema! |> DataFrame
3Γ—1 DataFrame
 Row β”‚ another_weird_name
     β”‚ Union…
─────┼────────────────────
   1 β”‚ 1
   2 β”‚ x
   3 β”‚ 3

julia> df3 = add_index(df)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ row_index β”‚  Some bad Name β”‚ Another_weird name  β”‚
β”‚     Int64 β”‚        Missing β”‚                 Any β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚         1 β”‚        missing β”‚                   1 β”‚
β”‚         2 β”‚        missing β”‚                   x β”‚
β”‚         3 β”‚        missing β”‚                   3 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


julia> compare_table_columns(df, df2, df3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         column_name β”‚    tbl1 β”‚                 tbl2 β”‚    tbl3 β”‚
β”‚              Symbol β”‚    Type β”‚                 Type β”‚    Type β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚       Some bad Name β”‚ Missing β”‚              Nothing β”‚ Missing β”‚
β”‚ Another_weird name  β”‚     Any β”‚              Nothing β”‚     Any β”‚
β”‚  another_weird_name β”‚ Nothing β”‚ Union{Int64, String} β”‚ Nothing β”‚
β”‚           row_index β”‚ Nothing β”‚              Nothing β”‚   Int64 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


If you just want to use a few Cleaner transformations and keep the original table type, we also offer the ROT function variants.

julia> add_index_ROT(df)
3Γ—3 DataFrame
 Row β”‚ row_index   Some bad Name  Another_weird name
     β”‚ Int64      Missing         Any
─────┼────────────────────────────────────────────────
   1 β”‚         1         missing  1
   2 β”‚         2         missing  x
   3 β”‚         3         missing  3

For more examples and a comprehensive guide about using Cleaner.jl, feel free to refer to the current stable documentation.

14 Likes

Nice one. DataConvenience.jl has some complimentary functions too. It also has a cleannames! function inspired by janitor

3 Likes