After months of development, I finally consider Cleaner.jl to be feature complete and stable enough to release version 1.0
Current features are:
- Format column names to make them unique and fit
snake_case
orcamelCase
style. - Remove rows and columns filled with different kinds of empty values. e.g:
missing
,""
,"NA"
,"None"
- Delete columns filled with just a constant value.
- Delete rows with at least one missing value.
- Use a row as the names of the columns.
- Minimize the amount of element types for each column without making the column of type
Any
. - Add a row index to your table.
- Automatically use multiple threads if your data is big enough (and you are running
Julia
with more than 1 thread). - Rematerialize your original source Tables.jl type, as
CleanTable
implements the Tables.jl interface too. - Apply
Cleaner
transformations on your original table implementation and have the resulting table be of the same type as the original. - Get all repeated values or value combinations that are supposed to be unique.
- Get the percentage distribution of the different categories that make up your table.
- Compare tables to help solve
join
ormerge
problems caused by having different schemas.
Examples:
julia> using DataFrames: DataFrame
julia> using Cleaner
julia> df = DataFrame(" Some bad Name" => [missing, missing, missing], "Another_weird name " => [1, "x", 3])
3Γ2 DataFrame
Row β Some bad Name Another_weird name
β Missing Any
ββββββΌβββββββββββββββββββββββββββββββββββββ
1 β missing 1
2 β missing x
3 β missing 3
julia> df2 = df |> polish_names |> compact_columns! |> reinfer_schema! |> DataFrame
3Γ1 DataFrame
Row β another_weird_name
β Unionβ¦
ββββββΌββββββββββββββββββββ
1 β 1
2 β x
3 β 3
julia> df3 = add_index(df)
βββββββββββββ¬βββββββββββββββββ¬ββββββββββββββββββββββ
β row_index β Some bad Name β Another_weird name β
β Int64 β Missing β Any β
βββββββββββββΌβββββββββββββββββΌββββββββββββββββββββββ€
β 1 β missing β 1 β
β 2 β missing β x β
β 3 β missing β 3 β
βββββββββββββ΄βββββββββββββββββ΄ββββββββββββββββββββββ
julia> compare_table_columns(df, df2, df3)
βββββββββββββββββββββββ¬ββββββββββ¬βββββββββββββββββββββββ¬ββββββββββ
β column_name β tbl1 β tbl2 β tbl3 β
β Symbol β Type β Type β Type β
βββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββββΌββββββββββ€
β Some bad Name β Missing β Nothing β Missing β
β Another_weird name β Any β Nothing β Any β
β another_weird_name β Nothing β Union{Int64, String} β Nothing β
β row_index β Nothing β Nothing β Int64 β
βββββββββββββββββββββββ΄ββββββββββ΄βββββββββββββββββββββββ΄ββββββββββ
If you just want to use a few Cleaner transformations and keep the original table type, we also offer the ROT function variants.
julia> add_index_ROT(df)
3Γ3 DataFrame
Row β row_index Some bad Name Another_weird name
β Int64 Missing Any
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 1 missing 1
2 β 2 missing x
3 β 3 missing 3
For more examples and a comprehensive guide about using Cleaner.jl, feel free to refer to the current stable documentation.