Better way to clean and normalize a DataFrame of strings?

ben-schulz · April 11, 2021, 6:20pm

hi there,

i have a corpus of messy CSVs with matching columns but irregular formatting, so that a column ( say A ) often has some values with an unwanted metadata prefix, and other values without, like so:

  Row │ A                    
          │ String                            
──────┼───────────────────────────────────
       1 │ text:'value1'
       2 │ text:'value2'
       3 │ value3
       4 | value4

i’d like to normalize columns like A so that none of the values have the unwanted text: prefix. i realize that i could do something like this:

function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end 

function removetypeprefix(x::Array{String, 1})
    [removetypeprefix(_x) for _x in x]
end

result = transform(df, :A => removetypeprefix)

which would produce a new column of the values in A with the prefix removed – but this seems like a lot of boilerplate to write, and i feel like i must be missing a more straightforward way to transform the column.

am i missing a more concise way to do this?

pdeffebach · April 11, 2021, 6:37pm

There are a few options

Setup:

df = DataFrame(A = 
	["text:'value1'"
	"text:'value2'"
	"value3"
	"value4"])

Use ByRow to get rid of the Array method.

function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end 

result = transform(df, :A => ByRow(removetypeprefix))

Use an anonymous function to avoid the declaration

julia> result = transform(df, :A => begin 
       x -> length(x) > 5 && x[1:5] == "text:" ? x[6:end] : x
       end |> ByRow)

(Okay, maybe this last one is a bit ugly)

I would normally recommend using DataFramesMeta in this scenario. But we don’t have support for ByRow at the moment. So it’s not the best fit for this exact problem.

rikh · April 11, 2021, 6:58pm

I’ve created the regex below after experimenting a bit at https://regex101.com/.

julia> df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3"]);

julia> rx = r"text:'([^']*)'";

julia> df.A = replace.(df.A, rx => s"\1");

julia> df
3×1 DataFrame
 Row │ A
     │ String
─────┼────────
   1 │ value1
   2 │ value2
   3 │ value3

rafael.guerra · April 11, 2021, 11:38pm

Another way, using regex for dummies (my category):

using DataFrames, ReadableRegex

df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3"])
rx = exactly(1, ["text:'", "'"])
df.A = replace.(df.A, Regex(rx) => "")

result:

julia> df
3×1 DataFrame
│ Row │ A      │
│     │ String │
├─────┼────────┤
│ 1   │ value1 │
│ 2   │ value2 │
│ 3   │ value3 │

NB: regex might be an overkill in this case, as the following two replace lines achieve the same result:

df.A = replace.(df.A, "text:'" => "")
df.A = replace.(df.A, "'" => "")

kevbonham · April 12, 2021, 1:19am

Seems fine to me - speaking as someone T
that’s done a lot of this kind of thing, sometimes you’re stuck with boiler plate. You can do the transform one-liner, but my advice would be to write a robust and general function, since you’re likely to need something like it again.

By general, I mean make the function take the prefix as an argument, check for startswith(str, pre) and use replace(str, pre=>"") instead of using indices. I spend an unfortunate amount of time with this kind of data cleaning. Embrace the built in string functions and get used to them. Good luck!

Topic		Replies	Views
String column dataframe: replace for another string built from a substring from each row Data	1	1305	January 20, 2021
Remove spaces and units in Dataframe header? New to Julia dataframes	6	1843	September 4, 2020
Replace Character in a column New to Julia dataframes	4	1121	August 24, 2019
Use only the x first letters of strings inside a columns of a DataFrame General Usage	4	1051	November 16, 2017
How to clean/filter/remove wrong data from DataFrame New to Julia question , dataframes	13	735	February 21, 2022

Better way to clean and normalize a DataFrame of strings?

Related topics