Better way to clean and normalize a DataFrame of strings?

hi there,

I have a corpus of messy CSVs with matching columns but irregular formatting: a column (say A) often has some values with an unwanted metadata prefix and other values without, like so:

 Row │ A
     │ String
─────┼───────────────
   1 │ text:'value1'
   2 │ text:'value2'
   3 │ value3
   4 │ value4

I’d like to normalize columns like A so that none of the values have the unwanted text: prefix. I realize I could do something like this:

function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end

function removetypeprefix(x::Array{String, 1})
    [removetypeprefix(_x) for _x in x]
end

result = transform(df, :A => removetypeprefix)

which would produce a new column of the values in A with the prefix removed, but this seems like a lot of boilerplate to write, and I feel like I must be missing a more straightforward way to transform the column.

Am I missing a more concise way to do this?

There are a few options:

Setup:

df = DataFrame(A = 
	["text:'value1'"
	"text:'value2'"
	"value3"
	"value4"])
  1. Use ByRow to get rid of the Array method.
function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end

result = transform(df, :A => ByRow(removetypeprefix))
  2. Use an anonymous function to avoid the declaration
julia> result = transform(df, :A => begin 
       x -> length(x) > 5 && x[1:5] == "text:" ? x[6:end] : x
       end |> ByRow)

(Okay, maybe this last one is a bit ugly)

I would normally recommend using DataFramesMeta in this scenario. But we don’t have support for ByRow at the moment. So it’s not the best fit for this exact problem.
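(For completeness, a hedged sketch of what this could look like once DataFramesMeta gains row-wise support; this assumes a newer DataFramesMeta version with the @rtransform macro, which may not match the version available when the comment above was written:)

```julia
using DataFrames, DataFramesMeta

df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3", "value4"])

# @rtransform evaluates the expression row by row, so :A is a single String here.
result = @rtransform(df, :A = startswith(:A, "text:") ? :A[6:end] : :A)
```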

I’ve created the regex below after experimenting a bit at https://regex101.com/.

julia> df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3"]);

julia> rx = r"text:'([^']*)'";

julia> df.A = replace.(df.A, rx => s"\1");

julia> df
3×1 DataFrame
 Row │ A
     │ String
─────┼────────
   1 │ value1
   2 │ value2
   3 │ value3

Another way, using regex for dummies (my category):

using DataFrames, ReadableRegex

df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3"])
rx = exactly(1, ["text:'", "'"])
df.A = replace.(df.A, Regex(rx) => "")

result:

julia> df
3×1 DataFrame
│ Row │ A      │
│     │ String │
├─────┼────────┤
│ 1   │ value1 │
│ 2   │ value2 │
│ 3   │ value3 │

NB: regex might be overkill in this case, as the following two replace lines achieve the same result:

df.A = replace.(df.A, "text:'" => "")
df.A = replace.(df.A, "'" => "")

Seems fine to me - speaking as someone that’s done a lot of this kind of thing, sometimes you’re stuck with boilerplate. You can do the transform one-liner, but my advice would be to write a robust and general function, since you’re likely to need something like it again.

By general, I mean make the function take the prefix as an argument, check for startswith(str, pre), and use replace(str, pre => "") instead of relying on indices. I spend an unfortunate amount of time on this kind of data cleaning. Embrace the built-in string functions and get used to them. Good luck!
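As a sketch of that advice (the helper name strip_prefix is my own, not from the thread), a general prefix-stripping function built from startswith and replace might look like:

```julia
using DataFrames

# Generic helper: remove `pre` from the front of `str` if present.
# count = 1 ensures only the leading occurrence is replaced.
# (On Julia 1.8+, Base.chopprefix does the same job.)
strip_prefix(str::AbstractString, pre::AbstractString) =
    startswith(str, pre) ? replace(str, pre => ""; count = 1) : str

df = DataFrame(A = ["text:'value1'", "text:'value2'", "value3", "value4"])
result = transform(df, :A => ByRow(x -> strip_prefix(x, "text:")) => :A)
```

Since the prefix is a parameter, the same helper works for the next messy CSV with a different prefix.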