i have a corpus of messy CSVs with matching columns but irregular formatting, so that a column (say A) often has some values with an unwanted metadata prefix, and other values without, like so:
i’d like to normalize columns like A so that none of the values have the unwanted text: prefix. i realize that i could do something like this:
function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end

function removetypeprefix(x::Array{String, 1})
    [removetypeprefix(_x) for _x in x]
end

result = transform(df, :A => removetypeprefix)
which would produce a new column of the values in A with the prefix removed – but this seems like a lot of boilerplate to write, and i feel like i must be missing a more straightforward way to transform the column.
function removetypeprefix(x::String)
    if length(x) > 5 && x[1:5] == "text:"
        return x[6:end]
    else
        return x
    end
end

result = transform(df, :A => ByRow(removetypeprefix))
Use an anonymous function to avoid the declaration
julia> result = transform(df, :A => begin
           x -> length(x) > 5 && x[1:5] == "text:" ? x[6:end] : x
       end |> ByRow)
(Okay, maybe this last one is a bit ugly)
I would normally recommend using DataFramesMeta in this scenario. But we don’t have support for ByRow at the moment. So it’s not the best fit for this exact problem.
Seems fine to me. Speaking as someone that’s done a lot of this kind of thing, sometimes you’re stuck with boilerplate. You can do the transform one-liner, but my advice would be to write a robust and general function, since you’re likely to need something like it again.
By general, I mean make the function take the prefix as an argument, check for startswith(str, pre) and use replace(str, pre=>"") instead of using indices. I spend an unfortunate amount of time with this kind of data cleaning. Embrace the built-in string functions and get used to them. Good luck!
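For what it's worth, here is a minimal sketch of the kind of general function described above. The name stripprefix and its signature are made up for illustration, and I've added count=1 to the replace call (my own assumption, not part of the advice above) so that only the leading occurrence is removed if the prefix text happens to appear again later in the value.

```julia
# Hypothetical helper (name and signature are illustrative, not from this thread):
# remove `pre` from the start of `str` if it is there, otherwise return `str` as-is.
function stripprefix(str::AbstractString, pre::AbstractString)
    # `startswith` confirms the prefix sits at the very start, so limiting
    # `replace` to one match (`count=1`) removes exactly that leading prefix.
    return startswith(str, pre) ? replace(str, pre => ""; count=1) : str
end

# Plain-vector demo using broadcasting; no DataFrames needed to try it.
raw = ["text:hello", "world", "text:", "text:a text:b"]
cleaned = stripprefix.(raw, "text:")

# With a DataFrame `df` that has a column :A, the same helper could be applied as:
#   transform(df, :A => ByRow(x -> stripprefix(x, "text:")) => :A)
```

Note that, unlike the index-based version, this also turns a value that is exactly "text:" into an empty string. Whether that is what you want depends on your data.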