I’m hoping to get some tips for cleaning up strings (maybe with piping). For example, take the following vector:
dirtydata = ["O' Malley's Irish Pub", "El Sueño", "123 & ABC & 100%", 1.23, missing, 12]
What I would like to do is convert everything in this vector to a string, and then “standardize” it by removing punctuation, stripping non-ASCII and non-alphanumeric characters, removing all spaces, etc.
I wrote a function that does some of these things, as follows:
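using Unicode  # standard library; provides Unicode.normalize

function convert_clean(arr)
    arr = string.(arr)                                         # convert every element to a String
    arr = Unicode.normalize.(arr, stripmark=true)              # strip diacritics (ñ -> n)
    arr = map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), arr)    # keep only ASCII letters, digits, and _
    return arr
end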
julia> convert_clean(dirtydata)
6-element Array{String,1}:
"OMalleysIrishPub"
"ElSueno"
"123ABC100"
"123"
"missing"
"12"
but I’m wondering if it wouldn’t be cleaner to use piping (|>)? I can’t seem to make it work. I tried several different ways to convert the above, but none of them were successful. Maybe I should just nest all of these operations inside one another?
function convert_clean2(arr)
    return map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), Unicode.normalize.(string.(arr), stripmark=true))
end
I like that I can get it down to one line, but it seems like a nightmare in terms of readability. The first solution just doesn’t feel right either, though, repeatedly re-defining arr inside the function.
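For what it's worth, here is a minimal sketch of what a piped version might look like. The idea is to write the pipeline for a single value and then broadcast it over the vector, since `|>` only passes one argument and the extra arguments to `Unicode.normalize` and `replace` have to be wrapped in anonymous functions. The names `clean` and `convert_clean3` are made up for illustration, and this is only one possible arrangement, not necessarily the idiomatic one.

```julia
using Unicode

# Hypothetical helper: clean a single value with a pipe chain.
# Each anonymous function is parenthesized so the chain reads left to right.
clean(x) = string(x) |>
    (s -> Unicode.normalize(s, stripmark=true)) |>   # strip diacritics
    (s -> replace(s, r"[^a-zA-Z0-9_]" => ""))        # drop everything except ASCII letters, digits, _

# Hypothetical wrapper: broadcast the per-element cleaner over the whole vector.
convert_clean3(arr) = clean.(arr)

dirtydata = ["O' Malley's Irish Pub", "El Sueño", "123 & ABC & 100%", 1.23, missing, 12]
cleaned = convert_clean3(dirtydata)
```

If this behaves the way I expect, `cleaned` should hold the same six strings as the output of `convert_clean` above, without re-assigning `arr` at each step.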