Cleaning up strings via piping...?

I’m hoping to get some tips for cleaning up strings (maybe with piping). For example, take the following vector:

dirtydata = ["O' Malley's Irish Pub", "El Sueño", "123 & ABC & 100%", 1.23, missing, 12]

What I would like to do is convert everything in this vector to a string, and then “standardize” each string by removing punctuation, stripping accents and any non-ASCII or non-alphanumeric characters, removing all spaces, etc.

I wrote a function that does some of these things, as follows:

using Unicode  # standard library; needed for Unicode.normalize

function convert_clean(arr)
    arr = string.(arr)
    arr = Unicode.normalize.(arr, stripmark=true)
    arr = map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), arr)
    return arr
end

julia> convert_clean(dirtydata)
6-element Array{String,1}:
 "OMalleysIrishPub"
 "ElSueno"
 "123ABC100"
 "123"
 "missing"
 "12"

but I’m wondering whether it would be cleaner to use piping with |>. I can’t seem to make it work: I tried several different ways to convert the above, but none of them were successful. Maybe I should just nest all of these operations inside one another?

function convert_clean2(arr)
    return map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), Unicode.normalize.(string.(arr), stripmark=true))
end

I like that I can get it down to one line, but it seems like a nightmare in terms of readability. The first solution just doesn’t feel right either though, repeatedly re-defining arr inside the function.

In general, I would avoid creating a whole sequence of intermediate arrays. Using map with a do block is pretty clear, and it processes each element in one pass instead of building an intermediate array per step:

convert_clean(arr) = map(arr) do x
    s = string(x)
    s = Unicode.normalize(s, stripmark=true)
    s = replace(s, r"[^a-zA-Z0-9_]" => "")
end

for example.
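
In case the do block syntax is unfamiliar, here is a minimal sketch of what it does: the block body becomes an anonymous function that is passed to map as its first argument. The helper name clean_one is made up for this illustration and is not part of the code above.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical helper (name invented for this sketch): clean a single value.
clean_one(x) = replace(Unicode.normalize(string(x), stripmark=true),
                       r"[^a-zA-Z0-9_]" => "")

dirtydata = ["O' Malley's Irish Pub", "El Sueño", "123 & ABC & 100%", 1.23, missing, 12]

# Form 1: do block. The body is turned into an anonymous function of x
# and handed to map as its first argument.
with_do = map(dirtydata) do x
    clean_one(x)
end

# Form 2: the same call with the anonymous function written out explicitly.
with_lambda = map(x -> clean_one(x), dirtydata)
```

Both forms should give the same result; the do block mostly pays off when the function body spans several lines.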

You could also use .|> here to apply a bunch of functions elementwise, but it is a bit awkward because of the need to explicitly construct anonymous functions:

convert_clean(arr) = arr .|> string .|>
         (s -> Unicode.normalize(s, stripmark=true)) .|>
         (s -> replace(s, r"[^a-zA-Z0-9_]" => ""))

Hopefully someday you will be able to use a magic underscore as a placeholder argument, something like

convert_clean(arr) = arr .|> string .|>
         Unicode.normalize(_, stripmark=true)  .|>
         replace(_, r"[^a-zA-Z0-9_]" => "")

but not yet.
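
Until something like that exists, one way to keep the pipeline free of inline anonymous functions is to give each step a name first. This is only a sketch: strip_marks, keep_word_chars, and convert_clean_piped are hypothetical names chosen for the example.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical single-argument helpers (names invented for this sketch).
strip_marks(s) = Unicode.normalize(s, stripmark=true)     # drop accents and other marks
keep_word_chars(s) = replace(s, r"[^a-zA-Z0-9_]" => "")   # keep letters, digits, underscore

# Each .|> applies the next function elementwise to the previous result.
convert_clean_piped(arr) = arr .|> string .|> strip_marks .|> keep_word_chars

dirtydata = ["O' Malley's Irish Pub", "El Sueño", "123 & ABC & 100%", 1.23, missing, 12]
cleaned = convert_clean_piped(dirtydata)
```

The trade-off is two extra definitions in exchange for a pipeline that reads left to right without any parentheses.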


Thanks, Steven. I need to learn do blocks once and for all. I see them in other people’s code, but I’ve never really understood exactly what they do, so I rarely use them in my own code. This looks like a great use case, though!
