Regex assistance converting from R to Julia

I am working on converting a regular expression from the R format to Julia format. I have made a fair bit of progress, but I am kinda stuck. I was hoping that someone with greater regex mojo might be able to help out.

I am parsing some census data. The original text is:

testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

I want to extract the first two codes into a table or array, like this:

---------------------------------------
999999999 | N.I.U.                    | 
---------------------------------------
999999998 | Missing. (1962-1964 only) |
---------------------------------------

The original R regular expressions are below, taken from the IPUMSR package.

"^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"

 "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

I was able to get the code below working, but I am still not getting the exact outputs I am looking for.

c = match(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", testex)

However this code gives me basically the first entry, but not the second one. So I am missing something. Also, how do I extract the different sections of the matches to create the table or array that I am looking to create.

Any support is appreciated.

Try replacing match with eachmatch

If I read the regex correctly, it also matches part of the final paragraph: the '.' after "variable" satisfies [0-9,.]+, and the rest of the text follows it.

julia> em = eachmatch(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", testex)
Base.RegexMatchIterator(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here.", false)

julia> [e.match for e in em]
3-element Vector{SubString{String}}:
 "999999999 = N.I.U."
 "999999998 = Missing. (1962-1964 only)"
 ". Detailed explanations of thes" ⋯ 57 bytes ⋯ " thresholds are available here."

If, beyond experimenting with regexes, the goal is simply to extract the first two lines, each containing a number, an '=' and some text, it can be done as follows.


sc=split(testex,'\n')

idx=findall(e->contains(e,"="),sc)

split.(sc[idx],"=")

and then continue with the cleanup.
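
For the cleanup step, a minimal sketch might look like the following. It assumes a shortened copy of testex, that every code line contains an '=', and that a non-numeric prefix such as "Codes" should be dropped from the value. The names text, rows and cleaned are only for illustration.

```julia
# Shortened copy of the original text (assumption: the first three lines are enough here)
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

# Keep only lines containing '=', split once on it, and trim surrounding whitespace
rows = [strip.(split(line, "="; limit=2)) for line in split(text, '\n') if contains(line, "=")]

# Drop any leading non-numeric prefix (e.g. "Codes") from the value part
cleaned = [(value = replace(r[1], r"^[^0-9-]+" => ""), label = String(r[2])) for r in rows]
```

Using limit=2 keeps any further '=' characters inside the label instead of splitting on them.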

These regexes are in the same format as in Julia (Perl-compatible regular expressions), so you can use them without converting anything. The only catch is that testex is a multiline string, and you want ^...$ to match a whole line rather than the whole string. For this you need to add the m flag to the regex by writing it r"..."m.
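
To illustrate the effect of the m flag, here is a small sketch on a toy two-line string (not the census text; the pattern and variable names are made up for the example):

```julia
text = "a = 1\nb = 2"   # toy two-line string

# Without the m flag, ^ and $ anchor to the whole string, so nothing matches
no_flag = collect(eachmatch(r"^(\w) = (\d)$", text))

# With the m flag, ^ and $ anchor to the start and end of each line
with_flag = collect(eachmatch(r"^(\w) = (\d)$"m, text))

length(no_flag), length(with_flag)   # (0, 2)
```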

Note that the first regex starts with ^(?<val>-?[0-9.,]+). This will only match lines that start with an optional minus sign -, followed by one or more digits, periods or commas. This can match the second line, which starts with 999999998, but not the first line, which starts with C.
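
As a quick check of this, the first regex can be run with the m flag on a shortened copy of the text (the shortening and the names text, regex1 and found are assumptions for the sketch):

```julia
# Shortened copy of testex: only the first three lines
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

# First IPUMSR regex, unchanged except for the added m flag
regex1 = r"^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"m

# Only the line starting with digits is picked up; the "Codes..." line is skipped
found = [(m[:val], m[:lbl]) for m in eachmatch(regex1, text)]
```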

On the other hand the second regex starts with ^(?<val>[[:graph:]]+). This will match any line that starts with graphical characters (alphabetic, numeric or punctuation). This can match both lines, so let’s use this one:

testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

regex = r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"m

We can use eachmatch as suggested by @danielwe to match all the lines with a single call. Each match can be indexed with the capture names (val and lbl) that we have defined in the regex. We use this to collect the values we want:

julia> matches = [[m[:val], m[:lbl]] for m in eachmatch(regex, testex)]
2-element Vector{Vector{SubString{String}}}:
 ["Codes999999999", "N.I.U."]
 ["999999998", "Missing. (1962-1964 only)"]

Then to have the codes in a single array:

julia> stack(matches, dims=1)
2×2 Matrix{SubString{String}}:
 "Codes999999999"  "N.I.U."
 "999999998"       "Missing. (1962-1964 only)"

If you want to put the result in a data frame, it’s better to gather the codes as a vector of named tuples rather than a vector of vectors:

using DataFrames

julia> matches = [(value=m[:val], label=m[:lbl]) for m in eachmatch(regex, testex)]
2-element Vector{@NamedTuple{value::SubString{String}, label::SubString{String}}}:
 (value = "Codes999999999", label = "N.I.U.")
 (value = "999999998", label = "Missing. (1962-1964 only)")

julia> DataFrame(matches)
2×2 DataFrame
 Row │ value           label
     │ SubStrin…       SubStrin…
─────┼───────────────────────────────────────────
   1 │ Codes999999999  N.I.U.
   2 │ 999999998       Missing. (1962-1964 only)

It seems the original regexes were designed to work on single lines, while testex is a multiline string.

Try using split(testex, '\n') or iterating over eachsplit(testex, '\n') and applying the regexes on each line.

I see someone already mentioned this. The difficulty is with ^ and $ which match the start and end of the string instead of a line, so a string with the complete multiline text confuses the regex.
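
A rough sketch of this line-by-line approach, assuming a shortened copy of testex and the second IPUMSR regex without the m flag (the names text, line_regex and pairs_found are illustrative only):

```julia
# Shortened copy of testex: only the first three lines
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

# No m flag needed: each line is matched as its own string
line_regex = r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

pairs_found = Tuple{String,String}[]
for line in split(text, '\n')
    m = match(line_regex, line)
    m === nothing && continue   # skip lines that do not look like "code = label"
    push!(pairs_found, (String(m[:val]), String(m[:lbl])))
end
```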

Is that better than using the m flag as I proposed above?

Nope. The m flag is better (I only just learned about it, thanks).

I don’t want to suggest alternative solutions; I’m just asking for some clarification on how regular expressions work.
By trial and error (without knowing how many of the subexpressions work) I found that the following form gives the desired result.
I tried to remove everything "superfluous" and to search for groups between two delimiters ([^|'\n'] and '\n') without using the r"…"m modifier.
I wonder what nested pairs of parentheses such as (((pattern))) are for.
From what I have seen, they do nothing other than repeat the captured group once per pair of parentheses.
That could easily be done in post-processing, so are they used for anything else?
And why is the first character lost?

testex = "Codes999999999 = N.I.U.--#@\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

regex3 = r"[^|\n](?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"
matches = [(value=m[:val],equals=(m[2],m[3]), label=m[:lbl]) for m in eachmatch(regex3, testex)]

julia> DataFrame(matches)
2×3 DataFrame
 Row │ value          equals          label
     │ SubStrin…      Tuple…          SubStrin…
─────┼──────────────────────────────────────────────────────────
   1 │ odes999999999  (" = ", " = ")  N.I.U.--#@
   2 │ 99999998       (" = ", " = ")  Missing. (1962-1964 only)

This version recovers the first character:

julia> regex4 = r"[\n|^]?(?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"   
r"[\n|^]?(?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"

julia> matches = [(value=m[:val],equals=(m[2],m[3]), label=m[:lbl]) for m in eachmatch(regex4, testex)]
2-element Vector{NamedTuple{(:value, :equals, :label), Tuple{SubString{String}, Tuple{SubString{String}, SubString{String}}, SubString{String}}}}:
 (value = "Codes999999999", equals = (" = ", " = "), label = "N.I.U.--#@")
 (value = "999999998", equals = (" = ", " = "), label = "Missing. (1962-1964 only)")

Ahh, I had not seen eachmatch. That was a big part of the solution. Thanks for that.

Wow, this really solved the problem. That is really helpful. Regex is definitely not my forte, and I keep forgetting what all the symbols mean if I don’t use them frequently. Thanks again for your help. This makes a lot more sense now.

This is helpful, Rocco. It makes sense. I was thinking along these lines until @sijo's response. But it is good to know there are multiple ways to approach the problem.
Thanks again.

Thanks for the help with this. Yeah, I did not even think about the whole multiline issue. That helped a lot.

Yeah, I don’t think the nested parentheses in a pattern of the form ((...)) do anything except change the indices of the matching groups. Take for example

r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

The inner parentheses in (([[:blank:]]+[[:punct:]|=]+[[:blank:]])+) are important to define what the last + should do, but the outer parentheses can be removed.
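
A toy illustration of how extra parentheses only shift the capture indices (the patterns here are made up for the example and are not from IPUMSR):

```julia
# The doubled parentheses capture the same text twice
m_nested = match(r"(a)((b))(c)", "abc")
m_nested.captures        # "a", "b", "b", "c"

# Without the outer pair there is one capture fewer and later indices shift
m_flat = match(r"(a)(b)(c)", "abc")
m_flat.captures          # "a", "b", "c"

# A non-capturing group (?:...) still groups for + but adds no capture at all
m_nc = match(r"(a)(?:b)+(c)", "abbc")
m_nc.captures            # "a", "c"
```

If the separator is never needed afterwards, a non-capturing group is one way to keep the grouping for + without adding an index.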

Inside square brackets, the operator characters lose their meaning: they become literal characters that define a character class, and if the class starts with ^ the match is inverted. So [^|\n] will match a single character that is not | and not \n.

In [\n|^]? the ^ doesn’t come first so it doesn’t mean to invert the match: [\n|^] is a class that matches either \n, | or ^. The ? afterwards means that this part of the regexp is optional, and indeed the regexp matches the string in such a way that [\n|^] matches nothing, i.e. doesn’t consume any character.
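
Both character classes can be checked on the first line of the data. This is a small sketch with the patterns cut down to the relevant part; m1 and m2 are illustrative names.

```julia
# [^|\n] consumes exactly one character that is neither '|' nor a newline,
# which is why the first character went missing from val
m1 = match(r"[^|\n](?<val>[[:graph:]]+)", "Codes999999999 = N.I.U.")
m1[:val]                   # "odes999999999"

# In [\n|^] the ^ is not first, so it is a literal: the class matches '\n', '|' or '^'
occursin(r"[\n|^]", "^")   # true
occursin(r"[\n|^]", "a")   # false

# With ? the class is optional and may match nothing, so no character is consumed
m2 = match(r"[\n|^]?(?<val>[[:graph:]]+)", "Codes999999999 = N.I.U.")
m2[:val]                   # "Codes999999999"
```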