Regex assistance converting from R to Julia

00krishna · April 12, 2024, 2:02am

I am working on converting a regular expression from the R format to Julia format. I have made a fair bit of progress, but I am kinda stuck. I was hoping that someone with greater regex mojo might be able to help out.

I am parsing some census data. The original text is:

testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

I want to extract the first two codes into a table or array, like this:

---------------------------------------
999999999 | N.I.U.                    | 
---------------------------------------
999999998 | Missing. (1962-1964 only) |
---------------------------------------

The original R regular expressions are below, taken from the IPUMSR package.

"^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"

 "^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

I was able to get the code below working, but I am still not getting the exact outputs I am looking for.

c = match(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", testex)

However, this code gives me only the first entry, not the second one, so I am missing something. Also, how do I extract the different sections of each match to build the table or array I am looking for?

Any support is appreciated.

danielwe · April 12, 2024, 3:37am

Try replacing match with eachmatch
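As a minimal sketch of the difference, using a shortened hypothetical input rather than the full census text: match returns only the first hit (or nothing), while eachmatch iterates over every non-overlapping hit.

```julia
# Shortened stand-in for the census text (hypothetical, for illustration only)
text = "1 = one\n2 = two"
re = r"[0-9]+ = .+"   # `.` does not cross a newline by default

first_hit = match(re, text)               # a single RegexMatch, or `nothing`
all_hits = collect(eachmatch(re, text))   # every non-overlapping match

println(first_hit.match)
println([m.match for m in all_hits])
```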

rocco_sprmnt21 · April 12, 2024, 6:51am

If I am not misreading the regex, it also matches the '.' in the last paragraph together with the text that follows it.

julia> em = eachmatch(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", testex)
Base.RegexMatchIterator(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here.", false)

julia> [e.match for e in em]
3-element Vector{SubString{String}}:
 "999999999 = N.I.U."
 "999999998 = Missing. (1962-1964 only)"
 ". Detailed explanations of thes" ⋯ 57 bytes ⋯ " thresholds are available here."

If, apart from experimenting with regular expressions, the goal is simply to extract the lines of the form number = text, it can also be done in the following way.


sc=split(testex,'\n')

idx=findall(e->contains(e,"="),sc)

split.(sc[idx],"=")

and then continue with the cleanup (for example, stripping the whitespace around each piece).
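One possible way to do that cleanup, as a sketch on a shortened copy of the input. The variable names are hypothetical, and treating the leading "Codes" as residue to be dropped is an assumption about the data, not something the original text states.

```julia
# Shortened copy of the thread's input, for illustration
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

rows = Tuple{String,String}[]
for line in split(text, '\n')
    contains(line, "=") || continue
    # limit=2 splits only at the first "=", in case a label contains one
    raw_code, raw_label = split(line, "="; limit=2)
    # Assumption: any non-numeric prefix such as "Codes" is residue to drop
    code = replace(strip(raw_code), r"^[^-0-9]+" => "")
    push!(rows, (code, String(strip(raw_label))))
end

println(rows)
```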

sijo · April 12, 2024, 8:00am

These regexes are in the same format as in Julia (Perl-compatible regular expressions), so you can use them without converting anything. The only catch is that your testex string has multiple lines, and you want ^...$ in the regex to match a whole line rather than the whole string. For this you need to add the m flag to the regex by writing it r"..."m.
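A small sketch of what the m flag changes, on a hypothetical two-line string: without it, ^ and $ anchor to the whole string, so a pattern that describes a single line finds nothing.

```julia
# Hypothetical two-line input, for illustration only
text = "a = 1\nb = 2"

whole_string = r"^\w = \d$"    # ^ and $ anchor to the whole string
per_line     = r"^\w = \d$"m   # with m, ^ and $ anchor to each line

n_plain = length(collect(eachmatch(whole_string, text)))
n_multi = length(collect(eachmatch(per_line, text)))

println((n_plain, n_multi))
```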

Note that the first regex starts with ^(?<val>-?[0-9.,]+). This will only match lines that start with a dash - (optional), followed by a non-zero number of digits/periods/commas. This can match the second line which starts with 999999998, but not the first line which starts with C.
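To illustrate on a shortened copy of the input: with the m flag, the first regex picks up only the second line. If the leading "Codes" is just residue from the source page (an assumption on my part), removing it first lets the same regex match both lines.

```julia
# Shortened copy of the thread's input, for illustration
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

# First regex from the question, with only the m flag added
regex1 = r"^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"m

as_is = [(m[:val], m[:lbl]) for m in eachmatch(regex1, text)]

# Assumption: the "Codes" prefix is residue and can be dropped
cleaned = replace(text, r"^Codes" => "")
after_cleanup = [(m[:val], m[:lbl]) for m in eachmatch(regex1, cleaned)]

println(as_is)
println(after_cleanup)
```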

On the other hand, the second regex starts with ^(?<val>[[:graph:]]+). This will match any line that starts with graphical characters (alphabetic, numeric or punctuation). This can match both lines, so let’s use this one:

testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

regex = r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"m

We can use eachmatch as suggested by @danielwe to match all the lines with a single call. Each match can be indexed with the capture names (val and lbl) that we have defined in the regex. We use this to collect the values we want:

julia> matches = [[m[:val], m[:lbl]] for m in eachmatch(regex, testex)]
2-element Vector{Vector{SubString{String}}}:
 ["Codes999999999", "N.I.U."]
 ["999999998", "Missing. (1962-1964 only)"]

Then to have the codes in a single array:

julia> stack(matches, dims=1)
2×2 Matrix{SubString{String}}:
 "Codes999999999"  "N.I.U."
 "999999998"       "Missing. (1962-1964 only)"

If you want to put the result in a data frame, it’s better to gather the codes as a vector of named tuples rather than a vector of vectors:

using DataFrames

julia> matches = [(value=m[:val], label=m[:lbl]) for m in eachmatch(regex, testex)]
2-element Vector{@NamedTuple{value::SubString{String}, label::SubString{String}}}:
 (value = "Codes999999999", label = "N.I.U.")
 (value = "999999998", label = "Missing. (1962-1964 only)")

julia> DataFrame(matches)
2×2 DataFrame
 Row │ value           label
     │ SubStrin…       SubStrin…
─────┼───────────────────────────────────────────
   1 │ Codes999999999  N.I.U.
   2 │ 999999998       Missing. (1962-1964 only)

Dan · April 12, 2024, 11:35am

It seems the original regexes were designed to work on single lines, while testex is a multiline string.

Try using split(testex, '\n') or iterating over eachsplit(testex, '\n') and applying the regexes on each line.

I see someone already mentioned this. The difficulty is with ^ and $ which match the start and end of the string instead of a line, so a string with the complete multiline text confuses the regex.
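A rough sketch of the per-line approach, on a shortened copy of the input: the second regex from the question is applied unchanged (no m flag) to each line, and lines that do not match are skipped. Note that eachsplit needs Julia 1.8 or later; split works the same way on older versions.

```julia
# Shortened copy of the thread's input, for illustration
text = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative."

# Second regex from the question, without the m flag
line_regex = r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

# match returns `nothing` for lines that do not fit the pattern
line_matches = (match(line_regex, line) for line in eachsplit(text, '\n'))
rows = [(value = m[:val], label = m[:lbl]) for m in line_matches if m !== nothing]

println(rows)
```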

sijo · April 12, 2024, 2:02pm

Is that better than using the m flag as I proposed above?

Dan · April 12, 2024, 2:22pm

Nope. The m flag is better (but I only learned about it just now - thanks)

rocco_sprmnt21 · April 12, 2024, 7:31pm

I don’t want to suggest alternative solutions; I am just asking for some clarification on how regular expressions work.
By trial and error (without knowing how many of the subexpressions work), I found that the following form gives the desired result.
I tried to remove everything “superfluous” and to search for groups between two delimiters ([^|\n] and \n) without using the r"…"m modifier.
I wonder what nested pairs of parentheses such as ((pattern)) are useful for.
From what I have seen, they do nothing other than repeat the captured group as many times as there are pairs of parentheses.
But that could easily be done in post-processing, so are they used for anything else?
And why is the first character lost?

testex = "Codes999999999 = N.I.U.--#@\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

regex3 = r"[^|\n](?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"
matches = [(value=m[:val],equals=(m[2],m[3]), label=m[:lbl]) for m in eachmatch(regex3, testex)]

julia> DataFrame(matches)
2×3 DataFrame
 Row │ value          equals          label                         
     │ SubStrin…      Tuple…          SubStrin…
─────┼──────────────────────────────────────────────────────────    
   1 │ odes999999999  (" = ", " = ")  N.I.U.--#@
   2 │ 99999998       (" = ", " = ")  Missing. (1962-1964 only)

The following variant recovers the first character:

julia> regex4 = r"[\n|^]?(?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"   
r"[\n|^]?(?<val>[[:graph:]]+)(( = ))(?<lbl>.+)\n"

julia> matches = [(value=m[:val],equals=(m[2],m[3]), label=m[:lbl]) for m in eachmatch(regex4, testex)]
2-element Vector{NamedTuple{(:value, :equals, :label), Tuple{SubString{String}, Tuple{SubString{String}, SubString{String}}, SubString{String}}}}:
 (value = "Codes999999999", equals = (" = ", " = "), label = "N.I.U.--#@")
 (value = "999999998", equals = (" = ", " = "), label = "Missing. (1962-1964 only)")

00krishna · April 12, 2024, 8:18pm

Ahh, I had not seen eachmatch. That was a big part of the solution. Thanks for that.

00krishna · April 12, 2024, 8:19pm

Wow, this really solved the problem. That is really helpful. Regex is definitely not my forte, and I keep forgetting what all the symbols mean if I don’t use them frequently. Thanks again for your help. This makes a lot more sense now.

00krishna · April 12, 2024, 8:20pm

This is helpful, Rocco. It makes sense. I was thinking along these lines until @sijo's response. But it is good to know there are multiple ways to approach the problem.
Thanks again.

00krishna · April 12, 2024, 8:22pm

Thanks for the help with this. Yeah, I did not even think about the whole multiline issue. That helped a lot.

sijo · April 13, 2024, 3:12pm

Yeah, I don’t think the nested parentheses in a pattern of the form ((...)) do anything except change the indices of the matching groups. Take for example

r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"

The inner parentheses in (([[:blank:]]+[[:punct:]|=]+[[:blank:]])+) are important to define what the last + should do, but the outer parentheses can be removed.
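A quick way to check this on a single line (the variable names are hypothetical). The third variant uses a non-capturing group (?:...), which keeps the grouping for + without adding a capture index at all.

```julia
line = "999999998 = Missing. (1962-1964 only)"

nested = r"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"
flat   = r"^(?<val>[[:graph:]]+)([[:blank:]]+[[:punct:]|=]+[[:blank:]])+(?<lbl>.+)$"
nocap  = r"^(?<val>[[:graph:]]+)(?:[[:blank:]]+[[:punct:]|=]+[[:blank:]])+(?<lbl>.+)$"

m_nested = match(nested, line)
m_flat   = match(flat, line)
m_nocap  = match(nocap, line)

# Named groups are numbered too, so only the count of groups differs
println(length(m_nested.captures))  # val, outer, inner, lbl
println(length(m_flat.captures))    # val, inner, lbl
println(length(m_nocap.captures))   # val, lbl
println(m_nested[:lbl] == m_flat[:lbl] == m_nocap[:lbl])
```

One nuance: when the separator repeats more than once, the outer group captures the whole run, while the inner group keeps only its last repetition. For a single " = " the two are identical.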

Inside square brackets, the operator characters lose their special meaning and become literal characters that define a character class; if the class starts with ^, the match is inverted. So [^|\n] will match a single character that is not | and not \n.

In [\n|^]? the ^ doesn’t come first, so it doesn’t invert the match: [\n|^] is a class that matches either \n, | or ^. The ? afterwards means that this part of the regex is optional, and indeed the regex matches the string in such a way that [\n|^] doesn’t match, i.e. doesn’t consume any character.
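A small sketch of both classes on hypothetical inputs, including why the first character was lost in regex3 and kept in regex4:

```julia
negated = r"[^|\n]"   # one character that is neither '|' nor a newline
literal = r"[\n|^]"   # one character that is a newline, '|' or '^'

first_allowed = match(negated, "|\nx").match
literal_hits = [m.match for m in eachmatch(literal, "a|b^c\n")]

# The negated class consumes one character ('C') before the capture starts
lost = match(r"[^|\n](?<val>[[:graph:]]+)", "Codes")[:val]
# The optional class matches nothing here, so the capture starts at 'C'
kept = match(r"[\n|^]?(?<val>[[:graph:]]+)", "Codes")[:val]

println((first_allowed, literal_hits, lost, kept))
```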
