I am working on converting a regular expression from the R format to Julia format. I have made a fair bit of progress, but I am kinda stuck. I was hoping that someone with greater regex mojo might be able to help out.
I am parsing some census data. The original text is:
testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."
I want to extract the first two codes, together with their labels, into a table or array like this:
---------------------------------------
999999999 | N.I.U. |
---------------------------------------
999999998 | Missing. (1962-1964 only) |
---------------------------------------
The original R regular expressions, taken from the ipumsr package, are below.
"^(?<val>-?[0-9.,]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(?<lbl>.+?)$"
"^(?<val>[[:graph:]]+)(([[:blank:]]+[[:punct:]|=]+[[:blank:]])+)(?<lbl>.+)$"
I was able to get the code below to run in Julia, but it does not give me the output I am looking for.
c = match(r"([0-9,.]+)(([[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+)(.+)", testex)
This gives me the first entry but not the second, so I am missing something. Also, how do I pull the individual parts (the code and the label) out of each match so that I can build the table or array shown above?
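For reference, here is a rough sketch of the direction I have been trying, in case it helps frame the question. As far as I can tell, `match` only ever returns the first match, while `eachmatch` iterates over all of them, and named groups can be read back with `m[:val]`. The leading-digit requirement and the dropped `^` anchor are my own assumptions (the first code is fused to the word "Codes" in my text); they are not taken from the R package, and I am not sure this is the idiomatic way to do it.

```julia
testex = "Codes999999999 = N.I.U.\n999999998 = Missing. (1962-1964 only)\nValues can be negative.\n\nThe Census Bureau applies different disclosure avoidance measures across time for individuals with high income in this variable. Detailed explanations of these methods, topcodes, and replacement value and swap value thresholds are available here."

# Assumption: a value must start with a digit, so a lone "." or "," in the
# running text is not picked up as a code. No ^ anchor, because the first
# code directly follows the word "Codes".
pat = r"(?<val>-?[0-9][0-9.,]*)(?:[[:blank:]][[:punct:]]|[[:punct:]][[:blank:]]|[[:blank:]]|=)+(?<lbl>.+)"

# eachmatch walks over every non-overlapping match; match() stops at the first.
rows = [(val = String(m[:val]), lbl = String(m[:lbl])) for m in eachmatch(pat, testex)]

# Print one "code | label" line per match.
for r in rows
    println(r.val, " | ", r.lbl)
end
```

Each element of `rows` is a named tuple, so `rows[1].val` and `rows[1].lbl` give the code and label of the first entry.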
Any support is appreciated.