Extracting dates and values from a string

I would appreciate any guidance. I have the text below and would like to extract all dates and the associated values within “dataX.push()”, to end up with, for instance: 2023-01-10, 0.0, 0.0, 0.0, 0.0; 2023-01-11, 0.0, 0.01… I figure this calls for regex or string functions, but I am not exactly certain of the most workable way to do this.

var data0 = [];\n var data1 = [];\n var data2 = [];\n var data3 = [];\n var data4 = [];\n var data5 = [];\n var title1 = 'Processed grapes (tonn)';\n\n data0.push('2023-01-10');\n data1.push(0.0);\n data2.push(0.0);\n data3.push(0.0);\n data4.push(0.0);\n data5.push(97.6);\n data0.push('2023-01-11');\n data1.push(0.0);\n data2.push(0.0);\n data3.push(0.0);\n data4.push(0.0);\n data5.push(174.7);\n data0.push('2023-08-15');\n // console.log(data1);\n\n\n Highcharts.chart('container

Edit 1: Thanks everyone! Super excellent solutions. This has been enlightening and very helpful

Defining str as your text above:

replace.(first.(split.(split(str, "push(")[2:end], ");\n")), "'" => "")

produces:

 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 ⋮
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"
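
To make the one-liner above easier to follow, here is a rough step-by-step sketch of the same idea. The shortened `str` is only a stand-in for the real text (it assumes real newlines after each `;`), and the names `chunks`, `args` and `cleaned` are made up for illustration.

```julia
# Shortened stand-in for the original text (assumption: a real newline follows each ';').
str = "var data0 = [];\ndata0.push('2023-01-10');\ndata1.push(0.0);\ndata5.push(97.6);\ndata0.push('2023-08-15');\n"

# 1. Cut the text at every "push(" and drop the chunk before the first one.
chunks = split(str, "push(")[2:end]

# 2. In each chunk, keep only what comes before the closing ");\n".
args = first.(split.(chunks, ");\n"))

# 3. Remove the single quotes that surround the dates.
cleaned = replace.(args, "'" => "")

println(cleaned)
```

Note that this relies on every call ending in exactly `);` followed by a newline, so it is more sensitive to formatting changes than a regex would be.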

Here’s an alternative using regexes (with text holding the string from the question):

reg = r"data\d\.push\((.*?)\)"
map(m->m.captures, eachmatch(reg, text))
Another option is to capture the numbers and the dates with separate groups:
julia> str="""var data0 = [];\n    var data1 = [];\n    var data2 = [];\n    var data3 = [];\n    var data4 = [];\n    var data5 = [];\n    var title1 =  'Processed grapes (tonn)';\n\n        data0.push('2023-01-10');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(97.6);\n        data0.push('2023-01-11');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(174.7);\n        data0.push('2023-08-15');\n
       """
"var data0 = [];\n    var data1 = [];\n    var data2 = [];\n    var data3 = [];\n    var data4 = [];\n    var data5 = [];\n    var title1 =  'Processed grapes (tonn)';\n\n        data0.push('2023-01-10');\n        data1.push(0.0);\n " ⋯ 76 bytes ⋯ "      data5.push(97.6);\n        data0.push('2023-01-11');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(174.7);\n        data0.push('2023-08-15');\n  \n"

julia> pat=r"\((\d+\.\d+)\)|'(\d+-\d+-\d+)'"
r"\((\d+\.\d+)\)|'(\d+-\d+-\d+)'"

julia> res=String[]
String[]

julia> for (f,l) in eachmatch(pat,str)
           push!(res,something(f,l))
       end

julia> res
13-element Vector{String}:
 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "97.6"
 "2023-01-11"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"
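
The question asked for the values grouped per date, so here is a minimal sketch of one way to turn the flat `res` vector into rows. The helper `looks_like_date` and the names `rows` and `line` are hypothetical, and it assumes every date is written as YYYY-MM-DD and comes before its values.

```julia
# The flat result from above, copied here so the snippet is self-contained.
res = ["2023-01-10", "0.0", "0.0", "0.0", "0.0", "97.6",
       "2023-01-11", "0.0", "0.0", "0.0", "0.0", "174.7",
       "2023-08-15"]

# Hypothetical helper: true when the string has the shape YYYY-MM-DD.
looks_like_date(s) = occursin(r"^\d{4}-\d{2}-\d{2}$", s)

rows = Vector{Vector{String}}()
for item in res
    if looks_like_date(item)
        push!(rows, [item])       # a date starts a new row
    elseif !isempty(rows)
        push!(rows[end], item)    # a number joins the current row
    end
end

# Join each row with ", " and the rows with "; ".
line = join(join.(rows, ", "), "; ")
println(line)
```

The last date ends up in a row of its own because the source text is cut off before its values.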

What does the ? in (.*?) do?

It makes the * matching lazy instead of greedy.

\(.*\) matches 0 or more characters between two parentheses, but * is greedy by default, which means it matches everything from the first opening parenthesis to the last closing parenthesis it can find (on the same line, since . does not match a newline by default).
Adding the ? makes the quantifier lazy, so it matches only up to the first closing parenthesis encountered.
In short, * matches as many characters as possible that fit the pattern, while *? matches as few as possible.
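
A small sketch of the difference, using a made-up single-line string `s` with two push calls (an assumption for illustration, since the original text puts each call on its own line):

```julia
# Two calls on one line, so the greedy version has room to overshoot.
s = "data1.push(0.0); data2.push(97.6);"

# Greedy: runs from the first "(" to the last ")".
greedy = match(r"\((.*)\)", s).captures[1]

# Lazy: stops at the first ")" after the opening "(".
lazy = match(r"\((.*?)\)", s).captures[1]

# With the lazy pattern, eachmatch finds each argument separately.
all_lazy = [m.captures[1] for m in eachmatch(r"\((.*?)\)", s)]

println(greedy)    # 0.0); data2.push(97.6
println(lazy)      # 0.0
println(all_lazy)
```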


Thanks for clarifying.
Can you, using your scheme, exclude the ' around the dates?

julia> reg = r"push\('?(.*?)'?\)"
r"push\('?(.*?)'?\)"

julia> map(m->m.captures[1], eachmatch(reg, text))
13-element Vector{SubString{String}}:
 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "97.6"
 "2023-01-11"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"

The extra '? before and after the capture group excludes a ' if present.

reg = r"push\('*(.*?)'*\)"

This form should also achieve the same result.
As far as I know, ? after a character indicates 1 or more occurrences of that character, while * after a character indicates 0 or more occurrences. What is the role of ? in this case?

Yes, * is 0 or more, as many as possible, while ? is 0 or 1, preferring 1 when it is there.
It’s the equivalent of “optional” in regexes.
The * version would also match the case in which you have multiple '.
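
To see where the two patterns differ, here is a quick sketch with two made-up inputs: `s1` has the usual single quotes, and `s2` has doubled quotes (a hypothetical case, not something present in the original text).

```julia
optional = r"push\('?(.*?)'?\)"   # at most one ' on each side
repeated = r"push\('*(.*?)'*\)"   # any number of ' on each side

s1 = "data0.push('2023-01-10');"
s2 = "data0.push(''2023-01-10'');"   # hypothetical input with doubled quotes

# With single quotes, both patterns capture the bare date.
a = match(optional, s1).captures[1]
b = match(repeated, s1).captures[1]

# With doubled quotes, only the * version strips all of them.
c = match(optional, s2).captures[1]
d = match(repeated, s2).captures[1]

println((a, b, c, d))
```

For the data in this thread the two behave the same, so the choice only matters if stray extra quotes can occur.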