Extracting dates and values from a string

jovansam · January 30, 2025, 6:31pm

I would appreciate any guidance. I have the text below and would like to extract all dates and the associated values inside the “dataX.push()” calls, to end up with, for instance: 2023-01-10, 0.0, 0.0, 0.0, 0.0; 2023-01-11, 0.0, 0.01… I figure I should work with regex or string functions, but I am not exactly certain which is the most workable way to do this.

var data0 = [];\n var data1 = [];\n var data2 = [];\n var data3 = [];\n var data4 = [];\n var data5 = [];\n var title1 = 'Processed grapes (tonn)';\n\n data0.push('2023-01-10');\n data1.push(0.0);\n data2.push(0.0);\n data3.push(0.0);\n data4.push(0.0);\n data5.push(97.6);\n data0.push('2023-01-11');\n data1.push(0.0);\n data2.push(0.0);\n data3.push(0.0);\n data4.push(0.0);\n data5.push(174.7);\n data0.push('2023-08-15');\n // console.log(data1);\n\n\n Highcharts.chart('container

Edit 1: Thanks everyone! Super excellent solutions. This has been enlightening and very helpful

rafael.guerra · January 30, 2025, 7:56pm

Defining str as your text above:

replace.(first.(split.(split(str, "push(")[2:end], ");\n")), "'" => "")

produces:

 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 ⋮
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"

cshen · January 30, 2025, 8:05pm

Here’s an alternative using regexes:

reg = r"data\d\.push\((.*?)\)"
map(m->m.captures, eachmatch(reg, text))

rocco_sprmnt21 · January 30, 2025, 8:39pm

julia> str="""var data0 = [];\n    var data1 = [];\n    var data2 = [];\n    var data3 = [];\n    var data4 = [];\n    var data5 = [];\n    var title1 =  'Processed grapes (tonn)';\n\n        data0.push('2023-01-10');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(97.6);\n        data0.push('2023-01-11');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(174.7);\n        data0.push('2023-08-15');\n
       """
"var data0 = [];\n    var data1 = [];\n    var data2 = [];\n    var data3 = [];\n    var data4 = [];\n    var data5 = [];\n    var title1 =  'Processed grapes (tonn)';\n\n        data0.push('2023-01-10');\n        data1.push(0.0);\n " ⋯ 76 bytes ⋯ "      data5.push(97.6);\n        data0.push('2023-01-11');\n        data1.push(0.0);\n        data2.push(0.0);\n        data3.push(0.0);\n        data4.push(0.0);\n        data5.push(174.7);\n        data0.push('2023-08-15');\n  \n"

julia> pat=r"\((\d+\.\d+)\)|'(\d+-\d+-\d+)'"
r"\((\d+\.\d+)\)|'(\d+-\d+-\d+)'"

julia> res=String[]
String[]

julia> for (f,l) in eachmatch(pat,str)
           push!(res,something(f,l))
       end

julia> res
13-element Vector{String}:
 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "97.6"
 "2023-01-11"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"
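Because the pattern has two capture groups, one for numbers and one for dates, the same loop can also convert each match to a typed value instead of keeping strings. The following is only a sketch under assumptions: the shortened sample `str`, the names `num`, `date`, and `parsed`, and the `yyyy-mm-dd` date format are illustrative and not part of the posts above.

```julia
using Dates

# Shortened, hypothetical sample in the same shape as the original text.
str = """
data0.push('2023-01-10');
data1.push(0.0);
data5.push(97.6);
"""

pat = r"\((\d+\.\d+)\)|'(\d+-\d+-\d+)'"

parsed = Union{Date,Float64}[]
for m in eachmatch(pat, str)
    # For each match, exactly one of the two groups is `nothing`.
    num, date = m.captures
    if date !== nothing
        push!(parsed, Date(date, dateformat"yyyy-mm-dd"))
    else
        push!(parsed, parse(Float64, num))
    end
end
```

Here `parsed` holds a `Date` followed by two `Float64` values, which may be more convenient for later arithmetic or plotting than a `Vector{String}`.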

rocco_sprmnt21 · January 30, 2025, 8:48pm

what does the ? in (.*?) do?

cshen · January 30, 2025, 9:01pm

It makes the * match lazily instead of greedily.

\(.*\) matches 0 or more characters between two parentheses, but * is greedy (eager) by default, which means it matches everything from the first opening parenthesis it finds up to the last closing parenthesis it finds.
Adding the ? makes the operator lazy, so it matches only up to the first closing parenthesis encountered.
In short, * matches as many characters as possible that fit the pattern, while *? matches as few as possible.
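The difference is easiest to see when two push calls share one line. This is a small sketch on a made-up input (`line`, `greedy`, and `lazy` are illustrative names, not taken from the thread):

```julia
# Hypothetical sample with two push calls on the same line.
line = "data1.push(0.0); data2.push(97.6);"

greedy = match(r"\((.*)\)", line)   # .* runs up to the last ')' on the line
lazy   = match(r"\((.*?)\)", line)  # .*? stops at the first ')'

println(greedy.captures[1])  # 0.0); data2.push(97.6
println(lazy.captures[1])    # 0.0
```

In the original text every push call sits on its own line and `.` does not match a newline by default, so the greedy form would likely work there too, but the lazy form does not depend on that layout.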

rocco_sprmnt21 · January 30, 2025, 9:05pm

Thanks for clarifying.
Can you, using your scheme, exclude the ' characters around the dates?

cshen · January 30, 2025, 9:28pm

julia> reg = r"push\('?(.*?)'?\)"
r"push\('?(.*?)'?\)"

julia> map(m->m.captures[1], eachmatch(reg, text))
13-element Vector{SubString{String}}:
 "2023-01-10"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "97.6"
 "2023-01-11"
 "0.0"
 "0.0"
 "0.0"
 "0.0"
 "174.7"
 "2023-08-15"

The extra '? before and after the capture group excludes a ' if present.
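To get from this flat list to the grouping asked for in the original question (one date followed by its values), one possible sketch also captures the digit after `data` and starts a new row whenever it is 0. This assumes that every `data0.push` call opens a new record; the names `rows`, `idx`, `val`, and `result` are illustrative.

```julia
# Sample text in the same shape as the original question.
text = """
data0.push('2023-01-10');
data1.push(0.0);
data2.push(0.0);
data3.push(0.0);
data4.push(0.0);
data5.push(97.6);
data0.push('2023-01-11');
data1.push(0.0);
data2.push(0.0);
data3.push(0.0);
data4.push(0.0);
data5.push(174.7);
data0.push('2023-08-15');
"""

# First group: the index after "data"; second group: the pushed value.
reg = r"data(\d)\.push\('?(.*?)'?\)"

rows = Vector{Vector{String}}()
for m in eachmatch(reg, text)
    idx, val = m.captures
    idx == "0" && push!(rows, String[])     # assumption: data0 starts a new row
    isempty(rows) || push!(rows[end], val)  # skip values seen before any date
end

result = join(join.(rows, ", "), "; ")
println(result)
```

With this sample the last row contains only the date 2023-08-15, because the text is cut off before its values appear.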

rocco_sprmnt21 · January 30, 2025, 9:58pm

reg = r"push\('*(.*?)'*\)"

This form should also achieve the same result.
As far as I know, ? after a character indicates 1 or more occurrences of that character, while * after a character indicates 0 or more occurrences. What is the role of ? in this case?

cshen · January 30, 2025, 10:02pm

Yes, * is 0 or more, as many as possible, while ? is 0 or 1, preferring 1 when possible.
It’s the equivalent of “optional” in regexes.
The * version would also match the case in which you have multiple ' characters in a row.
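A quick way to check this is to compare the two patterns on an input with doubled quotes. The inputs and the names `single`, `doubled`, `optional`, and `repeated` below are made up for illustration:

```julia
# Hypothetical inputs: one normally quoted date, one wrapped in doubled quotes.
single  = "data0.push('2023-01-10');"
doubled = "data0.push(''2023-01-10'');"

optional = r"push\('?(.*?)'?\)"   # at most one quote on each side
repeated = r"push\('*(.*?)'*\)"   # any number of quotes on each side

println(match(optional, single).captures[1])   # 2023-01-10
println(match(repeated, single).captures[1])   # 2023-01-10
println(match(optional, doubled).captures[1])  # '2023-01-10'
println(match(repeated, doubled).captures[1])  # 2023-01-10
```

Both patterns agree on the normal input. On the doubled input, the ? version strips only one quote from each side, so one quote stays inside the capture.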
