julia> names(df_ln)
11-element Vector{String}:
"Aluno"
"DRE"
"Questionário: Teste 1 (T1) (Real)"
"Questionário: Primeira Prova (P1) (Real)"
"Questionário: Teste 2 (T2) (Real)"
"Questionário: Segunda Prova (P2) (Real)"
"Questionário: Teste 3 (T3) (Real)"
"Questionário: Teste 4 (T4) (Real)"
"Questionário: Terceira Prova (P3) (Real)"
"Questionário: Segunda Chamada (SC) (Real)"
"Questionário: NA_Presentes (NP) (Real)"
I would like to rename columns 3 to 11 (the last one) so that only the unique abbreviation inside the first pair of parentheses is kept; i.e., the new names should be: T1, P1, T2, P2, T3, T4, P3, SC, NP.
I guess this should be easily achievable with the Base replace or the DataFrames rename! function and a suitable regular expression pair, but I cannot wrap my head around the correct choice…
Thanks in advance
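A minimal sketch of one way to do this, using only Base replace with a regex => substitution pair. The helper name abbrev is hypothetical, the column names are copied from the output above, and the commented-out rename! line assumes DataFrames is loaded and df_ln is the data frame from the question.

```julia
# Column names as printed by names(df_ln) in the question
cols = [
    "Aluno",
    "DRE",
    "Questionário: Teste 1 (T1) (Real)",
    "Questionário: Primeira Prova (P1) (Real)",
    "Questionário: Teste 2 (T2) (Real)",
    "Questionário: Segunda Prova (P2) (Real)",
    "Questionário: Teste 3 (T3) (Real)",
    "Questionário: Teste 4 (T4) (Real)",
    "Questionário: Terceira Prova (P3) (Real)",
    "Questionário: Segunda Chamada (SC) (Real)",
    "Questionário: NA_Presentes (NP) (Real)",
]

# Hypothetical helper: keep only the text inside the first pair of parentheses.
# [^(]* cannot cross an opening parenthesis, so the first group is the one captured.
# Names without parentheses do not match and are returned unchanged.
abbrev(name) = replace(name, r"^[^(]*\(([^)]*)\).*$" => s"\1")

new_names = map(abbrev, cols)

# With DataFrames loaded (assumption), the same function can be passed to rename!:
# rename!(abbrev, df_ln)
```

Because "Aluno" and "DRE" contain no parentheses, the pattern leaves them alone, so there is no need to restrict the call to columns 3 to 11.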
help?> match
search: match eachmatch RegexMatch AbstractMatch DimensionMismatch
match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])
Search for the first match of the regular expression r in s and
return a RegexMatch object containing the match, or nothing if the
match failed.
but
julia> str="Questionário: Segunda Chamada (SC) (88) (Real)"
"Questionário: Segunda Chamada (SC) (88) (Real)"
julia> match(r".*\((..)\).*",str)
RegexMatch("Questionário: Segunda Chamada (SC) (88) (Real)", 1="88")
Why doesn’t the pattern match the first occurrence of (..)?
Or am I misinterpreting the help definition?
Because regexp wildcards take as many characters as they can, and the second (…) also allows a match. Replace the first . with [^(] so that the prefix cannot run past an opening parenthesis, and it can’t skip the first match.
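To make the difference concrete, here is a small sketch comparing the original pattern with the suggested one on the string from the question. The variable names greedy and first_group are only for illustration.

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# Greedy prefix: .* first grabs the whole string, then backtracks from the end
# until \((..)\) fits, so the last two-character group is captured.
greedy = match(r".*\((..)\).*", str)

# [^(]* cannot cross an opening parenthesis, so the prefix has to stop
# right before the first "(" and the first group is captured.
first_group = match(r"[^(]*\((..)\).*", str)

println(greedy[1])       # 88
println(first_group[1])  # SC
```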
It’s not a Julia-specific issue, but while we’re at it, could you also clarify why * was designed to consume as many characters as it can, all the way to the last “(…)” in this case?
This seems like a more expensive strategy than the opposite one, which would stop at the first match, doesn’t it?
Yeah it isn’t a Julia issue, but a regexp issue. The greedy default of regexp wildcards seems natural with a .* regexp. One wouldn’t want it to match nothing by default.
There are regexp modifiers which make wildcards lazy: .*? matches lazily.
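As a quick illustration of the lazy modifier, a sketch on the same string as before (the variable name lazy is only for illustration):

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# .*? grows one character at a time and stops as soon as the rest of the
# pattern can match, so the first two-character group is captured.
lazy = match(r".*?\((..)\).*", str)

println(lazy[1])  # SC
```

The trailing .* is still greedy, so the overall match still spans the whole string; only the capture changes.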
I tried the following patterns in a string containing two potential matches.
One finds the first (in effect it matches exactly the requested group), the other finds the last (in effect it matches the entire string that ends with the group (SC.), but “extracts” only group 1).
julia> str="Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"
"Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"
julia> match(r"\((SC.)\)",str)
RegexMatch("(SC1)", 1="SC1")
julia> match(r".*\((SC.)\)",str)
RegexMatch("Questionário: Segunda Chamada (SC1) (88) (SC2)", 1="SC2")
At this point I am left wondering whether the match help documentation is correct.
The statement that it finds the first occurrence holds for the first pattern, but not for the second, since the first occurrence (at least in the common sense of the term) would be …
RegexMatch("Questionário: Segunda Chamada (SC1)", 1="SC1")
Technically, str doesn’t match the regexp r"\((SC.)\)" as a whole, since it doesn’t start with a (. But match behaves as if a .*? were inserted before the regexp: it finds the earliest position at which the regexp matches, if one exists. The matched string does not include the skipped characters, which the implicit .*? consumed.
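A small sketch of that behaviour, using the string from the earlier post. The names m1, m2 and m3 are only for illustration; the point is to compare the bare pattern with an explicit lazy prefix and with a greedy one.

```julia
str = "Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"

# Bare pattern: match scans for the earliest start position that works.
m1 = match(r"\((SC.)\)", str)

# Explicit lazy prefix: same capture, but the skipped prefix is now part of the match.
m2 = match(r".*?\((SC.)\)", str)

# Greedy prefix: the match still starts at position 1, but .* pushes the group
# as far right as possible.
m3 = match(r".*\((SC.)\)", str)

println(m1.match)  # (SC1)
println(m2.match)  # Questionário: Segunda Chamada (SC1)
println(m3[1])     # SC2
```

In all three cases the match starts at the earliest possible position, which is the sense in which the help text says "first match"; the greedy .* only decides how far that one match extends.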
This isn’t rocket science (and I’d feel much safer if rockets didn’t have a lot of regexps), but it is worth reading a bit about lazy vs greedy wildcards in one of the thousands of regexp expositions.