Stripping or replacing substrings in a vector of strings

Hi there,

I have a dataframe df_ln, whose names are:

julia> names(df_ln)
11-element Vector{String}:
 "Aluno"
 "DRE"
 "Questionário: Teste 1 (T1) (Real)"
 "Questionário: Primeira Prova (P1) (Real)"
 "Questionário: Teste 2 (T2) (Real)"
 "Questionário: Segunda Prova (P2) (Real)"
 "Questionário: Teste 3 (T3) (Real)"
 "Questionário: Teste 4 (T4) (Real)"
 "Questionário: Terceira Prova (P3) (Real)"
 "Questionário: Segunda Chamada (SC) (Real)"
 "Questionário: NA_Presentes (NP) (Real)"

I would like to rename the columns from 3 to 11 (the last one) such that only the unique abbreviation within the first parentheses is used; i.e., the ensuing new names should be:

 "Aluno"
 "DRE"
 "T1"
 "P1"
 "T2"
 "P2"
 "T3"
 "T4"
 "P3"
 "SC"
 "NP"

I guess this should be easily achievable with Base's replace or DataFrames' rename! function and a suitable regular expression pair, but I cannot wrap my head around the correct choice…
Thanks in advance

You can try:

rename!(df_ln) do s
    replace(s, r"[^(]+\(([^)]+)\).*" => s"\1")
end
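
To see what the pattern does without a DataFrame at hand, here is a minimal sketch that applies the same replace call to a plain vector of strings. The names `cols` and `short` are illustrative only, and the vector is a shortened copy of the column names from the question.

```julia
# Illustrative stand-in for names(df_ln); `cols` and `short` are made-up names.
cols = ["Aluno", "DRE",
        "Questionário: Teste 1 (T1) (Real)",
        "Questionário: Segunda Chamada (SC) (Real)"]

# [^(]+        one or more characters that are not '('
# \(([^)]+)\)  the first parenthesised group; its content is captured as \1
# .*           whatever follows
# Strings without any '(' do not match and are returned unchanged.
short = replace.(cols, r"[^(]+\(([^)]+)\).*" => s"\1")

println(short)
```

Because names without parentheses pass through untouched, the first two columns need no special handling.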

One regex-less option:

s = names(df_ln)
k = findfirst.('(', s[3:end]) .+ 1
rename!(df_ln, [s[1:2]; @. getindex(s[3:end], range(k, k+1))])

semi-regex

 nms = names(df_ln)
 map(i -> nms[i][findfirst(r"\(..\)", nms[i])], 3:11)  # yields "(T1)", "(P1)", … with the parentheses included

3/4 regex

replace.(names(df_ln), r".*\((..)\).*" => s"\1")

.* = any leading characters (as many as possible)

\( = opening parenthesis

(..) = capture group 1: exactly two characters

\) = closing parenthesis

.* = any trailing characters
help?> match
search: match eachmatch RegexMatch AbstractMatch DimensionMismatch

  match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])

  Search for the first match of the regular expression r in s and
  return a RegexMatch object containing the match, or nothing if the
  match failed.

but

julia>  str="Questionário: Segunda Chamada (SC) (88) (Real)"
"Questionário: Segunda Chamada (SC) (88) (Real)"

julia>  match(r".*\((..)\).*",str)
RegexMatch("Questionário: Segunda Chamada (SC) (88) (Real)", 1="88")  

Why doesn’t the pattern match the first occurrence of (..)?
Or am I misinterpreting the help text?

Because regexp quantifiers are greedy: .* takes as many characters as it can, and the second parenthesised group, (88), still allows the rest of the pattern to match. Replace the first . with [^(] and it can’t skip the first match.
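
A small sketch of that suggestion, using the string from the question above. The variable names `greedy` and `negated` are just labels chosen for this example.

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# Greedy .* runs to the end of the string and backtracks,
# so the LAST two-character group wins.
greedy = match(r".*\((..)\).*", str)

# [^(]* cannot cross an opening parenthesis,
# so the FIRST parenthesised group is the only candidate.
negated = match(r"[^(]*\((..)\).*", str)

println(greedy[1])   # 88
println(negated[1])  # SC
```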


It’s not a Julia-specific issue, but while we’re at it, could you also clarify why * was designed to consume all the characters it can, down to the last “(…)” in this case?
This seems like a more expensive strategy than the opposite one, which would make it stop at the first match, doesn’t it?

Yeah it isn’t a Julia issue, but a regexp issue. The greedy default of regexp wildcards seems natural with a .* regexp. One wouldn’t want it to match nothing by default.

A quantifier can be made lazy by appending ?: .*? matches as few characters as possible.
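
For instance, with the string from the question, a lazy prefix might look like the sketch below (`lazy` is only an illustrative variable name):

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# .*? gives up as early as possible, so the first "(..)" is the one captured;
# the trailing .* then swallows the rest, and the whole string is replaced by group 1.
lazy = replace(str, r".*?\((..)\).*" => s"\1")

println(lazy)  # SC
```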

I tried the following patterns on a string containing two potential matches.
One finds the first (in effect it matches exactly the requested group); the other finds the last (in effect it matches the whole prefix of the string that ends with the group (SC.), but “extracts” only group 1).

julia>  str="Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"     
"Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"

julia>  match(r"\((SC.)\)",str)
RegexMatch("(SC1)", 1="SC1")

julia>  match(r".*\((SC.)\)",str)
RegexMatch("Questionário: Segunda Chamada (SC1) (88) (SC2)", 1="SC2")

At this point I am left wondering whether the match help documentation is correct.
When it says that it finds the first occurrence, that holds for the first pattern but not for the second, since the first occurrence (at least in the common sense of the term) would be …

RegexMatch("Questionário: Segunda Chamada (SC1)", 1="SC1")

Or not?

Technically, str as a whole doesn’t match the regexp r"\((SC.)\)", since it doesn’t start with a (. But match behaves as if a .*? were inserted before the regexp: it finds the earliest position at which the regexp matches, if one exists. The returned match does not include the skipped characters consumed by that implicit .*?. With r".*\((SC.)\)" the match already starts at position 1, and the greedy .* then extends it as far as it can.
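
The three cases can be compared side by side. This is a sketch only; `bare`, `lazy` and `greedy` are names picked for the example, and the `offset` field of a RegexMatch gives the byte index where the match starts.

```julia
str = "Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"

bare   = match(r"\((SC.)\)", str)     # unanchored: the engine scans for the leftmost start
lazy   = match(r".*?\((SC.)\)", str)  # explicit lazy prefix, which becomes part of the match
greedy = match(r".*\((SC.)\)", str)   # greedy prefix, backtracking from the end of the string

println(bare.match)    # (SC1)
println(lazy.match)    # Questionário: Segunda Chamada (SC1)
println(greedy.match)  # Questionário: Segunda Chamada (SC1) (88) (SC2)

# "First match" refers to where the match STARTS, not to which group is captured:
println(bare.offset == findfirst('(', str))  # true: starts at the first '('
println(lazy.offset)                         # 1
println(greedy.offset)                       # 1
```

In each case the match begins at the earliest position where the pattern can succeed; greediness only decides how far it extends from there.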

This isn’t rocket science (and I’d feel much safer if rockets didn’t have a lot of regexps), but it is worth reading a bit about lazy vs greedy quantifiers in one of the thousands of regexp expositions.