Stripping or replacing substrings into a vector of strings

mocalvao · February 27, 2024, 3:00pm

Hi there,

I have a dataframe df_ln, whose names are:

julia> names(df_ln)
11-element Vector{String}:
 "Aluno"
 "DRE"
 "Questionário: Teste 1 (T1) (Real)"
 "Questionário: Primeira Prova (P1) (Real)"
 "Questionário: Teste 2 (T2) (Real)"
 "Questionário: Segunda Prova (P2) (Real)"
 "Questionário: Teste 3 (T3) (Real)"
 "Questionário: Teste 4 (T4) (Real)"
 "Questionário: Terceira Prova (P3) (Real)"
 "Questionário: Segunda Chamada (SC) (Real)"
 "Questionário: NA_Presentes (NP) (Real)"

I would like to rename the columns from 3 to 11 (the last one) such that only the unique abbreviation within the first parentheses is used; i.e., the ensuing new names should be:

 "Aluno"
 "DRE"
 "T1"
 "P1"
 "T2"
 "P2"
 "T3"
 "T4"
 "P3"
 "SC"
 "NP"

I guess this should be easily achievable with the Base replace function or the DataFrames rename! function and a suitable regular-expression pair, but I cannot wrap my head around the correct choice…
Thanks in advance

Dan · February 27, 2024, 4:06pm

You can try:

rename!(df_ln) do s
    replace(s, r"[^(]+\(([^)]+)\).*" => s"\1")
end
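To see what this pattern does without a DataFrame at hand, here is a minimal sketch in plain Base Julia. The vector `cols` is a hypothetical stand-in for `names(df_ln)`, shortened to four entries.

```julia
# Hypothetical stand-in for names(df_ln); only Base Julia is needed here.
cols = ["Aluno", "DRE",
        "Questionário: Teste 1 (T1) (Real)",
        "Questionário: Segunda Chamada (SC) (Real)"]

# [^(]+        one or more characters that are not '('
# \(([^)]+)\)  the first parenthesized group; its content is captured as \1
# .*           everything after it, which is discarded
pattern = r"[^(]+\(([^)]+)\).*"

# Names without parentheses do not match and are returned unchanged.
short = replace.(cols, pattern => s"\1")
```

Because the pattern never matches "Aluno" or "DRE", applying it to every column, as the do-block does, leaves the first two names alone.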

rafael.guerra · February 27, 2024, 5:49pm

One regex-less option:

s = names(df_ln)
k = findfirst.('(', s[3:end]) .+ 1
rename!(df_ln, [s[1:2]; @. getindex(s[3:end], range(k, k+1))])
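The snippet above relies on every abbreviation being exactly two characters long. As a sketch of a regex-less variant without that assumption, the hypothetical helper `abbrev` below (not part of Base or DataFrames) looks for both the opening and the closing parenthesis.

```julia
# Hypothetical helper: return the text inside the first pair of parentheses,
# or the name itself when there is no such pair.
function abbrev(name::AbstractString)
    lo = findfirst('(', name)
    lo === nothing && return String(name)
    hi = findnext(')', name, lo)
    hi === nothing && return String(name)
    # nextind/prevind step by character, so multi-byte text is handled safely
    return String(name[nextind(name, lo):prevind(name, hi)])
end

abbrevs = abbrev.(["Aluno", "Questionário: NA_Presentes (NP) (Real)"])
```

Assuming the df_ln from the question, `rename!(df_ln, abbrev.(names(df_ln)))` would then rename all columns at once, since names without parentheses pass through unchanged.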

rocco_sprmnt21 · February 27, 2024, 10:22pm

semi-regex

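# names is the vector of column names, i.e. names(df_ln); the result keeps the parentheses, e.g. "(T1)"
map(i->getindex(names[i], findfirst.(r"\(..\)",names)[i]), 3:11)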

3/4 regex

replace.(names, r".*\((..)\).*"=>s"\1")


.* = any leading characters
\( = opening parenthesis
(..) = first capture group: exactly two characters
\) = closing parenthesis
.* = any trailing characters

rocco_sprmnt21 · February 28, 2024, 7:51am

help?> match
search: match eachmatch RegexMatch AbstractMatch DimensionMismatch

  match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])

  Search for the first match of the regular expression r in s and
  return a RegexMatch object containing the match, or nothing if the
  match failed.

but

julia>  str="Questionário: Segunda Chamada (SC) (88) (Real)"
"Questionário: Segunda Chamada (SC) (88) (Real)"

julia>  match(r".*\((..)\).*",str)
RegexMatch("Questionário: Segunda Chamada (SC) (88) (Real)", 1="88")

Why doesn’t the pattern match the first occurrence of (..)?
Or am I misinterpreting the help text?

Dan · February 28, 2024, 8:57am

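Because regexp wildcards are greedy: they take as many characters as they can. The leading .* first runs to the end of the string and then backtracks, and the second two-character group, (88), is the first point at which the rest of the pattern matches. Replace the first . with [^(] and it can’t skip the first group.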
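A small sketch of that difference, using the same test string as the question above (the variable names are only illustrative):

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# Greedy: the leading .* grabs as much as it can, so the capture is the
# last two-character group that still lets the whole pattern match.
greedy = match(r".*\((..)\).*", str)

# Negated class: [^(]* cannot step over a '(', so the first group is captured.
first_only = match(r"[^(]*\((..)\).*", str)

(greedy[1], first_only[1])
```

The greedy pattern captures "88", while the negated-class pattern captures "SC".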

rocco_sprmnt21 · February 28, 2024, 11:34am

It’s not a Julia-specific issue, but while we’re at it, could you also clarify why * was designed to consume all the characters it can, down to the last “(…)” in this case?
This seems like a more expensive strategy than the opposite one, which would make it stop at the first match, doesn’t it?

Dan · February 28, 2024, 11:39am

Yeah it isn’t a Julia issue, but a regexp issue. The greedy default of regexp wildcards seems natural with a .* regexp. One wouldn’t want it to match nothing by default.

There are regexp modifiers which make wildcards lazy: .*? matches lazily.
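As a minimal illustration of the lazy form, again on the string from the question: with .*? the leading wildcard gives up characters as early as possible, so the first two-character group is captured.

```julia
str = "Questionário: Segunda Chamada (SC) (88) (Real)"

# Lazy leading wildcard: expand only as far as needed for the rest to match.
lazy = match(r".*?\((..)\).*", str)

# The same pattern used for the replacement discussed in this thread.
renamed = replace(str, r".*?\((..)\).*" => s"\1")
```

Here both `lazy[1]` and `renamed` come out as "SC".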

rocco_sprmnt21 · February 28, 2024, 5:25pm

I tried the following two patterns on a string containing more than one potential match.
The first finds the first occurrence (in effect it matches exactly the requested group); the second finds the last (in effect it matches the entire string up to and including the final (SC.) group, but “extracts” only group 1).

julia>  str="Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"     
"Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"

julia>  match(r"\((SC.)\)",str)
RegexMatch("(SC1)", 1="SC1")

julia>  match(r".*\((SC.)\)",str)
RegexMatch("Questionário: Segunda Chamada (SC1) (88) (SC2)", 1="SC2")

At this point I am left wondering whether the help text for match is correct.
Saying that it finds the first occurrence fits the first pattern, but not the second, since the first occurrence (at least in the common sense of the term) would be …

RegexMatch("Questionário: Segunda Chamada (SC1)", 1="SC1")

or not?

Dan · February 28, 2024, 5:40pm

Technically, str doesn’t match the regexp r"\((SC.)\)", since it doesn’t start with a (. But match behaves as if a .*? were inserted before the regexp; that is, it finds the earliest position at which the regexp matches, if one exists. The returned match does not include the skipped characters that this implicit .*? consumed.
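A short sketch of that equivalence on the string from the previous post: writing the lazy prefix out explicitly (here also anchored with ^, which is an addition for illustration) captures the same group, and the only difference is that the skipped prefix becomes part of the reported match.

```julia
str = "Questionário: Segunda Chamada (SC1) (88) (SC2)(Real)"

# Unanchored search: match scans for the earliest position that works.
unanchored = match(r"\((SC.)\)", str)

# The same search with the skipping written out as an explicit lazy prefix.
anchored = match(r"^.*?\((SC.)\)", str)

(unanchored.match, anchored.match)
```

Both capture "SC1"; `unanchored.match` is "(SC1)", while `anchored.match` is "Questionário: Segunda Chamada (SC1)", which is the result asked about in the previous post.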

This isn’t rocket-science (and I’d feel much safer if rockets didn’t have a lot of regexps), but it is worth it to read a bit about the lazy vs greedy wildcards in one of the thousands of regexp expositions.
