Replacing strings in specific position indices as in `str_sub()` in `stringr`

Alex_Tantos · November 13, 2022, 10:37am

Hi.

I am trying to reproduce the functionality of the R package stringr in Julia so that I can prepare a cheatsheet for string manipulation in Julia. In particular, I am wondering how to reproduce the output of str_sub(). Here is the example used in the stringr cheatsheet for str_sub():

> fruit <- c("apple", "banana", "pear", "pinapple")
> str_sub(fruit, 1, 3) <- "str" 
> fruit
[1] "strle"    "strana"   "strr"     "strapple"
>

So,
a) I was wondering whether there is an even more succinct way of doing what the above R line of code does than the following:

julia> fruit = ["apple", "banana", "pear", "pinapple"]
julia> replace.(fruit, first.(fruit,3) .=> "str")
4-element Vector{String}:
 "strle"
 "strana"
 "strr"
 "strapple"

and
b) whether, when we want to replace a substring in the middle of the string, we could use something better than the following (since first() cannot return anything from the middle of a string):

julia> replace.(fruit, SubString.(fruit, Ref(3:4)) .=> "AA")
4-element Vector{String}:
 "apAAe"
 "baAAAA"
 "peAA"
 "piAApple"

Using stringr’s functionality in R you could achieve that with:

> str_sub(fruit, 3, 4) <- "str"
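
One caveat with the Julia `replace.` call above: `replace` works on matches, not on positions, so every occurrence of the extracted substring is replaced (which is why "banana" became "baAAAA"). A minimal sketch of the difference, using the `count` keyword of `replace` to limit the number of replacements:

```julia
s = "banana"
pat = SubString(s, 3:4)                       # "na", taken from positions 3:4
r_all = replace(s, pat => "AA")               # every "na" is replaced
r_first = replace(s, pat => "AA"; count = 1)  # only the first match is replaced

# The first match need not sit at the requested position:
t = "nana"
r_wrong = replace(t, SubString(t, 3:4) => "AA"; count = 1)  # positions 1:2 change, not 3:4

println((r_all, r_first, r_wrong))
```

Even with `count = 1` the replacement is match-based, so a truly positional replacement has to be built from the pieces before and after the range.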

I know that stringr is not part of base R, so it is not fair to expect Base Julia to do elegantly what an R package does. Still, I would be happy to get feedback on alternative ways of achieving these things.

PS: Notice that if a string is shorter than the end of the range I pass to Ref(), I get an error in Julia, whereas R's str_sub() silently performs the replacement anyway. So:

julia> replace.(fruit, SubString.(fruit, Ref(3:5)) .=> "AA")
ERROR: BoundsError: attempt to access 4-codeunit String at index [3:5]

In R, however, it silently adjusts the range, so that for “pear” the replacement string replaces the last two characters and the result is one character longer.

> str_sub(fruit, 3, 5) <- "str" 
> fruit
[1] "apstr"    "bastra"   **"pestr"**    "pistrple"

Thanks!

rafael.guerra · November 13, 2022, 7:10pm

One single shot:

str_sub(S,str,n1,n2) = @. SubString(S,1,n1-1) * str * SubString(S,n2+1,length(S))

# result:
str_sub(fruit, "str", 3, 5)
 "apstr"
 "bastra"
 "pestr"
 "pistrple"

Alex_Tantos · November 13, 2022, 8:05pm

Great! Thanks!

What does the @. macro do? Could you point me to somewhere I can find out more about it?

Alex

rafael.guerra · November 13, 2022, 8:32pm

See the Dotfather’s blog post.
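
In short, `@.` rewrites every function call and operator in the expression that follows into its dotted (broadcast) form, so the expression is applied element by element. A small illustration with the `fruit` vector from this thread:

```julia
fruit = ["apple", "banana", "pear", "pinapple"]

# With the macro: every call and operator is broadcast automatically.
with_macro = @. SubString(fruit, 1, 2) * "str"

# The same expression with the dots written out by hand.
with_dots = SubString.(fruit, 1, 2) .* "str"

println(with_macro == with_dots)
```

Scalars such as `1`, `2` and `"str"` are reused for every element, which is why the one-liner above works on a whole vector of strings.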

rocco_sprmnt21 · November 13, 2022, 11:28pm

A modification to deal with non-ASCII characters (not tested):

str_sub(S,str,n1,n2) = @. SubString(S,1,nexindex(str,n1-1)) * str * SubString(S,nexindex(str,n2+1),lastindex(S))
fruit = ["apple", "banana", "pearr", "pinappleα8"]
# result:
str_sub(fruit, "str", 3, 5)

Edited

str_sub(S,str,n1,n2) = @. SubString(S,1,prevind(S,n1)) * str * SubString(S,nextind(S,n2),lastindex(S))
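
Note that `prevind(S, n1)` and `nextind(S, n2)` step from string indices (byte offsets for a `String`), so in this form `n1` and `n2` are interpreted as string indices rather than character counts. A short sketch of how the two differ for a non-ASCII string:

```julia
s = "αβγ"                          # each of these characters takes 2 bytes in UTF-8

n_chars = length(s)                # number of characters
n_units = ncodeunits(s)            # number of bytes (code units)
last_ix = lastindex(s)             # index at which the last character starts
valid   = collect(eachindex(s))    # the valid string indices

second_char_ix = nextind(s, 0, 2)  # index of the 2nd character, counted in characters
before_last    = prevind(s, 5)     # valid index preceding index 5

println((n_chars, n_units, last_ix, valid, second_char_ix, before_last))
```

The three-argument form `nextind(s, i, n)` is what converts a character count into a string index.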

Dan · November 14, 2022, 12:13am

Fixed some typos in rocco’s untested version:

str_sub(S,str,n1,n2) =
  @. SubString(S, 1, nextind(S, 1, n1-1)) *
    str *
    SubString(S, nextind(S, 1, n2-1), lastindex(S))

(tested)

But I feel that using @. with a vector argument is not as clean as defining a scalar function:

function str_sub(S,str,n1,n2)
    ind1 = nextind(S, 1, n1-1)
    ind2 = nextind(S, ind1, n2-n1)
    endind = lastindex(S)
    return SubString(S, 1, ind1) * str * SubString(S, ind2, endind)
end

and

julia> str_sub.(fruit, "str", 3, 5)
4-element Vector{String}:
 "appstre"
 "banstrna"
 "peastr"
 "pinstrppleα8"

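Note that the output above differs from R's result for the same call ("appstre" rather than "apstr"): `nextind(S, 1, n1-1)` is the index of character n1 itself, so characters n1 and n2 are kept. A minimal sketch that replaces characters n1 through n2 inclusive, counted in characters, might look like the following. The name `str_sub_chars` is hypothetical, and the sketch assumes 1 ≤ n1 ≤ n2 and n1 ≤ length(s) + 1.

```julia
# Hypothetical helper (not a Base function): replace characters n1..n2
# (1-based, inclusive, counted in characters) of `s` with `str`.
function str_sub_chars(s::AbstractString, str::AbstractString, n1::Integer, n2::Integer)
    # Index of character n1-1, i.e. the end of the prefix (0 when n1 == 1).
    stop_prefix = nextind(s, 0, n1 - 1)
    # Index of character n2+1, i.e. the start of the suffix. If the string is
    # too short this runs past the end and the suffix below is empty.
    start_suffix = nextind(s, stop_prefix, n2 - n1 + 2)
    return SubString(s, 1, stop_prefix) * str * SubString(s, start_suffix, lastindex(s))
end

fruit = ["apple", "banana", "pear", "pinapple"]
println(str_sub_chars.(fruit, "str", 3, 5))
```

Because `SubString(s, i, j)` is empty whenever `i > j`, a range that extends past the end of a short string such as "pear" simply yields an empty suffix instead of an error.
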
rafael.guerra · November 14, 2022, 8:35am

I think that for Unicode strings, we should use the extended graphemes function with two arguments (string, unit range) written by @stevengj (see the PR for Julia 1.9; a first version of his code is given further below).

function str_sub(S,str,n1,n2)
   isascii(S) && (return SubString(S,1,n1-1) * str * SubString(S,n2+1,length(S)))
   return graphemes(S,1:n1-1) * str * graphemes(S,n2+1:length(graphemes(S)))
end

fruits = ["bubu", "βüβü"]

# result:
str_sub.(fruits,"AB",2,3)
 "bABu"
 "βABü"

graphemes(s, mn) by @stevengj

import Unicode: graphemes

function graphemes(s::AbstractString, mn::AbstractUnitRange{<:Integer})
    m, n = Int(first(mn)), Int(last(mn))
    m > 0 || throw(ArgumentError("starting index $m is not ≥ 1"))
    n < m && return @view s[1:0]
    c0 = eltype(s)(0x00000000)
    state = Ref{Int32}(0)
    count = 0
    i, iprev, ilast = 1, 1, lastindex(s)
    # find the start of the m-th grapheme
    while i ≤ ilast && count < m
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        c0 = c
        i, iprev = nextind(s, i), i
    end
    start = iprev
    count < m && throw(BoundsError(s, i))
    # find the end of the n-th grapheme
    while i ≤ ilast
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        count > n && break
        c0 = c
        i, iprev = nextind(s, i), i
    end
    count < n && throw(BoundsError(s, i))
    return @view s[start:iprev]
end
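
To see why graphemes rather than code points are counted here, consider a string in which one user-perceived character is made of two code points. A small sketch using only the one-argument `graphemes` from the `Unicode` standard library:

```julia
using Unicode  # standard library; provides graphemes

s = "e\u0301"                          # 'e' followed by U+0301 (combining acute accent)
n_codepoints = length(s)               # counts code points
n_graphemes  = length(graphemes(s))    # counts user-perceived characters

println((n_codepoints, n_graphemes))
```

A replacement that counts code points could split such a character in two, whereas counting graphemes keeps the letter and its accent together.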

Alex_Tantos · November 14, 2022, 10:47am

I guess this is the most up-to-date and complete response. Btw, graphemes() is cool and is something that was really missing!

stevengj · November 14, 2022, 2:16pm

I’m a bit skeptical that this is what you would want in a realistic application. How is the data being generated such that you know specific grapheme indices to replace?

Most commonly, string indices are obtained by searching or iterating over a string, in which case you want the actual index and not a character or grapheme index, and this is what string slicing and substrings already do. See also my comment in another thread (Substring function? - #27 by stevengj) and the surrounding discussion:

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.

“Give me the m:n-th user-perceived ‘characters’ (i.e. graphemes)” is something that people commonly ask for in their first steps of using strings in Julia, which is why I added a graphemes(s, m:n) function, but when you go farther you typically find that this is not what is needed at all.
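
As an illustration of the search-based workflow described above, here is a minimal sketch in which the indices come from `findfirst` and are then used directly for slicing. The example string is made up, and the sketch assumes the pattern is found (`findfirst` returns `nothing` otherwise).

```julia
s = "αβ-pear"

# findfirst returns a range of valid string indices for the match.
r = findfirst("pe", s)

# Splice using those indices; prevind/nextind step to the neighbouring characters.
result = s[1:prevind(s, first(r))] * "str" * s[nextind(s, last(r)):end]

println((r, result))
```

The match starts at string index 6 even though "p" is the fourth character, and no character counting is needed at any point.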

Alex_Tantos · November 14, 2022, 7:39pm

What you say is 100% true and understandable. However, I would suggest that we don’t underestimate the power/confidence such a function could give to a beginner.

In fields outside CS, the difficulty of understanding encoding concepts such as code units and characters is pretty frustrating and discouraging for people who are trying to learn the language and whose concerns are not related to string internals.

PS: I have been teaching a linguistics course this semester using Julia for the first time and ended up spending a whole hour at the very beginning of the course (for undergrad students with no prior programming experience) explaining more advanced concepts. I had the same problem years ago when using Python: it was not easy to explain the encode-decode cycle for Greek texts.
