Replacing strings in specific position indices as in `str_sub()` in `stringr`

Hi.

I am trying to replicate the functionality of the R package stringr in Julia so that I can prepare a cheatsheet for string manipulation in Julia. In particular, I am wondering how to reproduce the output of str_sub() in Julia. Here is the example used in the stringr cheatsheet for str_sub():

> fruit <- c("apple", "banana", "pear", "pinapple")
> str_sub(fruit, 1, 3) <- "str" 
> fruit
[1] "strle"    "strana"   "strr"     "strapple"
> 

So,
a) I was wondering whether there is an even more succinct way of doing what the above R line of code does than the following:

julia> fruit = ["apple", "banana", "pear", "pinapple"]
julia> replace.(fruit, first.(fruit,3) .=> "str")
4-element Vector{String}:
 "strle"
 "strana"
 "strr"
 "strapple"

and
b) whether, in case we want to replace a substring in the middle of the string, we could use something better than the following (since, as far as I can tell, first() cannot return anything from the middle of a string):

julia> replace.(fruit, SubString.(fruit, Ref(3:4)) .=> "AA")
4-element Vector{String}:
 "apAAe"
 "baAAAA"
 "peAA"
 "piAApple"

Using stringr’s functionality in R you could achieve that with:

> str_sub(fruit, 3, 4) <- "str" 

I know that stringr is not part of base R and it is not fair to use Base Julia for achieving elegantly the same thing that an R package does. However, I would be happy if I got some relevant feedback for possible alternative ways of achieving these things.

PS: Notice that if some string is shorter than the range I pass in Ref(), I get an error in Julia, whereas in R str_sub() silently does the replacement anyway. So:

julia> replace.(fruit, SubString.(fruit, Ref(3:5)) .=> "AA")
ERROR: BoundsError: attempt to access 4-codeunit String at index [3:5]

In R, however, it silently makes adjustments so that the replacement string replaces the last two characters for “pear” and adds one more.

> str_sub(fruit, 3, 5) <- "str" 
> fruit
[1] "apstr"    "bastra"   "pestr"    "pistrple"

Thanks!

One single shot:

str_sub(S,str,n1,n2) = @. SubString(S,1,n1-1) * str * SubString(S,n2+1,length(S))

# result:
str_sub(fruit, "str", 3, 5)
 "apstr"
 "bastra"
 "pestr"
 "pistrple"
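Two details make this positional version behave like str_sub(). SubString(s, i, j) returns an empty substring whenever i > j, so a string shorter than n2 raises no BoundsError. And concatenating by position touches only the requested span, whereas replace() substitutes every occurrence of the matched text (hence "baAAAA" in the question). A small check, assuming ASCII-only strings so that byte indices and character positions coincide:

```julia
# Assumes ASCII input, so byte indices equal character positions.
str_sub(S, str, n1, n2) = @. SubString(S, 1, n1 - 1) * str * SubString(S, n2 + 1, length(S))

# A start index past the stop index yields an empty SubString, not an error.
tail = SubString("pear", 6, 4)
isempty(tail)                                   # true

# Positional replacement changes only characters 3:4 ...
by_position = str_sub(["banana"], "AA", 3, 4)   # ["baAAna"]

# ... while replace() substitutes every occurrence of "na" (characters 3:4 of "banana").
by_content = replace("banana", "na" => "AA")    # "baAAAA"
```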

Great! Thanks!

What does the @. macro do? Could you redirect me somewhere to find out more about it?

Alex

See the Dotfather’s blog post.
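In short, @. adds a dot to every function call and operator in the expression that follows it, so the whole expression is broadcast elementwise. A minimal illustration, with made-up variable names:

```julia
fruit = ["apple", "banana", "pear"]

# Written with explicit dots on every call and operator ...
explicit = uppercase.(first.(fruit, 2)) .* "-" .* string.(length.(fruit))

# ... and the same expression with @. inserting the dots automatically.
with_macro = @. uppercase(first(fruit, 2)) * "-" * string(length(fruit))

explicit == with_macro   # true
```

Scalars such as 2 and "-" are treated as single values and reused for every element, which is why the str_sub one-liner can take a plain string and plain integers next to a vector.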


A modification to handle non-ASCII characters (not tested):

str_sub(S,str,n1,n2) = @. SubString(S,1,nexindex(str,n1-1)) * str * SubString(S,nexindex(str,n2+1),lastindex(S))
fruit = ["apple", "banana", "pearr", "pinappleα8"]
# result:
str_sub(fruit, "str", 3, 5)

Edited

str_sub(S,str,n1,n2) = @. SubString(S,1,prevind(S,n1)) * str * SubString(S,nextind(S,n2),lastindex(S))

Fixed some typos in rocco’s untested version:

str_sub(S,str,n1,n2) =
  @. SubString(S, 1, nextind(S, 0, n1-1)) *
    str *
    SubString(S, nextind(S, 0, n2+1), lastindex(S))

(tested)
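The conversion from character positions to byte indices can be checked on its own. For a valid string, nextind(s, 0, n) returns the byte index at which the n-th character starts, which differs from n as soon as a multi-byte character precedes it. A small sketch with an illustrative string:

```julia
s = "αβγδ"                 # each Greek letter occupies 2 bytes in UTF-8

length(s)                  # 4 characters
ncodeunits(s)              # 8 bytes

# Byte index at which the 3rd character starts.
i3 = nextind(s, 0, 3)      # 5
s[i3]                      # 'γ'

# Counting past the end does not throw; the index keeps advancing by one per step.
past = nextind(s, 0, 6)    # 10

# A byte offset that falls inside a character is not a valid index:
# s[2] would raise a StringIndexError.
isvalid(s, 2)              # false
```

Because SubString returns an empty result when its start index exceeds its stop index, an index that runs past the end, like `past` here, is harmless as a start position.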

But I feel that using @. with a vector argument is not as clean as writing a scalar function:

function str_sub(S,str,n1,n2)
    ind1 = nextind(S, 0, n1-1)   # byte index of the last character kept before the replacement
    ind2 = nextind(S, 0, n2+1)   # byte index of the first character kept after it
    endind = lastindex(S)
    return SubString(S, 1, ind1) * str * SubString(S, ind2, endind)
end

and

julia> str_sub.(fruit, "str", 3, 5)
4-element Vector{String}:
 "apstr"
 "bastra"
 "pestr"
 "pistrpleα8"

I think that for Unicode strings, we should use the extended graphemes function with 2 arguments (string, unit range) written by @stevengj (see here the PR for Julia 1.9 and a first version of his code further below).

function str_sub(S,str,n1,n2)
   isascii(S) && (return SubString(S,1,n1-1) * str * SubString(S,n2+1,length(S)))
   return graphemes(S,1:n1-1) * str * graphemes(S,n2+1:length(graphemes(S)))
end

fruits = ["bubu", "βüβü"]

# result:
str_sub.(fruits,"AB",2,3)
 "bABu"
 "βABü"

graphemes(s, mn) by @stevengj:

import Unicode: graphemes

function graphemes(s::AbstractString, mn::AbstractUnitRange{<:Integer})
    m, n = Int(first(mn)), Int(last(mn))
    m > 0 || throw(ArgumentError("starting index $m is not ≥ 1"))
    n < m && return @view s[1:0]
    c0 = eltype(s)(0x00000000)
    state = Ref{Int32}(0)
    count = 0
    i, iprev, ilast = 1, 1, lastindex(s)
    # find the start of the m-th grapheme
    while i ≤ ilast && count < m
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        c0 = c
        i, iprev = nextind(s, i), i
    end
    start = iprev
    count < m && throw(BoundsError(s, i))
    # find the end of the n-th grapheme
    while i ≤ ilast
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        count > n && break
        c0 = c
        i, iprev = nextind(s, i), i
    end
    count < n && throw(BoundsError(s, i))
    return @view s[start:iprev]
end
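
On Julia 1.9 or later the two-argument method ships with the Unicode standard library, so the definition above should only be needed on older versions. A short usage sketch, assuming Julia 1.9 or later; the example strings are illustrative:

```julia
using Unicode   # graphemes(s, m:n) is available here from Julia 1.9 on

s = "βüβü"
middle = graphemes(s, 2:3)            # SubString "üβ"

# A grapheme can span several code points: 'e' followed by a combining acute accent.
t = "e\u0301a"
length(t)                             # 3 code points
length(graphemes(t))                  # 2 user-perceived characters
first_grapheme = graphemes(t, 1:1)    # "é", made of two code points
```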

I guess this is the most up to date and complete response. Btw graphemes() is cool and is something that was really missing!

I’m a bit skeptical that this is what you would want in a realistic application. How is the data being generated such that you know specific grapheme indices to replace?

Most commonly, string indices are obtained by searching or iterating over a string, in which case you want the actual index and not a character or grapheme index, and this is what string slicing and substrings already do. See also my comment in another thread (“Substring function?”, post #27):

The basic argument is that a variable-width encoding with non-consecutive codepoint indices is a good tradeoff to make (memory efficiency + speed, at the cost of less-intuitive indexing) because “give me the m-th codepoint” or “give me the substring from codepoints m to n” is extremely uncommon in (correct) string-handling code, as opposed to “give me the substring at opaque indices I found in a previous search/loop”.

and the surrounding discussion.

“Give me the m:n-th user-perceived ‘characters’ (i.e. graphemes)” is something that people commonly ask for in their first steps of using strings in Julia, which is why I added a graphemes(s, m:n) function, but when you go farther you typically find that this is not what is needed at all.
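
To make the distinction concrete: indices returned by a search are already valid byte indices, so they can be passed straight to slicing without any character counting. A minimal sketch; the helper name replace_first_match is hypothetical and not part of Base:

```julia
# Hypothetical helper: replace the first occurrence of `pat` using the
# byte range returned by findfirst, without counting characters.
function replace_first_match(s::AbstractString, pat::AbstractString, repl::AbstractString)
    r = findfirst(pat, s)              # UnitRange of byte indices, or nothing
    r === nothing && return String(s)
    before = SubString(s, 1, prevind(s, first(r)))
    after  = SubString(s, nextind(s, last(r)), lastindex(s))
    return before * repl * after
end

replace_first_match("βanana", "na", "AA")   # "βaAAna"
```

The multi-byte "β" at the start shifts every byte index by one relative to the character position, yet the code never has to account for it because the range comes from the search itself.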


What you say is 100% true and understandable. However, I would suggest that we don’t underestimate the power/confidence such a function could give to a beginner.

In fields outside CS, the difficulty of understanding encoding concepts such as code units and characters is frustrating and discouraging for people who are trying to learn the language and whose concerns have nothing to do with string internals.

PS: I have been teaching a linguistics course this semester using Julia for the first time, and I ended up spending a whole hour at the very beginning of the course (for undergraduate students with no prior programming experience) explaining these more advanced concepts. I had the same problem years ago when using Python: it was not easy to explain the encode-decode cycle for Greek texts.
