Julia substring return empty string

Luigi_Marongiu · April 23, 2019, 2:00pm

Dear all,
I am trying to extract a substring from a string, but the result is empty:

julia> x="ATATATATATATATTDTDFTFTCVTVTTSKL:J"
"ATATATATATATATTDTDFTFTCVTVTTSKL:J"
julia> s=SubString(x, 5, 3)
""
julia> typeof(x)
String
julia> typeof(s)
SubString{String}
julia> S=x[5:3]
""
julia> typeof(S)
String

What am I getting wrong?

oheil · April 23, 2019, 2:08pm

Your last index is smaller than the starting one.
You probably want:

s=SubString(x, 5, 5+3-1)

This gives the substring that starts at index 5 and has length 3.
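A minimal sketch of the start/length idea, assuming ASCII data. The helper name `substr` is hypothetical and not part of Base; it only wraps the `SubString(x, start, start + len - 1)` call shown above.

```julia
# Hypothetical helper (not in Base): substring given a start index and a length.
# The indices are byte indices, so as written this is only safe for ASCII strings.
substr(x::AbstractString, start::Integer, len::Integer) =
    SubString(x, start, start + len - 1)

x = "ATATATATATATATTDTDFTFTCVTVTTSKL:J"

substr(x, 5, 3)   # "ATA" (characters 5, 6 and 7)
x[5:3]            # ""    (the range 5:3 is empty, hence the empty result)
```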

help?> SubString
search: SubString SubstitutionString

SubString(s::AbstractString, i::Integer, j::Integer=lastindex(s))
SubString(s::AbstractString, r::UnitRange{<:Integer})

Like getindex, but returns a view into the parent string s within range i:j or r respectively instead of making a copy.

Examples
≡≡≡≡≡≡≡≡≡≡

julia> SubString("abc", 1, 2)
"ab"

julia> SubString("abc", 1:2)
"ab"

julia> SubString("abc", 2)
"bc"

Luigi_Marongiu · April 23, 2019, 2:26pm

I see, so I need to work with the indices rather than START, LENGTH as in R/python. Thank you.

bkamins · April 23, 2019, 2:30pm

Note that the indices are byte indices not character indices. You probably should use the nextind function to get what you want for general UTF-8 strings.

See https://docs.julialang.org/en/latest/manual/strings/#Unicode-and-UTF-8-1 for additional explanations.
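To make the byte/character distinction concrete, here is a small sketch on a sample string that is my own choice, not taken from the question. It uses only Base functions (`length`, `ncodeunits`, `isvalid`, `nextind`).

```julia
s = "aαbβ"        # 'α' and 'β' each occupy 2 bytes in UTF-8

length(s)         # 4 characters
ncodeunits(s)     # 6 bytes
isvalid(s, 3)     # false: byte 3 is the second byte of 'α', so s[3] would throw
nextind(s, 2)     # 4: byte index of the character that follows 'α'
s[2:4]            # "αb": both endpoints are valid character start indices
```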

Luigi_Marongiu · April 23, 2019, 4:52pm

So would x[5:5+3-1] be a better choice for extracting substrings? Would the indices be byte indices in this case too? Thanks.

ExpandingMan · April 23, 2019, 4:59pm

Note that if you are using the Julia REPL you can view documentation on functions with ?function_name.

bkamins · April 23, 2019, 5:00pm

What I want to say is that in Julia there is a difference between character indices and byte indices.

They are equivalent only if your data is pure ASCII. In your example it is, so in that case it is safe to write what you have specified.

If you have UTF-8 data, you first have to tell me whether you use byte indices (probably not) or character indices (probably yes). In the latter case you should use functions that operate on characters, like nextind, first, last or chop. In your case you could write, for example, last(first(x, 7), 3) to get a 3-character string consisting of characters 5, 6 and 7 from the original string. This is not the most efficient way to do it, but it is the simplest.

A more efficient approach would use nextind to calculate the byte indices of the first and last characters you want, and then call SubString with these indices.
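One way the nextind approach might look is sketched below. The function name `charrange` is hypothetical, and the sketch assumes `1 <= from_char <= to_char <= length(s)`; it does no bounds checking of its own.

```julia
# Hypothetical helper: characters `from_char` through `to_char` (1-based, inclusive).
# Assumes 1 <= from_char <= to_char <= length(s).
function charrange(s::AbstractString, from_char::Integer, to_char::Integer)
    i = nextind(s, 0, from_char)             # byte index of character `from_char`
    j = nextind(s, i, to_char - from_char)   # byte index of character `to_char`
    return SubString(s, i, j)                # a view, no copy of the data
end

charrange("aαbβ", 2, 3)      # "αb"
charrange("ATATATA", 5, 7)   # "ATA"
```

Because the result is a SubString, it shares memory with the parent string; wrap it in `String(...)` if an independent copy is needed.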

jonathanBieler · April 23, 2019, 5:40pm

You can also convert your string into a vector of char, play with the indices and convert back into a string:

string([c for c in "aαbβ"][4:-1:1]...) == "βbαa"

It’s certainly bad performance-wise, but in many cases it doesn’t matter.
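The same idea applied to extracting a range of characters rather than reversing them, as a small sketch. It uses `collect` and `join` from Base, and the sample string is only illustrative.

```julia
s = "aαbβ"

chars = collect(s)    # Vector{Char}: ['a', 'α', 'b', 'β'], indexed by character
join(chars[2:3])      # "αb": characters 2 and 3, joined back into a String
```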

bkamins · April 23, 2019, 6:50pm

You can also use eachindex on a string to get an iterator of the byte indices that correspond to consecutive characters in the string. Here is an example (along with some performance benchmarks of the different methods proposed):

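# Return a view of characters from_char through to_char (1-based, inclusive).
function charsub(x::AbstractString, from_char, to_char)
    from_idx = 0
    to_idx = 0
    # i counts characters; idx is the byte index where character i starts
    for (i, idx) in enumerate(eachindex(x))
        i == from_char && (from_idx = idx)
        i == to_char && (to_idx = idx; break)
    end
    SubString(x, from_idx, to_idx)
end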

and the benchmarks:

julia> using Random, BenchmarkTools

julia> x = randstring(20)
"vfRVI4vxz4RWKu8VBv1E"

julia> @btime string([c for c in $x][5:7]...)
  194.601 ns (4 allocations: 304 bytes)
"I4v"

julia> @btime last(first($x, 7), 3)
  72.986 ns (2 allocations: 64 bytes)
"I4v"

julia> @btime charsub($x, 5, 7)
  51.919 ns (2 allocations: 48 bytes)
"I4v"
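The benchmark string above is ASCII, so byte and character indices coincide there. As a quick sketch on a non-ASCII sample string (my own choice), this is what eachindex yields and how those indices can be passed to SubString:

```julia
s = "aαbβ"

idx = collect(eachindex(s))     # [1, 2, 4, 5]: byte index where each character starts
SubString(s, idx[2], idx[3])    # "αb": characters 2 through 3
```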
