Julia substring returns empty string

Dear all,
I am trying to extract a substring from a string, but the result is empty:

julia> x="ATATATATATATATTDTDFTFTCVTVTTSKL:J"
"ATATATATATATATTDTDFTFTCVTVTTSKL:J"
julia> s=SubString(x, 5, 3)
""
julia> typeof(x)
String
julia> typeof(s)
SubString{String}
julia> S=x[5:3]
""
julia> typeof(S)
String

What am I getting wrong?

Your last index is smaller than the starting one.
You probably want:

s=SubString(x, 5, 5+3-1)

This gives the substring that starts at index 5 and has length 3.
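
As a small sketch of the start/length conversion: the range 5:3 is empty, which is why the original call returned "". The helper substr_len below is a hypothetical name, not something in Base, and it assumes ASCII input, where byte indices and character positions coincide.

```julia
x = "ATATATATATATATTDTDFTFTCVTVTTSKL:J"

# 5:3 is an empty range, so indexing with it yields an empty string.
empty_range = 5:3
@show isempty(empty_range)

# Hypothetical helper (not part of Base): start index plus length.
# Assumes ASCII input, where byte indices and character positions coincide.
substr_len(s, start, len) = SubString(s, start, start + len - 1)

piece = substr_len(x, 5, 3)
@show piece
```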

help?> SubString
search: SubString SubstitutionString

SubString(s::AbstractString, i::Integer, j::Integer=lastindex(s))
SubString(s::AbstractString, r::UnitRange{<:Integer})

Like getindex, but returns a view into the parent string s within range i:j or r respectively instead of making a copy.

Examples
≡≡≡≡≡≡≡≡≡≡

julia> SubString("abc", 1, 2)
"ab"

julia> SubString("abc", 1:2)
"ab"

julia> SubString("abc", 2)
"bc"


I see, so I need to work with indices rather than START, LENGTH as in R/Python. Thank you.

Note that the indices are byte indices, not character indices. For general UTF-8 strings you should probably use the nextind function to get what you want.

See https://docs.julialang.org/en/latest/manual/strings/#Unicode-and-UTF-8-1 for additional explanations.
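
To make the byte/character distinction concrete, here is a minimal sketch using a short string of my own choosing (it is not from the question) that mixes one-byte and two-byte characters:

```julia
s = "aαbβ"   # 'α' and 'β' each take two bytes in UTF-8

nchars = length(s)             # number of characters
nbytes = ncodeunits(s)         # number of bytes (code units)
valid = collect(eachindex(s))  # byte indices where a character starts

second_char = s[2]             # 'α' starts at byte 2
after_alpha = nextind(s, 2)    # byte index of the character after 'α'
mid_alpha = isvalid(s, 3)      # byte 3 is inside 'α', so not a valid index
```

Indexing with an invalid byte index such as s[3] throws a StringIndexError, which is why arithmetic like start + len - 1 is only safe on ASCII data.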


So would x[5:5+3-1] be a better choice for extracting substrings? Would the indices be byte indices even in this case? Thanks.

Note that if you are using the Julia REPL you can view documentation on functions with ?function_name.


What I want to say is that in Julia there is a difference between character indices and byte indices.

They are equivalent only if your data is pure ASCII. In your example it is, so in that case it is safe to write what you have specified.

If you have UTF-8 data, you first have to decide whether you want byte indices (probably not) or character indices (probably yes). In the latter case you should use functions that operate on characters, like nextind, first, last or chop. In your case you could write, for example, last(first(x, 7), 3) to get a 3-character string consisting of characters 5, 6 and 7 of the original string. This is not the most efficient way to do it, but it is the simplest.
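
A quick sketch of the character-based functions on non-ASCII data (the sample string is an assumption of mine, chosen so that byte and character positions differ):

```julia
s = "aαbβc"

# Characters 2 through 4, counted as characters rather than bytes.
a = last(first(s, 4), 3)

# Same result by dropping one character from each end.
b = chop(s, head = 1, tail = 1)
```

Both give "αbβ". Note that first and last return a new String here, while chop returns a SubString view.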

A more efficient approach would use nextind to calculate the byte indices of the first and last characters you want, and then build a SubString from those indices.
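
One way this could look, as a sketch only: charrange is a hypothetical name, and the code relies on the three-argument form nextind(s, i, n), which advances n characters from byte index i (with i = 0 meaning "before the first character"). It assumes 1 <= from_char <= to_char and that both positions exist in the string; otherwise an error is thrown.

```julia
# Hypothetical helper: substring from character position from_char to to_char
# (both inclusive), returned as a view into s.
function charrange(s::AbstractString, from_char::Integer, to_char::Integer)
    # Byte index where character number from_char starts.
    i = nextind(s, 0, from_char)
    # Byte index where character number to_char starts.
    j = nextind(s, i, to_char - from_char)
    return SubString(s, i, j)
end

ascii_part = charrange("ATATATATATATATTDTDFTFTCVTVTTSKL:J", 5, 7)
utf8_part = charrange("aαbβ", 2, 3)
```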


You can also convert your string into a vector of Char, play with the indices, and convert it back into a string:

string([c for c in "aαbβ"][4:-1:1]...) == "βbαa"

This is certainly bad performance-wise, but in many cases it doesn't matter.

You can also use eachindex on a string to get an iterator of the byte indices that correspond to consecutive characters in the string. Here is an example (along with some performance benchmarks of the different methods proposed):

function charsub(x::AbstractString, from_char, to_char)
    from_idx = 0
    to_idx = 0
    # i counts characters; idx is the byte index where character i starts
    for (i, idx) in enumerate(eachindex(x))
        i == from_char && (from_idx = idx)
        i == to_char && (to_idx = idx; break)
    end
    # returns a view into x, not a copy
    SubString(x, from_idx, to_idx)
end

and the benchmarks:

julia> using Random, BenchmarkTools

julia> x = randstring(20)
"vfRVI4vxz4RWKu8VBv1E"

julia> @btime string([c for c in $x][5:7]...)
  194.601 ns (4 allocations: 304 bytes)
"I4v"

julia> @btime last(first($x, 7), 3)
  72.986 ns (2 allocations: 64 bytes)
"I4v"

julia> @btime charsub($x, 5, 7)
  51.919 ns (2 allocations: 48 bytes)
"I4v"