I am trying to extract a substring from a string but the result is empty:
julia> s=SubString(x, 5, 3)
What am I getting wrong?
Your last index is smaller than the starting one. The third argument of SubString is an end index, not a length.
You probably want:
s=SubString(x, 5, 5+3-1)
This gives the substring that starts at index 5 and has length 3.
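To make the start/length to start/stop conversion explicit, here is a minimal sketch. The helper name substr_len is hypothetical (it is not part of Julia), and it assumes ASCII-only data, so that byte indices and character indices coincide.

```julia
# Hypothetical helper: take `len` characters starting at index `start`.
# Assumes ASCII-only input, where byte and character indices coincide.
substr_len(x::AbstractString, start::Integer, len::Integer) =
    SubString(x, start, start + len - 1)

x = "abcdefghij"
s = substr_len(x, 5, 3)               # same as SubString(x, 5, 7)
println(s)                            # prints efg
println(isempty(SubString(x, 5, 3)))  # prints true: stop < start gives an empty view
```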
search: SubString SubstitutionString
SubString(s::AbstractString, i::Integer, j::Integer=lastindex(s))
Like getindex, but returns a view into the parent string s within range i:j or r respectively instead of making a copy.
julia> SubString("abc", 1, 2)
julia> SubString("abc", 1:2)
julia> SubString("abc", 2)
I see, so I need to work with start and end indices rather than START, LENGTH as in R/Python. Thank you.
Note that the indices are byte indices, not character indices. You should probably use the nextind function to get what you want for general UTF-8 strings.
See https://docs.julialang.org/en/latest/manual/strings/#Unicode-and-UTF-8-1 for additional explanations.
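As a small illustration of the difference, here is a sketch using a made-up string in which some characters take two bytes in UTF-8. The string itself is only an example; the functions used (length, ncodeunits, eachindex, isvalid, nextind) are all in Base.

```julia
s = "aαbβ"                      # 'α' and 'β' take 2 bytes each in UTF-8

println(length(s))              # number of characters: 4
println(ncodeunits(s))          # number of bytes: 6
println(collect(eachindex(s)))  # valid byte indices: [1, 2, 4, 5]

println(isvalid(s, 3))          # false: byte 3 is in the middle of 'α'
println(nextind(s, 2))          # 4: the next valid index after 'α'
println(s[4])                   # b, the third character
```

Indexing with s[3] would throw a StringIndexError here, which is why character positions have to be translated into byte indices first.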
So would x[5:5+3-1] be a better choice for extracting substrings? Would the indices be byte indices even in this case? Thanks
Note that if you are using the Julia REPL you can view documentation on functions by typing ? followed by the function name, for example ?SubString.
What I want to say is that in Julia there is a difference between character indices and byte indices.
They are equivalent only if your data is pure ASCII. In your example it is, so in that case you are safe to write what you have specified.
If you have UTF-8 data, you first have to say whether you mean byte indices (probably not) or character indices (probably yes). In the latter case you should use functions that operate on characters, like chop. In your case you could write, for example, last(first(s, 7), 3) to get a 3-character string consisting of characters 5, 6 and 7 from the original string. This is not the most efficient way to do it, but it is the simplest.
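Here is a short sketch of that idiom on a non-ASCII string, where plain byte indexing would not work. The example string is made up for illustration.

```julia
s = "aαbβcγdδ"             # 8 characters, 12 bytes

# Characters 5, 6 and 7: keep the first 7 characters, then the last 3 of those.
t = last(first(s, 7), 3)
println(t)                 # prints cγd
```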
A more efficient approach would use nextind to calculate the byte indices of the first and last characters of the substring and then call SubString with these indices.
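A minimal sketch of this idea follows. The function name charsub_nextind is hypothetical, and the code relies on the three-argument form nextind(s, i, n), where nextind(s, 0, n) returns the byte index of the n-th character. It assumes 1 <= from_char <= to_char <= length(s) and does no bounds checking of its own.

```julia
# Hypothetical helper: substring from character `from_char` to `to_char`.
# Assumes 1 <= from_char <= to_char <= length(s).
function charsub_nextind(s::AbstractString, from_char::Integer, to_char::Integer)
    # Byte index of the from_char-th character, counting from before the start.
    from_idx = nextind(s, 0, from_char)
    # Advance by the remaining number of characters to reach to_char.
    to_idx = nextind(s, from_idx, to_char - from_char)
    return SubString(s, from_idx, to_idx)
end

println(charsub_nextind("aαbβ", 2, 3))   # prints αb
```

Because SubString returns a view, no characters are copied; only the two byte indices are computed.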
You can also convert your string into a vector of char, play with the indices and convert back into a string:
string([c for c in "aαbβ"][4:-1:1]...) == "βbαa"
It's certainly bad performance-wise, but in many cases it doesn't matter.
You can also use eachindex on a string to get an iterator of the byte indices that correspond to consecutive characters in the string. Here is an example (along with some performance benchmarks of the different methods proposed):
function charsub(x::AbstractString, from_char, to_char)
    from_idx = 0
    to_idx = 0
    # eachindex yields the byte index of each consecutive character
    for (i, idx) in enumerate(eachindex(x))
        i == from_char && (from_idx = idx)
        i == to_char && (to_idx = idx; break)
    end
    SubString(x, from_idx, to_idx)
end
and the benchmarks:
julia> using Random, BenchmarkTools
julia> x = randstring(20)
julia> @btime string([c for c in $x][5:7]...)
194.601 ns (4 allocations: 304 bytes)
julia> @btime last(first($x, 7), 3)
72.986 ns (2 allocations: 64 bytes)
julia> @btime charsub($x, 5, 7)
51.919 ns (2 allocations: 48 bytes)