I used to use length() for strings, but I just noticed it’s not correct for a big string of around 3000 characters.
sizeof() gives the right result.
Why is that?
Thank you in advance.
Do you have Unicode/non-ASCII characters?
To be honest, I have no idea, I am just reading a string from a file.
You can use the `file` command-line utility to test. For a file that’s only ASCII, `file myfile.txt` returns `myfile.txt: ASCII text`, whereas with any Unicode characters it will return something like `myfile.txt: UTF-8 Unicode text`. I’ve been linked to the following article before as a quick reference on the subject if you’re unaware of the difference, but I can’t vouch for it myself:
I’d say that if you are looking for the number of characters, `length` is actually the function you should use. `sizeof` gives you the size of the string in bytes, which is different if you are not using ASCII strings, as others have said.
So why doesn’t length() give the true length? It is short by around 20 characters.
Because a Unicode character (length 1) may take more than one byte to store (sizeof > 1). Now, what’s the “true length” you’d like to know about?
You should read this excellent manual chapter: Strings · The Julia Language.
As an example.
julia> length("α")
1
julia> sizeof("α")
2
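To see where a gap of “around 20” could come from in a longer string, here is a minimal sketch. The sample string is made up for illustration; it only assumes UTF-8 encoded text, which is what Julia’s `String` uses.

```julia
# Made-up sample: three 2-byte Greek letters, a space, three ASCII letters.
str = "αβγ abc"

n_chars = length(str)   # number of characters
n_bytes = sizeof(str)   # number of bytes in the UTF-8 encoding

# Bytes needed by each individual character.
bytes_per_char = [ncodeunits(c) for c in str]

# How many characters are not ASCII (and so need more than one byte).
n_multibyte = count(!isascii, str)

println("characters: ", n_chars)
println("bytes:      ", n_bytes)
println("non-ASCII:  ", n_multibyte)
```

Here `sizeof(str) - length(str)` equals the number of non-ASCII characters only because each of them takes exactly two bytes; characters that need three or four bytes widen the gap further.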
Yes =)) it’s true.
So why is
str[length(str)] != str[end] in this case,
but str[sizeof(str)] == str[end]?
Because string indexing is strange. There are invalid indices. It’s way too complicated to explain in detail and I don’t think I really like it either. You should read the document linked above.
Here’s an example though.
julia> s = "αb"
"αb"
julia> s[1]
'α': Unicode U+03b1 (category Ll: Letter, lowercase)
julia> s[2]
ERROR: StringIndexError("αb", 2)
Stacktrace:
[1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
[2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:219
[3] getindex(::String, ::Int64) at ./strings/string.jl:212
[4] top-level scope at REPL[3]:1
julia> s[3]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
julia> s[lastindex(s)]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
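If you would rather check which indices are valid than run into the error, something like the following sketch may help. It reuses the same `s` as above; the variable names are only for illustration.

```julia
s = "αb"

# Byte positions at which a character starts.
valid = collect(eachindex(s))

# isvalid(s, i) tells whether index i is the start of a character.
checks = [isvalid(s, i) for i in 1:ncodeunits(s)]

# thisind(s, i) moves an arbitrary byte position back to the start
# of the character that contains it.
start_of_char = thisind(s, 2)

println(valid)          # [1, 3]
println(checks)         # Bool[1, 0, 1]
println(start_of_char)  # 1
```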
For me it’s the opposite problem. I can’t reproduce it now, since the string is very long, but length(str) < sizeof(str) and str[sizeof(str)] == str[end].
So what do I have to use to know the true length of a string?
What’s opposite about it?
julia> length("α") < sizeof("α")
true
Again, what do you mean by the “true length” of the string? The number of characters? The number of bytes? Or something else (like the last index)?
But it doesn’t work here:
s = "α"
length(s)
s[sizeof(s)] # error
Well, that’s because the last character of the string from your file is a single-byte (ASCII) character, so its last byte happens to be a valid index. If you want to get the last character, use `lastindex`.
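A small sketch of the difference, with two made-up strings that contain the same characters in a different order:

```julia
a = "αb"   # ends with an ASCII character
b = "bα"   # ends with a 2-byte character

for s in (a, b)
    println(repr(s), ": length = ", length(s),
            ", sizeof = ", sizeof(s),
            ", lastindex = ", lastindex(s))
end

# These agree for any non-empty string, whatever it contains.
same_a = a[end] == a[lastindex(a)] == last(a)
same_b = b[end] == b[lastindex(b)] == last(b)

# Indexing with sizeof only works when the last character is a single byte.
ok_a = isvalid(a, sizeof(a))   # byte 3 is where 'b' starts
ok_b = isvalid(b, sizeof(b))   # byte 3 is the second byte of 'α'
```

So `str[sizeof(str)] == str[end]` holds for `a` but `b[sizeof(b)]` throws a `StringIndexError`, while `lastindex` is safe in both cases.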
thank you. =)) I will use lastindex(), and not length or sizeof.
More hints that may be helpful if you want to get characters out of a non-ASCII string by their position, not only the last one:
In general, do not use `str[i]` to get the `i`-th character of `str`, unless you are sure `str` is made only of ASCII characters. `str[i]` means “the character that starts at the `i`-th code unit of `str`”, which can be different from “the `i`-th character”, and `i` might even point into the middle of a character, which throws an error. Instead:
- `first(str)` to get the first character, or `first(str, n)` for a substring with the first `n` characters.
- `last(str[, n])` for the last (one or `n`) characters.
If you want characters in an intermediate position, you can use `str[c1:c2]`, but first you have to find the positions of the code units (`c1`, `c2`) that refer to the characters you are looking for, so:
- `nextind(str, 0, i)` to get the position of the `i`-th character in the string, starting from the beginning.
- `prevind(str, lastindex(str), i)` to get the position of the character `i` places before the last one, counting from the tail.
- `charinds = collect(eachindex(str))`, such that `charinds[i]` will be the position of the `i`-th character. (But if it is a long string, take into account that you will be allocating a vector of `Int`s as long as your string.)
You can safely iterate through the characters of a string, e.g. in a `for` loop, but do this:
for c in str
# In the i-th iteration, `c` will always be the `i`-th character.
end
Don’t do this:
for i in 1:length(str)
c = str[i] # This may fail, for the reasons explained above.
end
Instead, if you want a counter of the character you are using in each iteration, do:
for (i, c) in enumerate(str)
# In the i-th iteration, `c` will also be the `i`-th character.
end
Or, if for some reason you don’t want `i` to be a counter but the index of the character in the string:
for i in eachindex(str)
c = str[i] # This works, but `i` will not generally be 1, 2, 3...
end
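Putting the hints about intermediate positions together, here is a minimal sketch of taking characters by their character position. `charslice` is a hypothetical helper name, not something from Base, and it assumes `1 <= i <= j <= length(str)`.

```julia
# Hypothetical helper: characters i through j of str, counted by character.
function charslice(str::AbstractString, i::Integer, j::Integer)
    c1 = nextind(str, 0, i)   # index of the i-th character
    c2 = nextind(str, 0, j)   # index of the j-th character
    return str[c1:c2]
end

str = "αβγδε"

middle = charslice(str, 2, 4)   # "βγδ"
head = first(str, 2)            # "αβ"
tail = last(str, 2)             # "δε"

println(middle, " ", head, " ", tail)
```

Each call to `nextind(str, 0, i)` walks the string from the start, so for many lookups in the same long string it may be cheaper to build the index vector once with `collect(eachindex(str))`.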
`str[i]` is perfectly fine for all strings, but you should only use an `i` that is returned by a function that produces valid indices. For example, `eachindex`, `nextind`, `prevind`, `findfirst`, etcetera.
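As a short sketch of that idea, every index below comes from one of those functions, so the indexing never lands in the middle of a character. The string and variable names are made up for illustration.

```julia
s = "αβγ"

# Index at which the first 'β' starts (a byte position, not a character count).
i = findfirst(isequal('β'), s)

# Step to the neighbouring characters using valid indices only.
prev_char = s[prevind(s, i)]
next_char = s[nextind(s, i)]

println(i, " ", prev_char, " ", next_char)   # 3 α γ
```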
What exactly do you want to know?
The number of bytes making up the string? sizeof(str)
The number of characters in the string? length(str)
Which is the “true” length? I don’t know, it depends on what you are doing with it. You probably want bytes if you are saving it, and you probably want length if you are displaying it.
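For the saving case, a quick sketch (using an in-memory `IOBuffer` as a stand-in for a file, and a made-up sample string) shows that the number of bytes written matches `sizeof` and not `length`:

```julia
str = "αβγ abc"

io = IOBuffer()
n_written = write(io, str)   # write returns the number of bytes written
bytes = take!(io)            # the raw bytes that were stored

println("written: ", n_written, ", sizeof: ", sizeof(str), ", length: ", length(str))
```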