Why length(str) != sizeof(str)?

BMval · March 16, 2020, 10:29pm

I used to use length() for strings, but I just noticed that it gives the wrong result for a big string, around 3000 characters.
sizeof() gives the right result.

Why is that?

Thank you in advance.

yuyichao · March 16, 2020, 10:36pm

Do you have Unicode/non ASCII?
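If you are not sure, one quick way to check from within Julia is isascii. This is only a sketch: the two strings below are made-up stand-ins for whatever you read from your file.

```julia
# Hypothetical sample strings standing in for the file contents.
plain = "abc"
mixed = "αbc"

isascii(plain)   # true: every character fits in one byte
isascii(mixed)   # false: 'α' takes two bytes in UTF-8

# For ASCII-only text the two measures agree; otherwise they differ.
length(plain) == sizeof(plain)   # true  (3 characters, 3 bytes)
length(mixed) == sizeof(mixed)   # false (3 characters, 4 bytes)
```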

BMval · March 16, 2020, 10:39pm

To be honest, I have no idea, I am just reading string from a file.

non-Jedi · March 16, 2020, 11:06pm

You can use the file command-line utility to test. For a file that’s only ASCII, file myfile.txt returns myfile.txt: ASCII text, whereas with any non-ASCII characters it will return something like myfile.txt: UTF-8 Unicode text. I’ve been linked to the following article before as a quick reference on the subject if you’re unaware of the difference, but I can’t vouch for it myself:

heliosdrm · March 16, 2020, 11:19pm

I’d say that if you are looking for the number of characters, length is actually the function you should use. sizeof gives you the size of the string in bytes, which is different if you are not using ASCII strings, as others have said.

BMval · March 16, 2020, 11:21pm

So why doesn’t length() give the true length? It is about 20 characters less.

yuyichao · March 16, 2020, 11:25pm

Because a Unicode character (length 1) may take more than one byte to store (sizeof > 1). Now, what’s the “true length” you’d like to know about?

swissr · March 16, 2020, 11:28pm

You should read this excellent manual chapter: Strings · The Julia Language.

yuyichao · March 16, 2020, 11:28pm

As an example.

julia> length("α")
1

julia> sizeof("α")
2
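A slightly longer sketch of the same thing, with a made-up string that mixes 2-byte and 1-byte characters. It assumes a reasonably recent Julia version, where ncodeunits also accepts a Char.

```julia
s = "αβγ abc"   # three Greek letters (2 bytes each), a space, three ASCII letters

length(s)       # 7 characters
sizeof(s)       # 10 bytes (3*2 + 1 + 3)
ncodeunits(s)   # 10 code units; for a String this equals sizeof

# Bytes used by individual characters
ncodeunits('α')   # 2
ncodeunits('a')   # 1

# Adding up the bytes of every character gives sizeof again.
total = sum(ncodeunits, s)   # 10
```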

BMval · March 16, 2020, 11:29pm

Yes =)) that’s true.
So why is
str[length(str)] != str[end] in this case,
but str[sizeof(str)] == str[end]?

yuyichao · March 16, 2020, 11:31pm

Because string indexing is strange. There are invalid indices. It’s way too complicated to explain in detail and I don’t think I really like it either. You should read the document linked above.

Here’s an example though.

julia> s = "αb"
"αb"

julia> s[1]
'α': Unicode U+03b1 (category Ll: Letter, lowercase)

julia> s[2]
ERROR: StringIndexError("αb", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:219
 [3] getindex(::String, ::Int64) at ./strings/string.jl:212
 [4] top-level scope at REPL[3]:1

julia> s[3]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

julia> s[lastindex(s)]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
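If you want to check whether an index is valid before using it, the following sketch (same s as above) shows isvalid, thisind and eachindex from Base:

```julia
s = "αb"

isvalid(s, 1)   # true:  'α' starts at byte 1
isvalid(s, 2)   # false: byte 2 is the second byte of 'α'
isvalid(s, 3)   # true:  'b' starts at byte 3

thisind(s, 2)   # 1, the start of the character that byte 2 belongs to

valid = collect(eachindex(s))   # [1, 3], all valid indices of s
```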

BMval · March 16, 2020, 11:32pm

For me it’s the opposite problem. I can’t reproduce it now, since the string is very long, but length(str) < sizeof(str) and str[sizeof(str)] == str[end].

So what do I have to use to know the true length of a string?

yuyichao · March 16, 2020, 11:33pm

What’s opposite about it?

julia> length("α") < sizeof("α")
true

yuyichao · March 16, 2020, 11:34pm

Again, what do you mean by “true length” of string. The number of characters? The number of bytes? Or something else (like the last index)?

BMval · March 16, 2020, 11:34pm

But it doesn’t work here:

s = "α"
length(s)
s[sizeof(s)]  # error

yuyichao · March 16, 2020, 11:35pm

Well, that’s because the last character of your long string is ASCII (a single byte), so sizeof(str) happens to be a valid index there. If you want the index of the last character, use lastindex.
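A small sketch of the difference, using two made-up strings that differ only in which character comes last:

```julia
s = "αb"   # ends in a 1-byte character
t = "bα"   # ends in a 2-byte character

sizeof(s) == lastindex(s)   # true  (3 == 3), so s[sizeof(s)] works
sizeof(t) == lastindex(t)   # false (3 vs 2), so t[sizeof(t)] throws a StringIndexError

s[lastindex(s)] == s[end]   # true
t[lastindex(t)] == t[end]   # true, both are 'α'
```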

BMval · March 16, 2020, 11:37pm

thank you. =)) I will use lastindex(), and not length or sizeof.

heliosdrm · March 17, 2020, 11:38am

More hints that may be helpful if you want to get characters out of a non-ASCII string by their position (not only the last one):

In general, do not use str[i] to get the i-th character of str, unless you are sure str is made only of ASCII characters. str[i] means “the character that starts at the i-th code unit (byte) of str”, which can be different from “the i-th character”, and it throws an error if i falls in the middle of a multi-byte character. Instead:

Use first(str) to get the first character, or first(str, n) for a substring with the first n characters.
Likewise, use last(str[, n]) for the last (one or n) characters.
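For instance, with a made-up string of four 2-byte characters:

```julia
str = "αβγδ"

first(str)      # 'α'  (a Char)
first(str, 2)   # "αβ" (a String)
last(str)       # 'δ'
last(str, 2)    # "γδ"
```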

If you want characters in an intermediate position, you can use str[c1:c2], but first you have to find the positions of the code units (c1, c2) where the characters you are looking for start, so:

Use nextind(str, 0, i) to get the position of the i-th character in the string, starting from the beginning.
Use prevind(str, ncodeunits(str) + 1, i) to get the position of the i-th character in the string, starting from the tail. (prevind(str, lastindex(str), i) lands one character further back.)
If you are going to look for many characters in the string, you can make a vector with the code unit position of each character, with charinds = collect(eachindex(str)), such that charinds[i] will be the position of the i-th character. (But if it is a long string, take into account that you will be allocating a vector of Ints with one entry per character.)
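Here is a rough sketch of those three approaches on a made-up string. Note that, to count from the tail so that i = 1 means the last character, it starts prevind one position past the final byte, i.e. at ncodeunits(str) + 1.

```julia
str = "αβγδε"   # five 2-byte characters; valid indices are 1, 3, 5, 7, 9

c1 = nextind(str, 0, 2)   # 3, index of the 2nd character
c2 = nextind(str, 0, 4)   # 7, index of the 4th character
sub = str[c1:c2]          # "βγδ"

# Counting from the tail, starting one past the last byte
t1 = prevind(str, ncodeunits(str) + 1, 1)   # 9, index of the last character
t2 = prevind(str, ncodeunits(str) + 1, 2)   # 7, index of the 2nd character from the end

# Precomputed table of character positions
charinds = collect(eachindex(str))   # [1, 3, 5, 7, 9]
third = str[charinds[3]]             # 'γ'
```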

You can safely iterate through the characters of a string, e.g. in a for loop, but do this:

for c in str
  # In the i-th iteration, `c` will always be the `i`-th character.
end

Don’t do this:

for i in 1:length(str)
  c = str[i]   # This may fail, for the reasons explained above.
end

Instead, if you want a counter of the character you are using in each iteration, do:

for (i, c) in enumerate(str)
  # In the i-th iteration, `c` will also be the `i`-th character.
end

or if for some reason you don’t want i to be a counter, but the index of the character in the string:

for i in eachindex(str)
  c = str[i] # This works, but `i` will not generally be 1, 2, 3...
end
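If you want the index and the character together, pairs(str) should also work. A minimal sketch, with a made-up string and hypothetical variable names:

```julia
str = "αbγ"   # 'α' starts at byte 1, 'b' at byte 3, 'γ' at byte 4

indices = Int[]
chars = Char[]
for (i, c) in pairs(str)
    # `i` is always a valid index and `c == str[i]`
    push!(indices, i)
    push!(chars, c)
end

indices   # [1, 3, 4]
chars     # ['α', 'b', 'γ']
```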

stevengj · March 17, 2020, 12:51pm

str[i] is perfectly fine for all strings, but you should only use an i that is returned by a function that produces valid indices. For example, eachindex, nextind, prevind, findfirst, etcetera.
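As an illustration (a sketch with a made-up string), indices that come from search functions can be used directly, and nextind/prevind step from one valid index to the next:

```julia
s = "αβγ abc"

i = findfirst(isequal('a'), s)   # 8, a valid index
s[i]                             # 'a'

r = findfirst("abc", s)          # 8:10, a range of valid indices
s[r]                             # "abc"

s[nextind(s, i)]                 # 'b', the character after 'a'
s[prevind(s, i)]                 # ' ', the character before 'a'
```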

pixel27 · March 17, 2020, 1:32pm

What exactly do you want to know?

The number of bytes making up the string? sizeof(str)
The number of characters in the string? length(str)

Which is the “true” length? I don’t know, depends on what you are doing with it…you probably want bytes if you are saving it, you probably want length if you are displaying it…
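A small sketch of the "saving" case: writing a string to an in-memory buffer (used here instead of a real file) reports the number of bytes written, which matches sizeof rather than length.

```julia
s = "αβγ abc"

io = IOBuffer()
nbytes = write(io, s)   # number of bytes written

nbytes == sizeof(s)     # true  (10 bytes)
nbytes == length(s)     # false (7 characters)
```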
