Why is length(str) != sizeof(str)?

I used to use length() for strings, but I just noticed that it doesn’t give the right result for a big string, around 3000 characters.
sizeof() gives the right result.

Why is that?

Thank you in advance.

Do you have Unicode/non-ASCII characters in the string?

To be honest, I have no idea. I am just reading a string from a file.

You can use the file command-line utility to test. For a file that’s only ASCII, file myfile.txt returns myfile.txt: ASCII text, whereas with any non-ASCII Unicode characters it will return something like myfile.txt: UTF-8 Unicode text. I’ve been linked to the following article before as a quick reference on the subject if you’re unaware of the difference, but I can’t vouch for it myself:

I’d say that if you are looking for the number of characters, length is actually the function you should use. sizeof gives you the size of the string in bytes, which is different if you are not using ASCII strings, as others have said.
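If you would rather check from inside Julia than with the file utility, isascii can tell you whether a string is pure ASCII. A minimal sketch, where both sample strings are made up for illustration and are not the original poster's data:

```julia
# Hypothetical sample strings, chosen only for illustration.
ascii_str = "hello"
mixed_str = "αβγ abc"   # three 2-byte Greek letters, a space, three ASCII letters

isascii(ascii_str)   # true
isascii(mixed_str)   # false

length(mixed_str)    # 7  (characters)
sizeof(mixed_str)    # 10 (bytes: 3*2 + 1 + 3)

# For a pure ASCII string the two counts agree.
length(ascii_str) == sizeof(ascii_str)   # true
```

So a mismatch between length and sizeof is itself a sign that the string contains non-ASCII characters.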

So why doesn’t length() give the true length? It is about 20 characters short.

Because a Unicode character (length 1) may take more than one byte to store (sizeof > 1). Now, what’s the “true length” you’d like to know about?

You should read this excellent manual chapter: Strings · The Julia Language.

As an example.

julia> length("α")
1

julia> sizeof("α")
2

Yes =)) that’s true.
So why is
str[length(str)] != str[end] in this case,
but str[sizeof(str)] == str[end]?

Because string indexing is strange: there are invalid indices. It’s way too complicated to explain in detail, and I don’t think I really like it either. You should read the document linked above.

Here’s an example though.

julia> s = "αb"
"αb"

julia> s[1]
'α': Unicode U+03b1 (category Ll: Letter, lowercase)

julia> s[2]
ERROR: StringIndexError("αb", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex_continued(::String, ::Int64, ::UInt32) at ./strings/string.jl:219
 [3] getindex(::String, ::Int64) at ./strings/string.jl:212
 [4] top-level scope at REPL[3]:1

julia> s[3]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

julia> s[lastindex(s)]
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)

For me it’s the opposite problem. I can’t reproduce it now, since the string is very long, but length(str) < sizeof(str) and str[sizeof(str)] == str[end].

So what do I have to use to get the true length of a string?

What’s opposite about it?

julia> length("α") < sizeof("α")
true

Again, what do you mean by the “true length” of a string? The number of characters? The number of bytes? Or something else (like the last index)?

But it doesn’t hold here:

s = "α"
length(s)
s[sizeof(s)]  # error

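Well, that’s because the last character of your long string is a single-byte (ASCII) character, so sizeof(str) happens to be a valid index there. If you want to get the last character, use lastindex: str[lastindex(str)], which is the same as str[end].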
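A small sketch of how the three numbers can all differ. The string here is made up, chosen so that it ends in a two-byte character:

```julia
s = "αbβ"   # hypothetical example: 'α' at bytes 1-2, 'b' at byte 3, 'β' at bytes 4-5

n_chars = length(s)      # 3
n_bytes = sizeof(s)      # 5
last_i  = lastindex(s)   # 4, where the last character starts

s[last_i]          # 'β', same as s[end]
s[n_chars]         # 'b': a valid index, but not the last character
isvalid(s, n_bytes)   # false: byte 5 is the second byte of 'β', so s[5] would throw
```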

thank you. =)) I will use lastindex(), and not length or sizeof.

More hints that may be helpful if you want to get characters out of a non-ASCII string by their position, not only the last one:

In general, do not use str[i] to get the i-th character of str unless you are sure str contains only ASCII characters. The index in str[i] counts “code units” (bytes, for a String), not characters, so it can point to a different character than the i-th one, or into the middle of a character, which throws an error. Instead:

  • Use first(str) to get the first character, or first(str, n) for a substring with the first n characters.
  • Likewise, use last(str[, n]) for the last (one or n) characters.
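For instance, on a made-up string of two-byte characters, first and last count in characters, so there are no indices to get wrong:

```julia
str = "αβγδε"   # hypothetical example; every character takes 2 bytes

first(str)       # 'α'   (a Char)
first(str, 2)    # "αβ"  (a String)
last(str)        # 'ε'
last(str, 2)     # "δε"
```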

If you want characters in an intermediate position, you can use str[c1:c2], but first you have to find the code-unit positions (c1, c2) of the characters you are looking for, so:

  • Use nextind(str, 0, i) to get the position of the i-th character in the string, starting from the beginning.
  • Use prevind(str, ncodeunits(str) + 1, i) to get the position of the i-th character in the string, counting from the tail (i = 1 is the last character).
  • If you are going to look for many characters in the string, you can make a vector with the code-unit position of each character, with charinds = collect(eachindex(str)), such that charinds[i] is the position of the i-th character. (But for a long string, keep in mind that this allocates a vector with one Int per character.)
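Here is a sketch of those lookups on a made-up string. Counting from the tail starts at ncodeunits(str) + 1, one past the last byte, so that a count of 1 lands on the last character:

```julia
str = "αβγδε"   # hypothetical example; characters start at bytes 1, 3, 5, 7, 9

# Position of the 3rd character from the beginning.
i3 = nextind(str, 0, 3)    # 5
str[i3]                    # 'γ'

# Substring from the 2nd to the 4th character.
c1 = nextind(str, 0, 2)    # 3
c2 = nextind(str, 0, 4)    # 7
sub = str[c1:c2]           # "βγδ"

# Position of the 2nd character counting from the tail.
j2 = prevind(str, ncodeunits(str) + 1, 2)   # 7
str[j2]                    # 'δ'

# Lookup table of all character positions.
charinds = collect(eachindex(str))          # [1, 3, 5, 7, 9]
str[charinds[5]]           # 'ε'
```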

You can safely iterate through the characters of a string, e.g. in a for loop, but do this:

for c in str
  # In the i-th iteration, `c` will always be the `i`-th character.
end

Don’t do this:

for i in 1:length(str)
  c = str[i]   # This may fail, for the reasons explained above.
end

Instead, if you want a counter of the character you are using in each iteration, do:

for (i, c) in enumerate(str)
  # In the i-th iteration, `c` will also be the `i`-th character.
end

Or, if for some reason you want i to be the index of the character in the string rather than a counter:

for i in eachindex(str)
  c = str[i] # This works, but `i` will not generally be 1, 2, 3...
end
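To make the difference between the counter and the index concrete, here is a sketch that collects both for a made-up string. Wrapping eachindex in enumerate is just one way to get the two side by side:

```julia
str = "αbγ"   # hypothetical example: 'α' at bytes 1-2, 'b' at byte 3, 'γ' at bytes 4-5

counters = Int[]
indices  = Int[]
chars    = Char[]

# n counts characters (1, 2, 3, ...); i is the valid string index of that character.
for (n, i) in enumerate(eachindex(str))
    push!(counters, n)
    push!(indices, i)
    push!(chars, str[i])
end

counters   # [1, 2, 3]
indices    # [1, 3, 4]
chars      # ['α', 'b', 'γ']
```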

str[i] is perfectly fine for all strings, but you should only use an i that is returned by a function that produces valid indices. For example, eachindex, nextind, prevind, findfirst, etcetera.
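As a sketch of that, on a made-up key/value string: the index returned by findfirst can be used directly, and nextind and prevind step from it to the neighbouring characters.

```julia
str = "αβγ=δ"   # hypothetical example; '=' sits at byte 7

i = findfirst(isequal('='), str)   # 7, a valid index
str[i]                             # '='

before = prevind(str, i)           # 5, start of 'γ'
after  = nextind(str, i)           # 8, start of 'δ'

key   = str[1:before]              # "αβγ"
value = str[after:end]             # "δ"
```

Naive arithmetic would not work here: i - 1 is 6, which falls in the middle of 'γ' and is not a valid index.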

What exactly do you want to know?

The number of bytes making up the string? sizeof(str)
The number of characters in the string? length(str)

Which is the “true” length? I don’t know, depends on what you are doing with it…you probably want bytes if you are saving it, you probably want length if you are displaying it…
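A quick sketch of the saving case, writing a made-up string to an in-memory buffer instead of a real file:

```julia
str = "αβγ"   # hypothetical example: 3 characters, 6 bytes

io = IOBuffer()
nbytes = write(io, str)   # write returns the number of bytes written

nbytes                    # 6, the same as sizeof(str)
length(str)               # 3, the number of characters
```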
