Space efficiency of strings

tk3369 · December 27, 2017, 6:53am

Seems very inefficient when the same string is being filled many times in an array. Wouldn’t we want to store just the pointers?

julia> x = fill("a", 1); Base.summarysize(x)
17

julia> x = fill("a", 1000); Base.summarysize(x)
17000

julia> x = fill("ab", 1000); Base.summarysize(x)
18000

Further, why does it take 17 bytes to store a single-character… I’m guessing - a 64-bit address + 64-bit length + 1 byte data.

Tamas_Papp · December 27, 2017, 7:37am

I think it is the same string, ie x[i] ≡ x[j] for all indices, but perhaps Base.summarysize does not recognize this?

ScottPJones · December 27, 2017, 3:50pm

I wouldn’t use summarysize at all, it’s results are very misleading. This has been discussed elsewhere,
but I don’t remember quite where (and I don’t have time to dig for it, wife wants me to do housework!!! )
it actually takes a good deal more than 17 bytes to store a single character, although on master (and maybe v0.6.x?) it is much less, due to a very nice change by Jeff Bezanson, to allocate strings using a special type of vector. I think the it is something like (2*sizeof(Ptr} + n + allocation_chunk_size - 1) & ~allocation_chunk_size, for smaller values of n, but I haven’t dug through the new code yet to figure out all the details).
fill is makes all entries point to the same string, which is exactly what you want for immutable strings.
the space taken will depend on what characters you have in your string, because String uses Unicode code points (as well as allowing invalid sequences) encoded with UTF-8, a single character may take from 1 to 4 bytes each (ASCII is always 1 byte, any accented characters for Western languages tend to take 2 bytes, and all the rest, such as Indian languages, Cyrillic, Greek, Arabic, Hebrew, etc, the most common Korean, Chinese and Japanese characters, as well as math symbols, etc. usually take 3 bytes, and things like Emoji take 4 bytes)

tobias.knopp · December 27, 2017, 4:04pm

Have a look at the help string of fill:

  of floats, with each element initialized to 1.0.

Hence the purpose of fill is not to create a sparse structure but to actuall fill an array densely. There is nothing special to strings here. You can try fill(1,10) and it will create a length 10 vector filled with ones.

ScottPJones · December 27, 2017, 4:15pm

I think you must be correct, that it is incorrectly adding up sizeof of each string (plus the 16 byte overhead at the beginning, which is at least an improvement over the previous behavior, of saying a 0 length string took 0 bytes,
but not actually how much memory was allocated (when at the time the overhead was actually 40 bytes, and it allocated either 48 or 64 (I can’t recall which) for each a short (0-8 or 0-24 byte long string - which might be only 2 or 6 characters represented as UTF-8).

It would need to keep a set of the pointers to all objects seen, including strings, so that they are only processed once, recursing over the structure you call summarysize on.

tshort · December 27, 2017, 6:19pm

For space efficiency with repeat values, you could try the PooledArrays.jl or CategoricalArrays.jl packages.

xiaodai · January 30, 2018, 12:40am

I have been experimenting with ShortStrings.jl. It works pretty well for strings less 15 bytes long or 7 bytes long.

ScottPJones · February 8, 2018, 9:16pm

I took a look at it, looks nice.
Note: I might recommend putting the 4 bits of length at the top of the 16-byte structure.

That might make life a bit easier if you later want to use SIMD instructions that operate on 16-byte chunks,
or also for working on 8-bytes at a time (because it will keep the bytes aligned). Also, I think it makes it easier to deal with storing 7 16-bit characters, or 3 32-bit characters), everything will be aligned, and less shifting and masking necessary. (i.e. len = (x >>> 124) gets the number of code units out of a UInt128.
You could even use 4 more bits to indicate things like whether the string is valid, if it is binary data, ASCII only, Latin1 only, UTF-8, UCS2, UTF-16, or UTF-32.
You could even pack 5 Unicode codepoints into a UInt128, by only using 21 bits for each which would still leave you 23 bits for “metadata” about the string.

xiaodai · February 8, 2018, 9:30pm

Good advice I am sure. But I don’t feel I understand the pros and cons. Specifically, I don’t understand align

ScottPJones · February 8, 2018, 9:45pm

If you have things aligned (esp. for 2 byte and 4 byte codeunits), you can take advantage of the SIMD instructions, to operate on 7 or 3 (for 16-byte values) codeunits in parallel, or even more with with 256-bit and 512-bit registers, such as in AVX2 and AVX/512.

Topic		Replies	Views
Implementation Details of Vector (or Array in General) and String Internals & Design	1	533	August 9, 2021
Understanding `sizeof` return values on `Char` / `String` General Usage question , strings , char	2	1422	October 20, 2021
Why are Julia strings immutable? Internals strings	13	951	September 28, 2024
Why Base.summarysize show pointers twice? General Usage memory , memory-allocation	1	96	December 16, 2024
What is difference between "a" and 'a'? New to Julia question , strings	6	1130	October 6, 2019

Space efficiency of strings

Related topics