Space efficiency of strings


#1

Seems very inefficient when the same string is being filled many times in an array. Wouldn’t we want to store just the pointers?

julia> x = fill("a", 1); Base.summarysize(x)
17

julia> x = fill("a", 1000); Base.summarysize(x)
17000

julia> x = fill("ab", 1000); Base.summarysize(x)
18000

Further, why does it take 17 bytes to store a single-character… I’m guessing - a 64-bit address + 64-bit length + 1 byte data.


#2

I think it is the same string, ie x[i] ≡ x[j] for all indices, but perhaps Base.summarysize does not recognize this?


#3
  1. I wouldn’t use summarysize at all, it’s results are very misleading. This has been discussed elsewhere,
    but I don’t remember quite where (and I don’t have time to dig for it, wife wants me to do housework!!! :slight_smile: )

  2. it actually takes a good deal more than 17 bytes to store a single character, although on master (and maybe v0.6.x?) it is much less, due to a very nice change by Jeff Bezanson, to allocate strings using a special type of vector. I think the it is something like (2*sizeof(Ptr} + n + allocation_chunk_size - 1) & ~allocation_chunk_size, for smaller values of n, but I haven’t dug through the new code yet to figure out all the details).

  3. fill is makes all entries point to the same string, which is exactly what you want for immutable strings.

  4. the space taken will depend on what characters you have in your string, because String uses Unicode code points (as well as allowing invalid sequences) encoded with UTF-8, a single character may take from 1 to 4 bytes each (ASCII is always 1 byte, any accented characters for Western languages tend to take 2 bytes, and all the rest, such as Indian languages, Cyrillic, Greek, Arabic, Hebrew, etc, the most common Korean, Chinese and Japanese characters, as well as math symbols, etc. usually take 3 bytes, and things like Emoji take 4 bytes)


#4

Have a look at the help string of fill:

  of floats, with each element initialized to 1.0.

Hence the purpose of fill is not to create a sparse structure but to actuall fill an array densely. There is nothing special to strings here. You can try fill(1,10) and it will create a length 10 vector filled with ones.


#5

I think you must be correct, that it is incorrectly adding up sizeof of each string (plus the 16 byte overhead at the beginning, which is at least an improvement over the previous behavior, of saying a 0 length string took 0 bytes,
but not actually how much memory was allocated (when at the time the overhead was actually 40 bytes, and it allocated either 48 or 64 (I can’t recall which) for each a short (0-8 or 0-24 byte long string - which might be only 2 or 6 characters represented as UTF-8).

It would need to keep a set of the pointers to all objects seen, including strings, so that they are only processed once, recursing over the structure you call summarysize on.


#6

For space efficiency with repeat values, you could try the PooledArrays.jl or CategoricalArrays.jl packages.


#7

I have been experimenting with ShortStrings.jl. It works pretty well for strings less 15 bytes long or 7 bytes long.


#8

I took a look at it, looks nice.
Note: I might recommend putting the 4 bits of length at the top of the 16-byte structure.

That might make life a bit easier if you later want to use SIMD instructions that operate on 16-byte chunks,
or also for working on 8-bytes at a time (because it will keep the bytes aligned). Also, I think it makes it easier to deal with storing 7 16-bit characters, or 3 32-bit characters), everything will be aligned, and less shifting and masking necessary. (i.e. len = (x >>> 124) gets the number of code units out of a UInt128.
You could even use 4 more bits to indicate things like whether the string is valid, if it is binary data, ASCII only, Latin1 only, UTF-8, UCS2, UTF-16, or UTF-32.
You could even pack 5 Unicode codepoints into a UInt128, by only using 21 bits for each :grinning: which would still leave you 23 bits for “metadata” about the string.


#9

Good advice I am sure. But I don’t feel I understand the pros and cons. Specifically, I don’t understand align


#10

If you have things aligned (esp. for 2 byte and 4 byte codeunits), you can take advantage of the SIMD instructions, to operate on 7 or 3 (for 16-byte values) codeunits in parallel, or even more with with 256-bit and 512-bit registers, such as in AVX2 and AVX/512.