Creation of large strings much slower on v0.6 than v0.5.0

I found v0.6 to be about 50x slower than v0.5.0 at creating a 2GB string:

v"0.5.0":

julia> @timev v = String(Vector{UInt8}(2^31)) ;
  0.049646 seconds (8 allocations: 2.000 GB, 99.89% gc time)
elapsed time (ns): 49645505
gc time (ns):      49588903
bytes allocated:   2147483984
pool allocs:       7
malloc() calls:    1
GC pauses:         1

v"0.6.0-dev.2486"

julia> @timev v = String(Vector{UInt8}(2^31)) ;
  2.545837 seconds (8 allocations: 4.000 GiB, 16.09% gc time)
elapsed time (ns): 2545837251
gc time (ns):      409689388
bytes allocated:   4294967744
pool allocs:       6
non-pool GC allocs:1
malloc() calls:    1
GC pauses:         2
full collections:  1

Note: v0.6 seems to allocate twice as much memory as v0.5.0 (4.000 GiB vs. 2.000 GB in the output above). I suspect it allocates a completely new object and copies the data, instead of pointing at the same memory as under v0.5.

If you want to create a string from data without making a copy in 0.6, use an IOBuffer. Strings are no longer wrappers around arrays, so String(Vector) makes a copy.

(There are also some low-level string-allocation routines, but they haven’t been exported yet.)
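
For reference, a minimal sketch of the IOBuffer route described above. The chunk contents and loop count are only illustrative, and whether take! really hands the bytes over without a copy depends on the Julia version, so check with @timev on your own build rather than taking it from this snippet.

```julia
# Build a String through an IOBuffer instead of a Vector{UInt8}.
io = IOBuffer()
for i in 1:4
    write(io, "0123456789abcdef")   # append 16 bytes per iteration
end

# take! returns the buffer's bytes and resets the buffer;
# String(...) then turns those bytes into a string.
s = String(take!(io))

@assert sizeof(s) == 64
println(s)
```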

OK, but unfortunately, that is still showing up as much slower than on v0.5 (>23x slower).

v"0.6.0-dev.2486"

julia> @timev x = String(take!(io)) ;
  0.273760 seconds (6 allocations: 2.000 GiB, 99.98% gc time)
elapsed time (ns): 273759578
gc time (ns):      273706265
bytes allocated:   2147484000
pool allocs:       5
non-pool GC allocs:1
GC pauses:         1

v"0.5.0":

julia> @timev y = takebuf_string(x) ;
  0.011679 seconds (7 allocations: 2.000 GB, 99.62% gc time)
elapsed time (ns): 11679302
gc time (ns):      11634493
bytes allocated:   2147483968
pool allocs:       6
malloc() calls:    1
GC pauses:         1
full collections:  1

Those used an IOBuffer created by this function:

function createbuf()
    s = b"0123456789abcdef"      # 16-byte chunk
    io = IOBuffer(2^31)
    for i = 1:2^27               # 2^27 chunks * 16 bytes = 2^31 bytes
        write(io, s)
    end
    io
end

You’re effectively measuring the GC time (see the high percentages), which is highly variable. Run it a number of times and you’ll see 100x variations on both versions. Turn off GC, and both versions are pretty comparable on my system — about 100 microseconds.
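
One way to take the GC out of the measurement, as a rough sketch. It is written for Julia 1.x, where the calls are GC.gc() and GC.enable(false); on 0.5/0.6 the equivalents are gc() and gc_enable(false). The helper names time_without_gc, to_string and fresh_buffer are made up for this example, and the 1 MiB payload is deliberately much smaller than the 2 GB case in this thread.

```julia
# Time f(x) with the garbage collector switched off, then restore it.
function time_without_gc(f, x)
    GC.gc()                     # collect first so we start from a clean heap
    prev = GC.enable(false)     # returns the previous GC state
    try
        return @elapsed f(x)
    finally
        GC.enable(prev)         # restore whatever state was in effect before
    end
end

to_string(io) = String(take!(io))

data = rand(UInt8, 2^20)        # 1 MiB of test data
fresh_buffer() = (io = IOBuffer(); write(io, data); io)

time_without_gc(to_string, fresh_buffer())       # warm-up run, includes compilation
t = time_without_gc(to_string, fresh_buffer())   # the run we actually report
println("String(take!(io)) with GC disabled: ", t, " s")
```

Repeating the timed call a few times and looking at the spread gives a better picture than any single number.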

OK, on my system, I was getting pretty consistent times, and I ran them over 10 times each.
How much memory is available on the system where you saw such variation? (I saw very little variation, no more than 10-20%, not 100x.)
I’ll try again turning off the GC though.

OK, after running many tests, both with gc enabled and disabled (inserting manual calls to gc() to head off a segmentation fault [Julia really needs to do better when it runs low on memory, IMO]), the times using takebuf_string(v) / String(take!(v)) are roughly comparable. There is still a huge change when you create a string from a vector, though (which we do all the time in our code): it has gone from O(1) to O(n) because of the copy.

Also, for some reason, simply creating the 2GB vector is somewhat slower under v0.6. In v0.5 it takes around 7 microseconds on average with gc off and 9 milliseconds on average with gc on; under v0.6 it takes 9.6 microseconds with gc off and 9.5 milliseconds with gc on [20 runs]. That's a 37% increase when the gc is disabled. Are you aware of any change that might be causing that?

I think that O(1) to O(n) change for creating strings from Vector{UInt8} needs to be added prominently to NEWS.md, along with the recommendation (when possible!) to use IOBuffer instead of Vectors to build strings (however, that can be a very substantial, if even possible, change to code).

In JSON, we also created strings from Vector{UInt8}. The documentation continues to say

help?> String(UInt8[])
  String(v::Vector{UInt8})

  Create a new String from a vector v of bytes containing UTF-8 encoded
  characters. This function takes "ownership" of the array, which means that
  you should not subsequently modify v (since strings are supposed to be
  immutable in Julia) for as long as the string exists.

  If you need to subsequently modify v, use String(copy(v)) instead.

so I believe it should still be possible to do this without making a copy. Is there a way to allocate a vector that can be made into a String in O(1) time? I am under the impression that such vectors exist, because Vector{UInt8}(::String) can run in O(1) time.

Never mind, I’ve found the solution. Vectors allocated with Base.StringVector(...) should have String-compatible layout. This should probably be documented.

https://github.com/JuliaLang/julia/issues/19945
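
A small sketch of what that looks like in practice. Base.StringVector is unexported and internal, so its behaviour may change between versions; the O(1) conversion is the claim from this thread, not something the snippet itself proves. The length and fill byte are arbitrary.

```julia
n = 16
v = Base.StringVector(n)   # Vector{UInt8} with String-compatible storage (internal API)
fill!(v, 0x61)             # fill with the byte for 'a'

# Convert to a String. Per the docstring quoted above, v should not be
# modified afterwards (newer Julia versions may also empty v at this point).
s = String(v)

@assert s == "a"^n
```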
