Help with excessive, unpredictable GC time from string allocations

ExpandingMan · March 3, 2022, 2:36am

I’ve been having a hell of a time with some performance disasters when loading a large number of strings from a file.

I am hitting this in my rewrite of a parquet reader and was caught rather by surprise, because despite microbenchmarks being excellent, I eventually discovered that my actual performance on large enough files was abysmal. I am by now pretty sure that the major remaining issue is that I cannot seem to load certain types of files without hitting unpredictable and excessive GC times.

At the core of the problem (I think) is a loop in which I iterate over a view of an array from which I read strings. The strings consist of a UInt32 giving the length of the string followed by the string itself with no padding or escape codes. You can find the loop here.

I have tried multiple versions of this. First, I was writing to an array consisting entirely of Strings so that I was creating a String from the views with String(view(v, a:b)). Obviously, this will result in allocations of the strings, though I’m not sure I understand why it seems to be leading to so much garbage collection since the strings are stored in an array and not garbage collected.

Next, as an optimization, I tried an alternate version which takes place in the same loop but for which convertvalue is merely the identity (i.e. I write views to an array of views). I later wrap this array in a special AbstractVector type that simply converts the views to strings on getindex. Much to my surprise, the performance of this is only slightly better, and, even more surprising, does not seem to alleviate the GC issues at all. Yes, it’s possible that the GC issues are not from here, but I can’t find where else it might be from and this place is certainly a huge performance bottleneck, so it’s hard to understand how the GC can be anywhere else.

I even tried to mitigate this by doing GC.enable(false) only around this loop, but unsurprisingly this doesn’t help much since everything now just gets GC’d outside the loop and there are many such loops. I deem it too risky to disable GC around the loop of loops.

The thing that indicates to me that GC is such a huge problem is that @time consistently reports that when loading a file with several of these columns which are very large, > 1/2 GC time.

Much to my extreme frustration, the original Parquet.jl does not seem to suffer from this issue by nearly the same degree, and I have so far been unable to determine why not. They are reading the same format, and they are doing basically the same thing I am here at this point in the loop, and they do not even have a version that does this with the views.

I have quite a lot of Julia experience but this by now has me pretty stumped. This is the biggest discrepancy I’ve ever encountered between microbenchmarks of the performance critical code and my overall performance; in fact I can’t really think of another case in 6 or so years of using Julia in which my apparently optimized code runs so incredibly badly. Any help or advice would be appreciated. Thanks all!

nilshg · March 3, 2022, 6:12am

You might want to speak to @oxinabox - she explained to me long ago on Slack what was going on with Strings. As I don’t work much with Julia internals my recollection is vague, but my understanding at the time was that Strings put undue pressure on the GC because they need to be heap allocated, and every GC run has to scan all Strings to work out whether they should be collected (so this is slow even if nothing gets collected in the end).

Some packages that work around this:

ShortStrings is what I used to use, but I think it has been superseded now
InlineStrings is what CSV.jl now uses by default when reading in String data
WeakRefStrigns has InlineStrings as well, plus some other types - you’d probably have to ask @quinnj to find out how this relates to just the previous package

Oscar_Smith · March 3, 2022, 6:13am

You should also try testing this with https://github.com/JuliaLang/julia/pull/44215

oxinabox · March 3, 2022, 12:07pm

Right, yes.
When you have tens of millions of “reference” variables, basically things that use pointers under the hood, they make the “Mark” step of the mark and sweep GC slow.
Reference variables are things that are not isbits, so Vectors, Strings, mutable structs, structs/tuples with abstractly typed fields etc; Or structs with fields that are reference types (like SubString referencing a String).
Basically how a mark and sweep GC works is: when ever GC is triggered it goes through all references that exist in memory and tried to find a path from them to something that is definately in scope, if it does then it marks them and then it does a sweep step that removed everything that isn’t marked.

This is good because rather than scaling with the number of times a variable is used (which is how reference counting scales), it scaled with how many variables are declared (which is naturally less than how many times it is used as the declaration is a single use).
But it is bad that it also scales (multiplicatively) with how often a GC is triggered.
So if you have a lot of references in memory for a long time it scales badly.

To combat this Julia uses a generation GC.
Basically rather than 1 pool of reference to mark and sweep it breaks references into two Pools (you can keep exending this BTW, java has 3).
It applies a heuristic that if something has stuck around for a while (2(?) mark and sweep steps in julia’s case) it probably is going to be sticking around for a while more.
So everything starts in the Young pool, and then things that stay around for a while graduate to the Old pool.
Since most variables don’t hang around for long, one can free up a lot of memory by only marking and sweeping the young pool.
But if that doesn’t get you what you need you will need to sweep the old pool too – but this should be rare.
So what should happen is if you allocate a ton of strings say at the start of your program they quickly graduate to the Old pool and then rarely impose load on the Mark and Sweep step.

However, Julia’s generation GC doesn’t work very well (but as @Oscar_Smith linked there is a PR to hopefully fix it).
The issue #40644 shows that sometimes julia will effectively sweep both pools every time – making it functionally non-generational

github.com/JuliaLang/julia

The GC often doesn't act generational

opened 11:00AM - 28 Apr 21 UTC

oxinabox

performance GC

Julia has a generational mark and sweep GC. One of the properties of a mark and… sweep GC is that it needs to go through and mark every reference object in memory to decide if it should be swept whenever the GC runs. This is very expensive if you have a lot of objects in memory, and the cost is incurred even if those objects are not being freed (or even read). Which is where the generational aspect comes in. It applies a heuristic to break the objects in memory into generations. Things that have been referenced for a long time (in some sense) are likely to continue to be referenced, and so are moved into the older generations. The generational GC should mark and sweep the young generations very often, and the older generations more rarely. This should mean if you have some large object with lots of references that are e.g. allocated at the start of your program and kept until the end of the program, it will make the GC slow at first since it is so large and takes so long to mark, but after a little while it will move into the older generations and very rarely be marked, and so cease to have an effect on the time it takes to do normal GCs. The following examples use `ZonedDateTimes` from `TimeZones@1.5.3` which are for now not `isbits` though that is being worked on (https://github.com/JuliaTime/TimeZones.jl/issues/271). You can create a similar example that uses any non-`isbits` objects such as mutable objects, things with abstract typed fields, or things with non-isbits fields (such as `String` or any `Vector`). This one I had handy. You can see in this first example the generational GC working. At first, after the `zdts` are allocated there is a sharp jump in the time spent on GC, up to around 90%. But then after a few rounds, the `zdts` have moved into the older generation, and the time shifts spent on GC goes back down nicely to what it was at the start: around 20% ```julia julia> # A function that creates a decent amount of garbage, little_burn() = sum(hash, ["trashfire"^min(1000, i) for i in 1:10_000]) little_burn (generic function with 1 method) julia> @time little_burn(); 0.091142 seconds (10.00 k allocations: 82.467 MiB, 20.53% gc time) julia> using TimeZones julia> zdts = [now(tz"UTC") for _ in 1:100_000_000]; julia> for i in 1:10 @time little_burn() end 0.566361 seconds (10.00 k allocations: 82.467 MiB, 89.26% gc time) 0.620691 seconds (10.00 k allocations: 82.467 MiB, 88.32% gc time) 0.051410 seconds (10.00 k allocations: 82.467 MiB) 0.603839 seconds (10.00 k allocations: 82.467 MiB, 90.91% gc time) 0.060557 seconds (10.00 k allocations: 82.467 MiB, 17.61% gc time) 0.066966 seconds (10.00 k allocations: 82.467 MiB, 18.17% gc time) 0.058870 seconds (10.00 k allocations: 82.467 MiB, 19.43% gc time) 0.060740 seconds (10.00 k allocations: 82.467 MiB, 15.33% gc time) 0.064070 seconds (10.00 k allocations: 82.467 MiB, 19.75% gc time) 0.060747 seconds (10.00 k allocations: 82.467 MiB, 21.03% gc time) 0.061915 seconds (10.00 k allocations: 82.467 MiB, 18.41% gc time) ``` However, it's very easy to run a different function that does not display the generational behavior `burn()` is identical to `little_burn` but runs for 10x longer and allocates 10x as much memory. Like `little_burn` it never actually touches the `zdts`. with `burn` however the GC time always stays at 60-80%. This is due to it triggering a full sweep of both generations of our GC. Thus it keeps needing to mark all of the big and complex `zdts` which is slow. ```julia julia> # A function that creates a decent amount of garbage, burn() = sum(hash, ["trashfire"^min(1000, i) for i in 1:100_000]) burn (generic function with 1 method) julia> @time burn(); 0.913665 seconds (100.00 k allocations: 863.183 MiB, 19.98% gc time) julia> using TimeZones julia> zdts = [now(tz"UTC") for _ in 1:100_000_000]; julia> for i in 1:10 @time burn() end 3.127583 seconds (100.00 k allocations: 863.183 MiB, 78.35% gc time) 2.073914 seconds (100.00 k allocations: 863.183 MiB, 64.82% gc time) 2.355717 seconds (100.00 k allocations: 863.183 MiB, 77.97% gc time) 1.842793 seconds (100.00 k allocations: 863.183 MiB, 66.04% gc time) 2.335667 seconds (100.00 k allocations: 863.183 MiB, 74.24% gc time) 1.804693 seconds (100.00 k allocations: 863.183 MiB, 66.61% gc time) 2.344848 seconds (100.00 k allocations: 863.183 MiB, 73.77% gc time) 2.396741 seconds (100.00 k allocations: 863.183 MiB, 74.44% gc time) 1.883768 seconds (100.00 k allocations: 863.183 MiB, 64.91% gc time) 2.262383 seconds (100.00 k allocations: 863.183 MiB, 77.07% gc time) 1.835220 seconds (100.00 k allocations: 863.183 MiB, 64.84% gc time) ``` I have seen real code that takes tens of seconds to do a full mark and sweep. We really don't want to be running that very often. (if we are running that kind of mark and sweep too often it might even be better to OOM error) It seems we need to do something to the GC to have it do a full mark and sweep less often. This might be tweaking the heuristics. It might be adding another generation. it might be both

ShortStrings basically represents a string not with a pointer to a block of memory (which would need to be tracked by the GC( but with a isbits value (like an UInt) that it reinterprets the bits out of.
InlineStrings is ShortStrings but implemented better, basically because @quinnj is smarter than me.
WeakRefStrings reexports InlineStrings.
WeakRefString also however has a number of other useful string types, including the titular WeakRefString.
The WeakRefString was common in older (versions of) packages but has gone away for safer constructs, however unlike much of what replace it, it is isbits so didn’t incur load on the GC.
Basically, it was a SubString but it didn’t actually contain a julia reference to the original string, rather it contained just a raw Ptr to a position in the string to start reading from (+ a length).

ExpandingMan · March 3, 2022, 3:04pm

Thanks all for your help, especially @oxinabox for her detailed explanation!

This has been very enlightening. I thought I understood what WeakRefStrings was for, but now I realize I did not. In particular, despite being aware of it I did not think it was useful to me here because I thought what I was doing with the views was equivalent to using WeakRefStrings, but now I see that this is not the case. I’m worried that it might be very tricky to use here because I have a view rather than a Vector{UInt8} and I’m not a priori sure whether there will be some trick involved in getting WeakRefStrings. I’ll give it a shot and report back here.

Edit: In retrospect, it won’t be hard to use WeakRefStrings here, this is guaranteed to always be either a Vector{UInt8} or a view of one. I’m going to make it so that there’s an option to choose one of two approaches: either allocate a new buffer containing the string data which gets copied into it via WeakRefStrings (this is implemented by WeakRefStrings.StringArray) or create a special output AbstractVector type which contains WeakRefStrings and exactly one reference to the original buffer, which will take up way more memory but obviate the need to allocate a whole new array.

Benny · March 3, 2022, 4:33pm

Let me check if I’m reading this correctly. So the marking step scales with the number of non-isbits “reference variables”, which would include both heap allocated objects and stack-allocated objects containing references, yes? What’s confusing about that to me is that I wouldn’t expect tracking 1000 arrays on the heap to be the same amount of work as 1 array on the heap and 999 views of it on the stack; once the 1st view is found, the array can just be marked.

Also, would this issue be improved at all by the recent work on escape analysis and eliding (heap) allocations?

Oscar_Smith · March 3, 2022, 8:20pm

this is half right. the part you’re missing is that what matters is the number of reference variables at the time gc is run. the advantage of putting variables on the stack is that they disappear when the function returns so GC doesn’t have to deal with the ones that have disappeared.

stevengj · March 3, 2022, 10:25pm

You could use something like StringViews.jl to treat a subarray view of your array as a string. Those views are all isbits types that don’t require heap allocations (or copying data out of our array).

ExpandingMan · March 3, 2022, 11:05pm

I don’t think this will work because the views themselves contain references and the references are causing the problem, not allocations. I already tried a vector of UInt8 views that was wrapped in an AbstractArray to turn them to String on request but was surprised to discover this barely seemed to help (surprised because at the time I thought it must have something to do with allocations).

Anyway, I have just implemented an alternative with copying using WeakRefStrings and this has provided a drastic improvement! I’m also going to implement a zero-copy alternative and I expect that will be even faster.

As a side note, in my case a small performance fix was needed for WeakRefStrings which I have fixed in a PR, so once that is merged Parquet2 strings will finally be reasonably fast.

Thanks again! (This was unbelievably frustrating before I understood what was happening, so this has been a huge relief.)

Benny · March 4, 2022, 12:56am

Now I look at the code again, it clicked that the references (views) are being put in an array on the heap, so my mentioning the stack wasn’t really relevant. If the 999 views were in a vector instead of the stack, my question still stands. I’m thinking I may be imagining the GC backward? I thought the GC goes through each object on the heap, so I imagined it’d mark only 2 things: the views vector because it’s in scope, and the viewed array because it’s referenced by the vector. But if the GC scales with references, then it must be checking the reference to the views vector and its 999 views instead?

Oscar_Smith · March 4, 2022, 1:17am

The way GC works is it goes through each object on the heap, but the way to find objects on the heap is to look at all the references from the stack to the heap.

stevengj · March 4, 2022, 2:25am

You could use UnsafeArrays.jl for the array view, and then wrap a StringView around this. This way the views contain no references (that Julia knows about) to the heap object (only a raw pointer). (You are responsible for holding on to the original array until you are done with the unsafe views.)

This way you never have to materialize a String object.

quinnj · March 5, 2022, 10:49pm

Just a point of clarification; the InlineStrings code was moved out of the WeakRefStrings.jl package into a stand-alone InlineStrings.jl package; the original InlineStrings code is still in WeakRefStrings.jl package to prevent any breakage, but is effectively “deprecated” and shouldn’t be relied upon. Only the InlineStrings.jl package has received bugfixes/improvements since the code was moved out.

Topic		Replies	Views
Redesigning String, optimising small strings and comparison Internals & Design strings	71	2636	May 3, 2025
What exactly is "allocation" in Julia? Performance question , memory-allocation	45	6087	November 4, 2022
GC occurs at the worst time in tight loop (Garbage Collection) Performance question	93	3323	November 7, 2023
Questions for a new string type and on GC/alignment/padding General Usage	6	739	December 21, 2016
Why is garbage collection called here? Performance	11	400	February 8, 2024

Help with excessive, unpredictable GC time from string allocations

Related topics