Help with excessive, unpredictable GC time from string allocations

I’ve been having a hell of a time with some performance disasters when loading a large number of strings from a file.

I am hitting this in my rewrite of a parquet reader and was caught rather by surprise, because despite microbenchmarks being excellent, I eventually discovered that my actual performance on large enough files was abysmal. I am by now pretty sure that the major remaining issue is that I cannot seem to load certain types of files without hitting unpredictable and excessive GC times.

At the core of the problem (I think) is a loop in which I iterate over a view of an array from which I read strings. The strings consist of a UInt32 giving the length of the string followed by the string itself with no padding or escape codes. You can find the loop here.

I have tried multiple versions of this. First, I was writing to an array consisting entirely of Strings so that I was creating a String from the views with String(view(v, a:b)). Obviously, this will result in allocations of the strings, though I’m not sure I understand why it seems to be leading to so much garbage collection since the strings are stored in an array and not garbage collected.

Next, as an optimization, I tried an alternate version which takes place in the same loop but for which convertvalue is merely the identity (i.e. I write views to an array of views). I later wrap this array in a special AbstractVector type that simply converts the views to strings on getindex. Much to my surprise, the performance of this is only slightly better, and, even more surprising, does not seem to alleviate the GC issues at all. Yes, it’s possible that the GC issues are not from here, but I can’t find where else it might be from and this place is certainly a huge performance bottleneck, so it’s hard to understand how the GC can be anywhere else.

I even tried to mitigate this by doing GC.enable(false) only around this loop, but unsurprisingly this doesn’t help much since everything now just gets GC’d outside the loop and there are many such loops. I deem it too risky to disable GC around the loop of loops.

The thing that indicates to me that GC is such a huge problem is that @time consistently reports that when loading a file with several of these columns which are very large, > 1/2 GC time.

Much to my extreme frustration, the original Parquet.jl does not seem to suffer from this issue by nearly the same degree, and I have so far been unable to determine why not. They are reading the same format, and they are doing basically the same thing I am here at this point in the loop, and they do not even have a version that does this with the views.

I have quite a lot of Julia experience but this by now has me pretty stumped. This is the biggest discrepancy I’ve ever encountered between microbenchmarks of the performance critical code and my overall performance; in fact I can’t really think of another case in 6 or so years of using Julia in which my apparently optimized code runs so incredibly badly. Any help or advice would be appreciated. Thanks all!

You might want to speak to @oxinabox - she explained to me long ago on Slack what was going on with Strings. As I don’t work much with Julia internals my recollection is vague, but my understanding at the time was that Strings put undue pressure on the GC because they need to be heap allocated, and every GC run has to scan all Strings to work out whether they should be collected (so this is slow even if nothing gets collected in the end).

Some packages that work around this:

  • ShortStrings is what I used to use, but I think it has been superseded now
  • InlineStrings is what CSV.jl now uses by default when reading in String data
  • WeakRefStrigns has InlineStrings as well, plus some other types - you’d probably have to ask @quinnj to find out how this relates to just the previous package
5 Likes

You should also try testing this with https://github.com/JuliaLang/julia/pull/44215

2 Likes

Right, yes.
When you have tens of millions of “reference” variables, basically things that use pointers under the hood, they make the “Mark” step of the mark and sweep GC slow.
Reference variables are things that are not isbits, so Vectors, Strings, mutable structs, structs/tuples with abstractly typed fields etc; Or structs with fields that are reference types (like SubString referencing a String).
Basically how a mark and sweep GC works is: when ever GC is triggered it goes through all references that exist in memory and tried to find a path from them to something that is definately in scope, if it does then it marks them and then it does a sweep step that removed everything that isn’t marked.

This is good because rather than scaling with the number of times a variable is used (which is how reference counting scales), it scaled with how many variables are declared (which is naturally less than how many times it is used as the declaration is a single use).
But it is bad that it also scales (multiplicatively) with how often a GC is triggered.
So if you have a lot of references in memory for a long time it scales badly.

To combat this Julia uses a generation GC.
Basically rather than 1 pool of reference to mark and sweep it breaks references into two Pools (you can keep exending this BTW, java has 3).
It applies a heuristic that if something has stuck around for a while (2(?) mark and sweep steps in julia’s case) it probably is going to be sticking around for a while more.
So everything starts in the Young pool, and then things that stay around for a while graduate to the Old pool.
Since most variables don’t hang around for long, one can free up a lot of memory by only marking and sweeping the young pool.
But if that doesn’t get you what you need you will need to sweep the old pool too – but this should be rare.
So what should happen is if you allocate a ton of strings say at the start of your program they quickly graduate to the Old pool and then rarely impose load on the Mark and Sweep step.

However, Julia’s generation GC doesn’t work very well (but as @Oscar_Smith linked there is a PR to hopefully fix it).
The issue #40644 shows that sometimes julia will effectively sweep both pools every time – making it functionally non-generational

ShortStrings basically represents a string not with a pointer to a block of memory (which would need to be tracked by the GC( but with a isbits value (like an UInt) that it reinterprets the bits out of.
InlineStrings is ShortStrings but implemented better, basically because @quinnj is smarter than me.
WeakRefStrings reexports InlineStrings.
WeakRefString also however has a number of other useful string types, including the titular WeakRefString.
The WeakRefString was common in older (versions of) packages but has gone away for safer constructs, however unlike much of what replace it, it is isbits so didn’t incur load on the GC.
Basically, it was a SubString but it didn’t actually contain a julia reference to the original string, rather it contained just a raw Ptr to a position in the string to start reading from (+ a length).

15 Likes

Thanks all for your help, especially @oxinabox for her detailed explanation!

This has been very enlightening. I thought I understood what WeakRefStrings was for, but now I realize I did not. In particular, despite being aware of it I did not think it was useful to me here because I thought what I was doing with the views was equivalent to using WeakRefStrings, but now I see that this is not the case. I’m worried that it might be very tricky to use here because I have a view rather than a Vector{UInt8} and I’m not a priori sure whether there will be some trick involved in getting WeakRefStrings. I’ll give it a shot and report back here.

Edit: In retrospect, it won’t be hard to use WeakRefStrings here, this is guaranteed to always be either a Vector{UInt8} or a view of one. I’m going to make it so that there’s an option to choose one of two approaches: either allocate a new buffer containing the string data which gets copied into it via WeakRefStrings (this is implemented by WeakRefStrings.StringArray) or create a special output AbstractVector type which contains WeakRefStrings and exactly one reference to the original buffer, which will take up way more memory but obviate the need to allocate a whole new array.

2 Likes

Let me check if I’m reading this correctly. So the marking step scales with the number of non-isbits “reference variables”, which would include both heap allocated objects and stack-allocated objects containing references, yes? What’s confusing about that to me is that I wouldn’t expect tracking 1000 arrays on the heap to be the same amount of work as 1 array on the heap and 999 views of it on the stack; once the 1st view is found, the array can just be marked.

Also, would this issue be improved at all by the recent work on escape analysis and eliding (heap) allocations?

this is half right. the part you’re missing is that what matters is the number of reference variables at the time gc is run. the advantage of putting variables on the stack is that they disappear when the function returns so GC doesn’t have to deal with the ones that have disappeared.

2 Likes

You could use something like StringViews.jl to treat a subarray view of your array as a string. Those views are all isbits types that don’t require heap allocations (or copying data out of our array).

I don’t think this will work because the views themselves contain references and the references are causing the problem, not allocations. I already tried a vector of UInt8 views that was wrapped in an AbstractArray to turn them to String on request but was surprised to discover this barely seemed to help (surprised because at the time I thought it must have something to do with allocations).

Anyway, I have just implemented an alternative with copying using WeakRefStrings and this has provided a drastic improvement! I’m also going to implement a zero-copy alternative and I expect that will be even faster.

As a side note, in my case a small performance fix was needed for WeakRefStrings which I have fixed in a PR, so once that is merged Parquet2 strings will finally be reasonably fast.

Thanks again! (This was unbelievably frustrating before I understood what was happening, so this has been a huge relief.)

2 Likes

Now I look at the code again, it clicked that the references (views) are being put in an array on the heap, so my mentioning the stack wasn’t really relevant. If the 999 views were in a vector instead of the stack, my question still stands. I’m thinking I may be imagining the GC backward? I thought the GC goes through each object on the heap, so I imagined it’d mark only 2 things: the views vector because it’s in scope, and the viewed array because it’s referenced by the vector. But if the GC scales with references, then it must be checking the reference to the views vector and its 999 views instead?

The way GC works is it goes through each object on the heap, but the way to find objects on the heap is to look at all the references from the stack to the heap.

1 Like

You could use UnsafeArrays.jl for the array view, and then wrap a StringView around this. This way the views contain no references (that Julia knows about) to the heap object (only a raw pointer). (You are responsible for holding on to the original array until you are done with the unsafe views.)

This way you never have to materialize a String object.

Just a point of clarification; the InlineStrings code was moved out of the WeakRefStrings.jl package into a stand-alone InlineStrings.jl package; the original InlineStrings code is still in WeakRefStrings.jl package to prevent any breakage, but is effectively “deprecated” and shouldn’t be relied upon. Only the InlineStrings.jl package has received bugfixes/improvements since the code was moved out.

2 Likes