I’ve been having a hell of a time with some performance disasters when loading a large number of strings from a file.
I am hitting this in my rewrite of a parquet reader and was caught rather by surprise, because despite microbenchmarks being excellent, I eventually discovered that my actual performance on large enough files was abysmal. I am by now pretty sure that the major remaining issue is that I cannot seem to load certain types of files without hitting unpredictable and excessive GC times.
At the core of the problem (I think) is a loop in which I iterate over a view of an array from which I read strings. Each string consists of a `UInt32` giving the length of the string followed by the string itself, with no padding or escape codes. You can find the loop here.
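To make the setup concrete, here is a minimal sketch of the kind of loop I mean (the name `readstrings` and the little-endian length decoding are assumptions for the sketch, not my actual code):

```julia
# Hypothetical sketch: parse length-prefixed strings from a raw byte buffer.
# Each record is a 4-byte little-endian UInt32 length followed by that many bytes.
function readstrings(v::AbstractVector{UInt8}, n::Integer)
    out = Vector{String}(undef, n)
    pos = 1
    for i in 1:n
        # decode the little-endian UInt32 length prefix
        len = UInt32(v[pos]) | (UInt32(v[pos+1]) << 8) |
              (UInt32(v[pos+2]) << 16) | (UInt32(v[pos+3]) << 24)
        pos += 4
        # materialize the string; this allocates a new String per element
        out[i] = String(view(v, pos:pos+Int(len)-1))
        pos += Int(len)
    end
    return out
end
```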
I have tried multiple versions of this. First, I was writing to an array consisting entirely of `String`s, so I was creating a `String` from each view with `String(view(v, a:b))`. Obviously, this will result in an allocation for each string, though I'm not sure I understand why it seems to be leading to so much garbage collection, since the strings are stored in an array and so are never actually collected.
Next, as an optimization, I tried an alternate version in the same loop for which `convertvalue` is merely the identity (i.e. I write the views themselves into an array of views). I later wrap this array in a special `AbstractVector` type that simply converts the views to strings on `getindex`. Much to my surprise, the performance of this is only slightly better and, even more surprisingly, it does not seem to alleviate the GC issues at all. Yes, it's possible that the GC issues are not from here, but I can't find where else they might be from, and this loop is certainly a huge performance bottleneck, so it's hard to understand how the GC time can be coming from anywhere else.
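For concreteness, the lazy wrapper looks roughly like this (the name `LazyStrings` is made up for this sketch; my real type differs in details):

```julia
# Hypothetical sketch of the lazy wrapper: store views into the raw buffer
# and only materialize a String when an element is actually accessed.
struct LazyStrings{T<:AbstractVector{UInt8}} <: AbstractVector{String}
    views::Vector{T}
end

Base.size(a::LazyStrings) = size(a.views)
# the per-string allocation is deferred to here, on access
Base.getindex(a::LazyStrings, i::Int) = String(a.views[i])
```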
I even tried to mitigate this by doing `GC.enable(false)` only around this loop, but unsurprisingly this doesn't help much, since everything now just gets GC'd outside the loop and there are many such loops. I deem it too risky to disable GC around the loop of loops.
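The pattern I tried was roughly this (a sketch with the actual loop elided; the `try`/`finally` is just there to make sure GC gets re-enabled even on error):

```julia
# Sketch: disable GC only around the hot loop, re-enabling it afterwards
# even if the loop body throws.
function load_column!(out, v)
    GC.enable(false)
    try
        # ... the string-reading loop over `v` goes here ...
    finally
        GC.enable(true)
    end
    return out
end
```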
The thing that indicates to me that GC is such a huge problem is that `@time` consistently reports more than 1/2 GC time when loading a file with several of these columns which are very large.
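To illustrate what I mean: even a trivial loop that allocates many small strings will show `@time` attributing a significant fraction of the run to GC (toy example, not my actual code; the exact percentage varies by machine):

```julia
# Toy allocation-heavy workload: @time will print a "% gc time" figure
# alongside the allocation count.
function make_strings(n)
    out = Vector{String}(undef, n)
    for i in 1:n
        out[i] = String(rand(UInt8, 16))  # one small allocation per element
    end
    return out
end

@time make_strings(10^6);
```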
Much to my extreme frustration, the original Parquet.jl does not seem to suffer from this issue to nearly the same degree, and I have so far been unable to determine why not. They are reading the same format, they are doing basically the same thing I am at this point in the loop, and they don't even have a version that does this with the views.
I have quite a lot of Julia experience, but by now this has me pretty stumped. This is the biggest discrepancy I've ever encountered between microbenchmarks of the performance-critical code and overall performance; in fact, I can't really think of another case in my 6 or so years of using Julia in which apparently optimized code ran so incredibly badly. Any help or advice would be appreciated. Thanks all!