Can this performance be improved?

I have used ProfileView.@profview to generate the attached flame graph. I have read and re-read the documentation, but I’m still not confident of my interpretation, or whether there is something I can improve. The longest red bar, near the top right of the flame graph, is for a function called “augmentCountIndex”, shown below. It adds 1 to the count of a particular ngram “fv” (a tuple of two UInt32s) and, if “indicesQ” is true, it also stores the index where it occurred. The first argument, “ngramDict”, can have anywhere from a few thousand entries to millions of entries. My code swings from taking three or so minutes to run, to a half hour, or even multiple days. Do the red bars mean Julia is doing dynamic dispatch? Is there something that will improve the performance? Thanks in advance for any help!

IdsSorted1 = Tuple{UInt32, Int32}
IdsSortedN = Tuple{UInt32, UInt32}
ngramIdsSorted = Union{IdsSorted1, IdsSortedN}[]

function augmentCountIndex(ngramDict, fv, tokId, indicesQ)
     if haskey(ngramDict, fv)
         ngramDict[fv].c += 1
         indicesQ && push!(ngramDict[fv].indices, tokId)
     else c = Int32(1)
         inds = indicesQ ? [UInt32(tokId)] : []
         ngramDict[fv] = CountIndex(c, inds) 
     end
end

It would be much easier for people here to help you if you provided a runnable benchmark workload.

2 Likes

As @jling said, it’s not easy to tell without a runnable example, but one possible culprit is this line:

inds = indicesQ ? [UInt32(tokId)] : []

since [] is a Vector{Any}. If you can give the empty collection a type, that might improve performance, though I can’t quite tell whether this is important in practice in your context.
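You can see the difference quickly at the REPL (nothing specific to your code, just the two literals):

typeof([])         # Vector{Any}  -- every element is boxed behind a pointer
typeof(UInt32[])   # Vector{UInt32}  -- concretely typed, stored inline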

3 Likes

From the profile view, there is a good probability that the performance can be improved because there is a wide red part of the profile at the surface of the graph (unstable or allocating code responsible for a large fraction of elapsed time).
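If you want to confirm that independently of the profile, @code_warntype on a representative call should flag the unstable parts; a minimal sketch, assuming you have representative argument values at hand:

# assumes ngramDict, fv, tokId and indicesQ are already bound to representative values
@code_warntype augmentCountIndex(ngramDict, fv, tokId, indicesQ)
# non-concrete inferred types (Any, abstract Unions) are highlighted in the output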

Again, a MWE would help a lot :slight_smile:

Are you talking about a wider program? There’s not enough detail to know what augmentCountIndex is doing in the larger picture, but it would be very surprising if such a simple method varied in runtime that much on its own. I can’t rule it out, but it’s hard to see how something as simple as extra runtime dispatch could account for swings from minutes to days.

Is this an intentional lack of a newline before an assignment, or did you intend it to be a condition, elseif c == Int32(1)? I’m not sure, because there isn’t a local or global c otherwise, but the assignment looks unnecessary; I would’ve expected you to just inline the value as CountIndex(Int32(1), inds).
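i.e. something like this for the else branch (just a sketch, keeping the rest of your function as is, and with the typed empty vector others have suggested):

else
    inds = indicesQ ? [UInt32(tokId)] : UInt32[]   # typed empty vector
    ngramDict[fv] = CountIndex(Int32(1), inds)     # inline the count, no temporary c
end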

Yeah, this sounds like the most believable culprit to me if augmentCountIndex is the big red bar in the flame graph.

Just to expand on this and make it a bit more actionable: the way to fix it would be to change the line like so

-inds = indicesQ ? [UInt32(tokId)] : []
+inds = indicesQ ? [UInt32(tokId)] : UInt32[]

The syntax UInt32[] means that the array can only hold elements of type UInt32, whereas [] can hold anything (thus causing type instability in CountIndex(c, inds)).
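For that to fully pay off, the fields of CountIndex also need concrete types. I don’t know your actual definition, but I’d guess it should look roughly like this sketch:

mutable struct CountIndex
    c::Int32                  # occurrence count (mutable struct so `.c += 1` works)
    indices::Vector{UInt32}   # token positions, only filled when indicesQ is true
end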

1 Like

There’s a bit too little info to get a full picture here, but there are some big performance red flags.

Firstly, whenever there is a bare, untyped [] in your code, it is likely to be slow. You should definitely avoid that.

Secondly, the fact that your dictionary, ngramDict, apparently accepts untyped arrays in its .indices field indicates that the dictionary itself is also not sufficiently typed.
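Concretely, I would expect the construction to look something like this (a sketch, guessing the key type from your description of fv as a tuple of two UInt32s):

ngramDict = Dict{Tuple{UInt32,UInt32}, CountIndex}()   # concrete key and value types
sizehint!(ngramDict, 1_000_000)                        # optional: pre-size if you expect ~1M entries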

Thirdly, I am skeptical that there is any point in keeping a count of the indices in the field .c, since that is both error-prone and serves no performance purpose. Just use length(ngramDict[fv].indices) instead of ngramDict[fv].c. That means, afaict, that the entire CountIndex type is obsolete; unless there is more functionality, you might just store the indices vector directly in ngramDict[fv].
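If you drop CountIndex entirely, the update collapses to something like this (a sketch; the name is just illustrative, and it assumes you always store the indices, which may or may not fit your memory budget):

function augmentIndex!(ngramDict::Dict{Tuple{UInt32,UInt32},Vector{UInt32}}, fv, tokId)
    inds = get!(ngramDict, fv) do
        UInt32[]                   # concretely typed empty vector for a new ngram
    end
    push!(inds, UInt32(tokId))
    return nothing
end
# the count for fv is then simply length(ngramDict[fv])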

Fourthly, is there any particular reason you are using UInt32 for the indices? It’s not necessarily wrong, but there are not many compelling arguments for doing so; normally, the native Int is used for indices, for various reasons.
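The one concrete upside of UInt32 here is memory, and it is easy to quantify (a quick check):

Base.summarysize(zeros(UInt32, 10^6))   # ≈ 4 MB
Base.summarysize(zeros(Int,    10^6))   # ≈ 8 MB on a 64-bit machine

So unless the index vectors dominate your memory use, plain Int is the simpler choice.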

2 Likes

Sorry for the delayed reply. I’ve been trying to construct a Minimum Working Example, but so far have not succeeded; my problem involves more than 1500 lines of code. I fixed the bare []. The CountIndex struct exists because I’m not storing indices for every ngramDict entry. (There are too many of them, and I’m already having problems with maxed-out memory and thrashing to swap on my little 8 GB RAM laptop. The ngramDict is on the order of ~1 million entries, and each entry could potentially have up to ~1 million indices, though I’m storing indices only for the 2% shortest entries. There are also a dozen other data structures on the order of ~1 million elements.) I am using UInt32s in lots of places because of the memory issues just described. I am continuing to work on a MWE…

1 Like

That is a LOT of stuff: 1M × 1M × 2% bits is 2.5 GB. One bit each is already excessively optimistic; just going up to a more reasonable one byte each would exceed my 16 GB of physical RAM, and larger data types and pointers due to abstract element types would only multiply the memory usage. Is there any conceivable way to work on a small fraction at a time, with the rest in storage? Do you have enough storage for the problem?
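Writing the estimate out (rough worst case, assuming the kept entries really do approach ~1M indices each and the indices are stored as UInt32):

nentries   = 10^6     # ngramDict entries
maxindices = 10^6     # worst-case indices per kept entry
kept       = 0.02     # only the 2% shortest entries store indices
nentries * maxindices * kept * sizeof(UInt32)   # 8.0e10 bytes, i.e. roughly 80 GB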

1 Like

That sounds challenging, indeed. But, as a start, did the advice you received so far help with the problem you described?

1 Like