Can this performance be improved?

I have used ProfileView.@profview to generate the attached flame graph. I have read and re-read the documentation, but I’m still not confident of my interpretation, or whether there is something I can improve. The longest red bar, near the top right of the flame graph, is for a function called “augmentCountIndex”, shown below. It adds 1 to the count of a particular ngram “fv” (a tuple of two UInt32s) and, if “indicesQ” is true, it also stores the index where it occurred. The first argument, “ngramDict”, can have anywhere from a few thousand entries to millions of entries. My code swings from taking three or so minutes to run, to a half hour, or even multiple days. Do the red bars mean Julia is doing dynamic dispatch? Is there something that will improve the performance? Thanks in advance for any help!

IdsSorted1 = Tuple{UInt32, Int32}
IdsSortedN = Tuple{UInt32, UInt32}
ngramIdsSorted = Union{IdsSorted1, IdsSortedN}[]

function augmentCountIndex(ngramDict, fv, tokId, indicesQ)
     if haskey(ngramDict, fv)
         ngramDict[fv].c += 1
         indicesQ && push!(ngramDict[fv].indices, tokId)
     else c = Int32(1)
         inds = indicesQ ? [UInt32(tokId)] : []
         ngramDict[fv] = CountIndex(c, inds) 
     end
end

It would be much easier for people here to help you if you provided a runnable benchmark workload.

2 Likes

As @jling said, it’s not easy to tell without a runnable example, but one possible culprit is this line:

inds = indicesQ ? [UInt32(tokId)] : []

since [] is a Vector{Any}. If you can give the empty collection a type, that might improve performance, though I can’t quite tell whether this is important in practice in your context.
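You can see the difference quickly at the REPL (nothing specific to your code, just the two literals):

typeof([])         # Vector{Any}  -- every element is boxed behind a pointer
typeof(UInt32[])   # Vector{UInt32}  -- concretely typed, stored inline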

3 Likes

From the profile view, there is a good probability that the performance can be improved because there is a wide red part of the profile at the surface of the graph (unstable or allocating code responsible for a large fraction of elapsed time).
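If you want to confirm that independently of the profile, @code_warntype on a representative call should flag the unstable parts; a minimal sketch, assuming you have representative argument values at hand:

# assumes ngramDict, fv, tokId and indicesQ are already bound to representative values
@code_warntype augmentCountIndex(ngramDict, fv, tokId, indicesQ)
# non-concrete inferred types (Any, abstract Unions) are highlighted in the output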

Again, a MWE would help a lot :slight_smile:

Are you talking about a wider program? There’s not enough detail to know what augmentCountIndex is doing in the larger picture, but it would be very surprising if such a simple method varied in runtime that much on its own. I can’t rule it out, but it’s hard to see how something as simple as extra runtime dispatch could account for swings from minutes to days.

Is this an intentional lack of a newline before an assignment, or did you intend it to be a condition, elseif c == Int32(1)? I’m not sure, because there isn’t a local or global c otherwise, but the assignment looks unnecessary; I would’ve expected you to just inline the value as CountIndex(Int32(1), inds).
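i.e. something like this for the else branch (just a sketch, keeping the rest of your function as is, and with the typed empty vector others have suggested):

else
    inds = indicesQ ? [UInt32(tokId)] : UInt32[]   # typed empty vector
    ngramDict[fv] = CountIndex(Int32(1), inds)     # inline the count, no temporary c
end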

Yeah, this sounds like the most believable culprit to me if augmentCountIndex is the big red bar in the flame graph.

Just to expand on this and make it a bit more actionable: the way to fix it would be to change the line like so

-inds = indicesQ ? [UInt32(tokId)] : []
+inds = indicesQ ? [UInt32(tokId)] : UInt32[]

The syntax UInt32[] means that the array can only hold elements of type UInt32, whereas [] can hold anything (thus causing type instability in CountIndex(c, inds)).
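For that to fully pay off, the fields of CountIndex also need concrete types. I don’t know your actual definition, but I’d guess it should look roughly like this sketch:

mutable struct CountIndex
    c::Int32                  # occurrence count (mutable struct so `.c += 1` works)
    indices::Vector{UInt32}   # token positions, only filled when indicesQ is true
end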

1 Like

There’s a bit too little info to get a full picture here, but there are some big performance red flags.

Firstly, whenever there is a bare, untyped [] in your code, it is likely to be slow. You should definitely avoid that.

Secondly, the fact that your dictionary, ngramDict, apparently accepts untyped arrays in its .indices field indicates that the dictionary itself is also not sufficiently typed.
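Concretely, I would expect the construction to look something like this (a sketch, guessing the key type from your description of fv as a tuple of two UInt32s):

ngramDict = Dict{Tuple{UInt32,UInt32}, CountIndex}()   # concrete key and value types
sizehint!(ngramDict, 1_000_000)                        # optional: pre-size if you expect ~1M entries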

Thirdly, I am skeptical that there is any point in keeping a count of the indices in the field .c, since that is both error-prone and serves no performance purpose. Just use length(ngramDict[fv].indices) instead of ngramDict[fv].c. That means, afaict, that the entire CountIndex type is obsolete; unless there is more functionality, you might just store the indices vector directly in ngramDict[fv].
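If you drop CountIndex entirely, the update collapses to something like this (a sketch; the name is just illustrative, and it assumes you always store the indices, which may or may not fit your memory budget):

function augmentIndex!(ngramDict::Dict{Tuple{UInt32,UInt32},Vector{UInt32}}, fv, tokId)
    inds = get!(ngramDict, fv) do
        UInt32[]                   # concretely typed empty vector for a new ngram
    end
    push!(inds, UInt32(tokId))
    return nothing
end
# the count for fv is then simply length(ngramDict[fv])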

Fourthly, is there any particular reason you are using UInt32 for the indices? It’s not necessarily wrong, but there are not many compelling arguments for doing so; normally, the native Int is used for indices, for various reasons.
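The one concrete upside of UInt32 here is memory, and it is easy to quantify (a quick check):

Base.summarysize(zeros(UInt32, 10^6))   # ≈ 4 MB
Base.summarysize(zeros(Int,    10^6))   # ≈ 8 MB on a 64-bit machine

So unless the index vectors dominate your memory use, plain Int is the simpler choice.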

2 Likes

Sorry for the delayed reply. I’ve been trying to construct a Minimum Working Example, but so far have not succeeded; my problem involves more than 1500 lines of code. I fixed the bare []. The CountIndex struct exists because I’m not storing indices for every ngramDict entry. (There are too many of them, and I’m already having problems with maxed-out memory and thrashing to swap on my little 8 GB RAM laptop. The ngramDict is on the order of ~1 million entries, and each entry could potentially have up to ~1 million indices, though I’m storing indices only for the 2% shortest entries. There are also a dozen other data structures on the order of ~1 million elements.) I am using UInt32s in lots of places because of the memory issues just described. I am continuing to work on a MWE…

1 Like

That is a LOT of stuff: 1M × 1M × 2% bits is 2.5 GB. One bit each is already excessively optimistic; just going up to a more reasonable one byte each would exceed my 16 GB of physical RAM, and larger data types and pointers due to abstract element types would only multiply the memory usage. Is there any conceivable way to work on a small fraction at a time, with the rest in storage? Do you have enough storage for the problem?
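Writing the estimate out (rough worst case, assuming the kept entries really do approach ~1M indices each and the indices are stored as UInt32):

nentries   = 10^6     # ngramDict entries
maxindices = 10^6     # worst-case indices per kept entry
kept       = 0.02     # only the 2% shortest entries store indices
nentries * maxindices * kept * sizeof(UInt32)   # 8.0e10 bytes, i.e. roughly 80 GB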

1 Like

That sounds challenging, indeed. But, as a start, did the advice you received so far help with the problem you described?

1 Like