Investigate slowness

This question is about how I should go about investigating why exactly my code is slow.

The scenario is that I have a large-ish (10GB) Vector{Any} that I want to iterate over, do some calculations on, and insert the results into another Vector{Any}, which is then returned. I don’t know the size of the output vector in advance; it’s smaller than the input, but still large-ish, think on the order of 25% of the input.

I run my current code with ProgressBars and I see that at the start I’m doing about 200k iterations per second, but towards the end (15 minutes in) I’m doing roughly 100k it/s. The first version of the code has the obvious problem that when inserting into a huge Vector{Any}, you will sometimes trigger enormous allocations and copies. Indeed, these are so large that I can see the ProgressBar hang for a couple of seconds before continuing, and these hangups get longer and longer as time progresses. So the symptoms I’m seeing are consistent with my mental picture of allocating and copying vectors.

So what I’ve tried is the following. While iterating the input, I instead insert into a Deque{Any}, and at the end of the iteration I instantiate an uninitialized vector with out = Vector{Any}(undef, len) and copy the elements over from the deque one by one:

len = length(out_deque)
out = Vector{Any}(undef, len)
for (n, i) in enumerate(out_deque)
    out[n] = i
end

According to the DataStructures.jl docs, deques are implemented as unrolled linked lists. My naive expectation is therefore that I should still get allocations, but the copies should no longer happen (or rather, no copy larger than one of the deque’s internal blocks should ever happen), and so the number of iterations per second shouldn’t go down.

Surprisingly, I still see the exact same symptoms. Iterations per second go down over time, I still see the same periodic hangups getting longer and longer, and I still start with 200k it/s and go down to 100k it/s.

Question: How can I investigate what’s going on?

My first question would be: does your machine have enough RAM to handle all your memory needs? What system are you using? Could it be just memory swapping?


First step to figuring out slow code should be to @profile it to find where the bottlenecks are (they might not be where you think).
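For reference, a minimal sketch of how this looks with the Profile stdlib; the workload `process` here is a made-up stand-in for the real code:

```julia
using Profile

# Hypothetical stand-in for the real workload.
function process(v::Vector{Any})
    out = Any[]
    for x in v
        x isa Int && push!(out, x + 1)
    end
    return out
end

v = Any[i for i in 1:10^6]
process(v)                   # run once so compilation isn't what gets profiled
Profile.clear()
@profile process(v)          # sample the actual run
Profile.print(maxdepth = 8)  # text report; ProfileView.@profview gives a flame graph
```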

Do the arrays really need to be of type Any? If the elements fall into a few different types, you could create, for example, a struct array, and even make it a subtype of AbstractArray with proper indexing while keeping the element types concrete. I know this doesn’t directly address your question, but for data of that size a more specialized approach is probably worthwhile.
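To illustrate the element-type point, here is a sketch with a small Union instead of Any; the event types `Trade` and `Tick` are made up for the example:

```julia
# Hypothetical event types; the real ones would come from the actual data.
struct Trade
    price::Float64
    size::Int
end

struct Tick
    bid::Float64
    ask::Float64
end

# With a small Union element type, Julia can union-split the loop body
# instead of boxing every element and dispatching dynamically, which is
# what a Vector{Any} forces.
function total_price(events::Vector{Union{Trade,Tick}})
    total = 0.0
    for e in events
        total += e isa Trade ? e.price : e.bid
    end
    return total
end

events = Union{Trade,Tick}[Trade(100.0, 5), Tick(99.5, 100.5)]
total_price(events)  # 100.0 + 99.5 = 199.5
```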


I don’t know if you’re running the code on a laptop. If you are, or if you have a PC with poor thermals, this could be thermal throttling. Keep an eye on the CPU frequencies.
Try just doing the calculation and throwing the result away, without saving it to the output; then there is no growing output array and no other reason for it to slow down.
If that never slows down, it could simply be a memory issue with swapping.
Or it’s doing heavy GC, but I don’t think that’s likely to have such a large impact on performance.


Debian. I don’t know how to check that though.


free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi        22Gi       4.5Gi       491Mi       4.3Gi       7.9Gi
Swap:         974Mi          0B       974Mi

This is what it looks like, and it looks the same whether the code is running or not. I have no idea how to interpret this.

I just read somewhere that you should make your swap twice as big as your RAM, so my swap is tiny? Why would that matter, though? Even with a 10GB object in memory, I’m only using about a third of my RAM.

Great suggestion, thank you, and what a nice new tool to play with. Following the docs, I’ve used Tim Holy’s ProfileView and came away quite surprised. I’ll try to figure out how to post the results in a bit.

EDIT: Here it is.

The red stuff on the left is my code, so that turns out to be a minority of the time. The three rows at the bottom are (from top to bottom):

task.jl, task_done_hook: line 494
task.jl, wait: line 768
task.jl, poptask: line 760

So I think it’s clear that in the big picture, my replacement of Vector with Deque is a tiny optimization, and there’s a much bigger issue somewhere. Also note my reply to ChunXi_Zhang, about the cpu frequency drops.


Found this issue [1] that sounds superficially related, but there’s a difference: my code doesn’t use threads (at least not that I know of; maybe some library code I use does?)

[1] most of the profile runtime is in task_done_hook unless threads = 1 · Issue #41713 · JuliaLang/julia · GitHub

I’m not on a laptop. I’ve just tried running it, and the temperature never left 50-60C.

At the same time I’ve run watch -n 2 lscpu and kept an eye on CPU MHz. It’s 3500 most of the time, but sometimes drops to 700-800?? I couldn’t be 100% sure whether such drops coincided with my code’s hangups, but it’s weird nonetheless.

A couple of ideas:

  1. Preallocate the output vector, say at 30% of the input size, and chop it to the correct length once you know it.

  2. Can you store the result of the calculations inside the input vector? If so, do that, then just record the indices you want to copy out at the end.
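A sketch of suggestion 1; the `keep` predicate is a placeholder for the real calculation:

```julia
# Preallocate at ~30% of the input, double the buffer if that guess is
# exceeded, and chop to the correct length at the end.
function filter_events(input::Vector{Any})
    out = Vector{Any}(undef, max(1, length(input) ÷ 3))
    n = 0
    for x in input
        keep = x isa Int && iseven(x)   # placeholder for the real calculation
        if keep
            n += 1
            n > length(out) && resize!(out, 2 * length(out))
            out[n] = x
        end
    end
    resize!(out, n)                     # chop to the correct length
    return out
end
```

This way reallocations happen only O(log n) times rather than on every growth step.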

A Vector{Any} should just be an array of pointers. When a copy occurs, it’s copying pointers, which is about as fast as you’re likely to get.

What are you doing with tasks in your code? Is this multi-threaded?

Certainly my code isn’t multi-threaded, but it’s possible that something else around it is? For example, maybe some Juno stuff has some degree of multi-threadedness? What seems clear is that only about 25% of the time is actually spent on my code.

For it to spend 75% of its time in wait makes me think something you do does a ccall to a multithreaded thing and then you’re just waiting for it to return. Is there some enormous computation you do by calling something else? I don’t understand how that profile makes any sense. I do vaguely remember having similar issues with the profiler a year ago. It might be that the profiler is confused.

Aha I found the thread: [solved] Agents.jl spending 3/4 of its time doing something before it even starts to compute? - #9 by dlakelan

The issue is that you’re probably running 4 threads, and the profiler samples all of them, so 3 out of 4 are doing nothing.

My code doesn’t do any of that. It’s as vanilla as you can imagine. The things in the Vector{Any} represent events, and I iterate through them one by one and update an internal state, sometimes emitting events. Even if I wanted to use multi-threading, which I don’t, I wouldn’t know how to solve my problem in a multi-threaded way.

Ah thanks for that. So this would just be a misleading picture. I’ll try starting julia with 1 thread and see how the picture looks then.

Yeah, but you probably have JULIA_NUM_THREADS=4 or 5 or something, and the profiler samples all of them, and all but one is doing nothing. So just zoom the picture to the one where it’s doing something, or yes, start with 1 thread and run.

Interestingly, I don’t have JULIA_NUM_THREADS in my main user’s env, but by inspecting Julia’s runtime I can see that it is indeed set to 4. Is that a default somehow? Or maybe it’s in some config file I edited at some point in the past? I’ve checked that I don’t have a ~/.julia/config/startup.jl, so that’s not it. It’s probably somewhere in the Atom settings. But I think you’re on to something.

EDIT: Just checked that when I start Julia from the command line, JULIA_NUM_THREADS isn’t set and Threads.nthreads() returns 1. So it’s probably Juno/Atom somehow.

EDIT2: This seems to confirm that Juno has this behavior: How to change the number of threads? - #9 by BeastyBlacksmith

EDIT3: Found it. In Juno it’s under Settings > Julia Options > Number of Threads. Set it to 1; re-running.

You were right. Now my code takes 100% of the time.

But there’s still so much wrong with this profile I don’t even know where to start :rofl:

That super tall tower in the middle is something in abstractinterpretation.jl calling itself recursively many times, it seems. It appears to be a small fraction of the total time, so I don’t know if it’s worth bothering with, but clearly something odd is going on there.

Not necessarily. Where is most of the time being spent? Also, you can post pictures directly to the forum; no need to use imgur. That’s much better because the forum will preserve them.