Why is push! still so slow?

Sukera · March 28, 2024, 9:25am

Because the possibility of throwing an error is an observable sideeffect of the computation. It’s pretty difficult to “roll back” other affected vectorized state that may have been computed as a result of the vectorization (and possibly reordering) when any one of the results would result in an error.

Not necessarily. Rust for example prevents these kinds of data races by design, because they can lead to pretty gnarly vulnerabilities. Any time you encounter a segfault in a program, you’re more or less encountering an opportunity for a security exploit, typically through an out-of-bounds write that can often be leveraged for arbitrary code execution, if an attacker is motivated enough.

Per se memory safe languages like Julia generally try to prevent this by having more high level memory management in the form of a GC. This can provide spatial safety (tracked/live objects can’t alias) but is not enough to give full memory safety in the presence of data races. Rust goes one step farther and uses its borrow checker to ensure there are no data races in your code (barring use of unsafe blocks).

That’s why this is so insidious: it won’t manifest until some other task has access to the input vector and resizes it while you are loading some data from the old memory location. pop!/push! themselves try to detect this (it won’t catch everything), but that can’t do anything about existing tasks reading from the old object. That’s why I’m saying that the optimizer really shouldn’t vectorize here, since it has no knowledge about whether the memory location of the vector changed since the last iteration.

sgaure · March 28, 2024, 9:45am

I agree that there could be all sorts of concurrency problems in asynchronous code, but that is still the user’s responsibility to avoid with excplicit atomics or locks or other methods? If another task pulls the rug, it will be gone, and segfaults or undefined behaviour will result. Otherwise it would be impossible to write hpc-code in julia. If the GC was compacting, I could understand such precautions, but nothing in core julia will pull your rug, you’ll have to do it yourself, and take precautions in your own asynchronous code?

On the other hand, I see a concurrency check in Base._growend!, with a reminder to use correct locks. That must be some debug code accidentally left there? Are there such things in many places?

Sukera · March 28, 2024, 10:58am

I’d personally like it not to be - humans are pretty bad at keeping the entire state space of a program in their head. I view programming languages as a tool to help me accomplish a task, so if the computer has the opportunity to prevent my mistakes, why shouldn’t I want that?

If it would always segfault, I’d maybe agree, but since it can just as well produce very wrong results instead, that’s not a risk I’m comfortable with. A data race like this can manifest just as well as a wrong-but-plausible-looking result in an analysis pipeline. Are you saying that HPC relies on these data races for its results, or are you saying that without the vectorization (that can be done with proper software architecture in a safe way!) HPC would no longer be high performance?

That’s not some debug check, that’s a sanity check for users of push!. If the check fires, the implementation has caught a detectable race condition in the user code, where they’re trying to push! to/resize the same vector from two different tasks at the same time. If either of those would go through, one of the two tasks would see a different result than they would have expected, where the just pushed value may not be in the vector at all.

sgaure · March 28, 2024, 12:29pm

If you really want to utilize the parallelism in the processor, you don’t want the compiler to insert safety measures. Indeed, in certain loops you don’t want if-tests at all, and certainly not inserted by the compiler without your knowledge, just because it feels that you somehow might be incompetent. Julia makes a selling point about HPC, but micro managing such things yourself is definitely part of HPC. Not that everybody does, but the problems in this thread are about an order of magnitude performance, in a quite simple loop. It matters.

That’s not some debug check, that’s a sanity check for users of push!. If the check fires, the implementation has caught a detectable race condition in the user code, where they’re trying to push! to/resize the same vector from two different tasks at the same time. If either of those would go through, one of the two tasks would see a different result than they would have expected, where the just pushed value may not be in the vector at all.

Absolutely, and in this particular case it does probably not hurt performance much, I’m not sure. But julia docs are quite clear that update of core structures are not thread safe, and the user must take precautions. It comes up now and then, about thread safe Dicts, thread safe this and that, but the reason nothing is thread safe is performance. If core julia starts babysitting, we risk that vectorization, instruction pipelining, prefetching, whatnot, suffers, with potentially disastrous performance hits, or elaborate workarounds. We must then go back to two-language development, linking in C routines, manually writing llvm and so on. Or ditch julia altogether.

Sukera · March 28, 2024, 6:00pm

Even at the cost of the code (potentially) no longer producing correct results?

I mean, I’m certainly familiar with the kinds of SIMD level optimizations one can and generally needs to do for ultimate performance near the machine limit, but that’s very different from the basic building blocks of a language being correct & reliable. For that usecase, the example in this post definitely isn’t representative; scaling the “computation” up you’re very quickly going to saturate your memory bandwidth before you saturate your compute units.

Julia makes a selling point about easier & correct HPC; the large number of packages enabling that on very high-level code without compromising correctness is testament to that.

I don’t find this (IMO needlessly hyperbolic) argument convincing. You can have your @inbounds and your @fastmath, but the default should IMO always focus on correct code first and foremost. If you can’t be sure that your code is computing the right thing, how can you claim that how fast you computed it matters? After all, if the result doesn’t need to be correct anymore, we could all just return const RANDOM_NUMBER = 4.0 and go home

uniment · March 28, 2024, 8:45pm

So in Julia we’re okay that normally adding two numbers s += i isn’t thread-safe for performance, but we’re willing to trade ~order-of-magnitude performance loss for push!(a, n) to be thread-safe. Do I understand this correctly?

adienes · March 28, 2024, 8:57pm

it does seem strange to have some operations be thread-safe but others not

if anything, I think making push! lock like this could be more likely to lead to errors. without, you might fail faster with some out of bounds index access error. with, you might get a false sense of security (but other concurrency bugs are still present)

foobar_lv2 · March 28, 2024, 9:11pm

push! / resize with concurrent other ops on the vector is not safe, it is a bad data race with unpredictable results.

The checks try to opportunistically sometimes detect this, and then throw an error.

The error is not thrown to avoid returning incorrect results. It is thrown such that users have a better chance of detecting the bug.

I am not an enemy of such checks – if they are mostly free, as they are, then they are nice. I mean, modern code is absolutely littered with such checks, under the heading of “security mitigations”.

No, we’re not – we’re not getting thread-safety, and we’re not paying orders of magnitude for the thread-safety that we’re not getting anyways. The code here is a complete outlier: It pushes a lot of values that need close to zero code to generate.

Normally, you only push! in a loop if you need to check a nontrivial condition for every value. Otherwise you use a sensible API (resize! followed by just setting the indices). Code with such nontrivial computations / condition doesn’t vectorize anyway!

If it turns out that real code patterns for push! exhibit the slowdown due to the opportunistic checks, then these checks will almost surely be removed.

sgaure · March 28, 2024, 11:13pm

Most definitely, yes. HPC is mostly about performance. Serial performance, simd, mimd, pipelining, what have you. If you’re messing around with shared memory parallel code you simply have to learn how to do it correctly. Including cache dynamics, false sharing, memory alignment, elementary atomics semantics, memory fences, locks, semaphores, barriers and other constructs. In julia there is more than enough work to avoid allocations in some such critical paths. If in addition there were a whole lot of babysitting to be circumvented it would be unbearable.

Sukera · March 29, 2024, 6:28am

The check is a single pointer compare, which should be dwarfed by the cost of copying the existing array data around to a new allocation Except for tiny arrays, that copying alone is going to hit I/O, so I highly doubt that this check will ever become a bottleneck.

I don’t know why you keep repeating this “babysitting” claim; I’m not arguing for anything of the sort. If you take my stance of “code should first and foremost be easier to write correctly” as “babysitting”, then I guess we’ll just have to agree to disagree. Using tools like OhMyThreads.jl, KernelAbstractsion.jl or Octavian.jl that take care of pretty much all of these kinds of details enables you as a developer to focus on what you actually care about; the computation, not the fiddly machine details that differ from machine to machine. I see the possibility for having these kinds of higher-level abstractions as one of the big points for julia by making it easier for people to write high performance code, and don’t view them as “babysitting”.

sgaure · March 29, 2024, 8:03am

Exactly. And the reason such packages can be written in pure julia is because julia does not have overhead to ensure thread safety at every turn. Remember, we’re talking about having julia reload the base address of an array in every iteration, just in case it was changed by some concurrent task. Every decent loop would have to be offloaded to C.

sgaure · March 29, 2024, 9:08am

Btw, here’s an interesting experiment. Take the original length_separate_loop, but add another argument flag.

function length_separate_loop(a, flag)
    length_separate = 0
    n = 0
    while n < 1000
        n += 1
        flag[] = true
        if n <= length(a.ref.mem)
            length_separate += 1
            a[length_separate] = n
        else
            error("out of bounds")
        end
    end
end

As we have seen, it vectorizes, and the behaviour is undefined if you change the Vector a concurrently. What exactly happens depends on how you change it, but it’s undefined. Not undefined in a particular way like it throws, segfaults, yields incorrect results or something, just undefined.

Now, if you pass the flag as a Ref(false) it still vectorizes, but if you pass flag as Atomic{Bool}(false), it does not vectorize. The optimizer discovers that you’re doing memory synchronization in the loop, and can’t take any shortcuts. It’s still undefined to change a concurrently. Something else will likely happen, but undefined is undefined. Hence, if you actually use a lock (which does memory synchronization) to protect access to a it will work as intended.

Sukera · March 29, 2024, 10:08am

If that were the case, Rust wouldn’t be as fast as it is I’m not claiming that every loop over an array needs to be sequential, to be clear. The compiler (usually) knows whether a given variable is task local or not (or rather, whether a function accesses only task-local memory!). What I’m saying is that in the absence of such information, such as when you’re benchmarking with a just-passed-in-variable, the compiler should be conservative and not vectorize (since, presumably, you want to benchmark how fast you got the correct result, not something that can be wrong).

This also directly leads to better algorithmic choices, since you’re then incentivized to chunk your input locally (thereby ensuring no task-based shenanigans), and/or mark the sections you deem safe to always index with @inbounds. For highest performance, it’s typical to copy some data around into a local buffer for better locality, alignment etc anyway (maybe even to get a layout that is even more amicable to vectorization, taking register pressure into account), and pushing people to those patterns that are better for performance (all while being safer) seems like a win to me.

The :release semantics of Threads.Atomic are only guaranteed when the store is paired with a corresponding :acquire read (see the LLVM docs on memory ordering). Since there is no corresponding :acquire in the loop, it should be legal to move this outside of the loop. LICM just doesn’t do that by default, and only does that for :unordered semantics instead.

sgaure · March 29, 2024, 10:34am

I still don’t see how length_separate_loop can yield the wrong result when vectorizing. Do you have an example? Changing a concurrently, is an undefined operation, how can it then be wrong? That is, you have no guarantees of anything. The result may happen to be what you expect, it may happen to be something else, or you get a segfault, or something nasty. You can expect anything from undefined operations. And it is well known that you must protect concurrent access with locks or other mechanisms.

sgaure · March 29, 2024, 11:00am

Does it really? I think it knows the type of the variable, not its whereabouts? When precompiling it’s even explicitly not a variable involved, only a type.

Sukera · March 29, 2024, 12:22pm

The compiler knows a lot more about the SSA form in its IR. Here:

julia> Base.infer_effects(length_separate_loop, (Vector{Int},))
(!c,?e,!n,!t,+s,?m,+u)

The m stands for inaccessiblememonly, which refers to whether the memory accessed by this function is only accessible by the task executing that function. The ? in front means “I don’t know”; that is, the compiler doesn’t explicitly know whether all of the used arguments (in this case, the input Vector{Int}) is from another task or the same one. What I’m saying is that, in my opinion, the compiler should be more cautious here and emit safe-by-default code.

The ! means that the compiler knows that this effect is tainted - e.g. for !c it means that the compiler knows that the result is not consistent, i.e. for different instances of Vector{Int} the function returns different results.

By hoisting the length check on a.ref.mem/vectorizing, the Memory backing the Vector may be swapped out while you’re iterating in your loop, which either results in mixing old & new data (bad, since that will likely not be the result you intended to compute, but does so silently) or segfault if the original memory has been freed (also bad, since that catastrophically aborts your computation), or result in a completely different result if the memory region of the original Memory has been reallocated & filled with something else.

If you don’t vectorize, you’ll at least get a nice BoundsError once the Memory has been changed out, since in the scalar version you need to check inboundsness either way. That’s much more friendly to debug than a segfault, because you get a stacktrace (and also more benign from a security POV).

Yes, exactly, and I haven’t disputed that! What I’m saying is that there are degrees of badness, with various levels of “ease of debuggability”. Especially when it comes to concurrent algorithms, I’ll take anything that makes it easier to debug over performancee. This is not opposed to using @inbounds or @fastmath (where appropriate), because you can easily add those after you’ve verified that your code otherwise works (adding them may still uncover some hidden bugs though).

So, if the array is truly local to the same task, I’m absolutely fine with the function vectorizing! There are lots of situations where the compiler (in principle) can know that. Propagating that information is just not always possible.

sgaure · March 29, 2024, 1:05pm

Ok, so you want the defaults so that we must annotate functions (or loops, or call sites) with something like @assume_effects :inaccessiblememonly if we want them to vectorize? That’s a fair wish.

mnemnion · March 29, 2024, 3:33pm

Somewhat related, I think: If someone wants to use push! instead of making Vector{Type}(undef, n) (there are reasons one might: unsure of the exact value, splatting) then sizehint! is available to preallocate the Vector to avoid those copies. Probably worth testing the OP’s benchmark code with sizehint! and seeing how much that helps.

Sukera · March 29, 2024, 3:45pm

The OP is already using sizehint! for the push! version though

mnemnion · March 29, 2024, 4:10pm

Ah yes! So it is. ¯\_(ツ)_/¯

Topic		Replies	Views
Improving my mental model: Why is `sizehint!()` so much slower than `zeros()`? Performance	9	1566	March 31, 2022
Thread-safe array building General Usage multithreading	21	7551	October 24, 2017
Push! and insert! very slow? (julia 1.11) (EDIT: No; just one-time allocations.) Performance	21	385	November 12, 2024
Documenting performance model of `empty!`, `sizehint!`, `push!` & friends Internals & Design proposal	14	3720	May 2, 2019
How to create a vector of variable length in julia? General Usage	7	11638	January 25, 2019

Why is push! still so slow?

Related topics