Array storage isn’t based on isbitstype; it’s based on mutability.
Edit: I was mistaking isbitstype for primitive types. The correct phrasing is that all structs that aren’t mutable and don’t have mutable fields are isbitstype and can thus be stored inline inside arrays.
I admittedly can’t check right now, but I’m fairly certain primitive types are allocated inline as well. After all, they are immutable and defined by their bit pattern.
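A quick REPL sketch (not from the original posts) illustrating the distinction:

isbitstype(Int)      # true: primitive type, immutable and defined by its bit pattern
struct P; x::Int; y::Int; end
isbitstype(P)        # true: immutable struct whose fields are all isbits, stored inline in arrays
isbitstype(BigInt)   # false: mutable struct, array elements are stored as references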
To assess overall performance impact, let’s measure the entire pascal function with different initializations. Here’s a version of pascal that lets you provide an initialization function:
function pascal(c::Function, ::Type{T}, n::Integer) where {T<:Real}
    A = c(T, n) :: Matrix{T}
    A[:, 1] .= one(T)
    A[1, :] .= one(T)
    for j = 2:n, i = 2:n
        A[i, j] = A[i-1, j] + A[i, j-1]
    end
    return A
end
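For example, calling it with a zeros-based initializer builds the n×n Pascal matrix of binomial coefficients:

pascal((T, n) -> zeros(T, n, n), Int, 5)   # 5×5 Matrix{Int} of binomial coefficients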
Here’s a benchmarking function for getting the relative slowdown of initializing with a shared zero value and with a copied zero value:
using BenchmarkTools
function brel(::Type{T}, n::Integer) where {T<:Real}
    tu = @belapsed(pascal((T,n) -> Matrix{T}(undef, n, n), $T, $n))
    tz = @belapsed(pascal((T,n) -> zeros(T, n, n), $T, $n))
    tc = @belapsed(pascal((T,n) -> [zero(T) for i=1:n, j=1:n], $T, $n))
    tz/tu, tc/tu
end
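Usage would look something like this (the exact ratios depend on the machine and on n):

brel(Int, 100)      # (tz/tu, tc/tu) for Int
brel(BigInt, 100)   # (tz/tu, tc/tu) for BigInt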
The huge difference between zeros and comprehensions for Int seems like a matter of insufficient optimization: Int is a value type, so the comprehension version should, in theory, be possible to do as efficiently as zeros. However, that is a tricky optimization, since you probably have to recognize that you can just fill the memory with a single bit pattern. It could be done, but it's not so easy. In any case, here we see that initialization does have a significant impact on overall performance.
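For illustration, here is roughly what that hand-written optimization would amount to for an isbits element type whose zero is the all-zero bit pattern (a sketch, not what the compiler actually emits today):

function zeros_via_fill(::Type{T}, n::Integer) where {T}
    A = Matrix{T}(undef, n, n)   # allocate without touching the contents
    fill!(A, zero(T))            # one pass writing a single repeated bit pattern
    return A
end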
Luckily, I think, initialization is usually not the dominant cost in a computation - and if it is, that’s a pretty good sign you are using a dense structure when you should be using a sparse one.
I don’t buy this sparsity argument. Ultimately, whether a data structure is dense or sparse, there’s a plain old dense array at the bottom that has to be allocated, and if we force it to also have an expensive initialization step for memory that’s just going to be overwritten, that’s wasted time and makes things slower. Whether the structure is dense or sparse is not really relevant. Perhaps you’re thinking of sparse data structures starting out empty and being grown incrementally; for example, doing this with sparse matrices:
using SparseArrays
n = 10^6
I = rand(1:n, 10^4)
J = rand(1:n, 10^4)
@time let
    S = spzeros(BigInt, n, n)
    for (i, j) = zip(I, J)
        S[i, j] = 1
    end
end
This does indeed start out with empty internal data structures and expand them one element at a time, growing them as necessary, which avoids the question of what value to initialize the internal arrays with since you have a value by the time you have to grow an array. However, this is also a notoriously slow way to initialize a sparse matrix—on my system, this code takes 0.84 seconds. If we instead use the sparse constructor, it’s much faster—doing sparse(I, J, big(1), n, n) takes only 3.8 milliseconds, 222x faster than incremental construction. Why is this version faster? Because it can pre-allocate the internal arrays with the right size instead of growing them incrementally. And it’s not just a small improvement—it’s two orders of magnitude.
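For reference, the constructor version of the same experiment looks like this (timings will of course vary by machine):

@time S = sparse(I, J, big(1), n, n)   # pre-sizes the internal index and value arrays up front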
So even for sparse data structures, we want to allocate (uninitialized) arrays as much as possible up front and then fill them with the right values, which gets slower if we have to initialize the arrays with a default value before writing the final values. How much worse is this? Fortunately, this is easy enough to measure in Julia by patching the core sparse method, which I do in this benchmark script. Results:
# Initialization: none
BenchmarkTools.Trial: 79 samples with 1 evaluation.
Range (min … max): 60.279 ms … 71.610 ms ┊ GC (min … max): 0.81% … 4.61%
Time (median): 63.709 ms ┊ GC (median): 5.12%
Time (mean ± σ): 63.852 ms ± 1.853 ms ┊ GC (mean ± σ): 5.07% ± 0.51%
▁▃▆▁ ▁ ▃▁▆█ ▁
▄▁▁▁▁▁▄▁▇▇▇▄████▄█▇▇▇▄████▇▇▇▄▄▄▄█▇▇▄▁▁▄▁▇▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▇ ▁
60.3 ms Histogram: frequency by time 69.1 ms <
Memory estimate: 232.70 MiB, allocs estimate: 22.
# Initialization: zeros
BenchmarkTools.Trial: 75 samples with 1 evaluation.
Range (min … max): 64.880 ms … 71.832 ms ┊ GC (min … max): 5.18% … 4.67%
Time (median): 67.445 ms ┊ GC (median): 5.07%
Time (mean ± σ): 67.511 ms ± 1.447 ms ┊ GC (mean ± σ): 5.08% ± 0.52%
▂ ▂ ▅▂ ▂ ▂ █ ▂
█▁▁▁██▅▅▅▁▁▁▅█▅██▁▅▁▅██▅█▅█▅█▅▅██▁▅████▅▁▁▅▅▅██▁█▁▁▁▁█▁▅▅▅▅ ▁
64.9 ms Histogram: frequency by time 70.2 ms <
Memory estimate: 232.70 MiB, allocs estimate: 24.
# Initialization: comprehension
BenchmarkTools.Trial: 68 samples with 1 evaluation.
Range (min … max): 66.618 ms … 88.653 ms ┊ GC (min … max): 4.71% … 8.73%
Time (median): 73.164 ms ┊ GC (median): 10.21%
Time (mean ± σ): 73.644 ms ± 3.148 ms ┊ GC (mean ± σ): 10.12% ± 0.73%
▆▆▂ ▂▄ ██ ▄
▄▁▁▁▁▁▁▁▁▁▁▁▁▄█████████▆██▁▆▆▁▄▁▁▄▁▁▁▄▁▁▁▁▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▄ ▁
66.6 ms Histogram: frequency by time 85 ms <
Memory estimate: 236.51 MiB, allocs estimate: 200022.
If we compare minima, then array initialization with zeros is 7% slower and with comprehensions is 10% slower. If we compare medians or means, then zeros is 5% slower and comprehension is 14% slower. Either way, those are not insignificant performance hits. I think this shows that even using sparse data structures you don’t escape from the initialization issue.
Since my stance in this whole thread is very much negative, I’d also like to add what I think could be a workable way to eliminate undef elements in arrays without sacrificing generality or performance. You introduce an UndefArray type, which is like the current Array type, and change Array to be guaranteed fully initialized. When you construct an Array, the API would be something like this:
a = Array{T}(m, n) do u::UndefArray
    # u is uninitialized, initialize it here
end
# a is guaranteed initialized
This method would, in principle, scan through u after the user callback has been run, make sure that there are no more uninitialized values, and then reuse the same memory that u references to construct a and return it. What are the challenges with this?
The main one is to try to make sure that the compiler can avoid the undef check in most cases where it’s clear that the initialization closure has already defined everything.
Another issue is that closures can have performance problems, but that’s something we’ve been improving and will continue to improve, so it's maybe not the biggest issue.
I had originally been thinking that ensuring that u doesn’t escape was an issue, but I don’t actually think that’s a problem. If someone hangs onto u they just have a view of a’s memory that doesn’t know that everything is defined.
Smaller arrays are allocated inline right after the array object header. If we do that with u then we would have to replace the u object header with the a object header. But I think instead we could allocate a with inline data, then create u sharing that data, and finally return a.
This is a pretty good API. I’m not sure if the compiler can be smart enough to eliminate checking for undefs in common cases. That’s the big question mark here. If it can’t then this still leaves performance on the table. If it can, then this would be a way to eliminate undef checks for normal arrays.
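For concreteness, using the proposed constructor might look something like this (purely hypothetical, since UndefArray and this Array method do not exist today):

A = Array{Float64}(4, 4) do u
    for j in axes(u, 2), i in axes(u, 1)
        u[i, j] = i + j   # every element is written before the callback returns
    end
end
# A is a fully initialized 4×4 Matrix{Float64}; if the compiler can prove every
# element was written, the post-callback undef scan could be skipped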
If we didn’t have to worry about backwards compatibility, we would just delete the Array{T}(undef, dims...) method and add the Array{T}(c::Function, dims...) method. However, we can’t do that without breaking people’s code. We need Array{T}(undef, dims...) to keep working and if we’re going to change Array to be always fully initialized, then we need a value of T to initialize those arrays with. If types were required to provide a default(T) method, then we could use that, but we also can’t require that since that would also be a breaking change. It’s tempting to try to do something where Array{T} is guaranteed to be initialized if default(T) is defined, but that’s not a static property. What if someone created an uninitialized Array{T} object before the default(T) method was defined? Then you have a situation where code gen produces code that assumes all Array{T} objects are fully initialized, but you have an Array{T} object that isn’t. Perhaps it would be acceptable for that to simply segfault the process.
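A hedged sketch of what the default(T) idea could look like (default is not an existing Base function; this is purely hypothetical):

# default(::Type{Int})    = 0
# default(::Type{BigInt}) = BigInt(0)
# Array{T}(undef, dims...) could then fill new arrays with default(T) whenever such a
# method exists, falling back to the current undef behavior otherwise, with the
# staleness problem described above if the method is defined after arrays were created.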
How do we make sense of this? Is it reasonable to say that:
cost of Matrix{T}(undef, n, n) is about C + cn^2 for some larger constant C (cost of the malloc call) and smaller constant c (null-pointer initialization),
cost of [ zero(T) for i=1:n, j=1:n ] is about C + C’n^2, where C’ > c is another constant (allocation and initialization of a BigInt plus storing a pointer to it in the array)?
Then the slowdown is (C + C’n^2)/(C + cn^2), which tends to 1 as n → 0, and to C’/c as n → infinity. Is this a reasonable explanation, or is there some other phenomenon significantly involved here? What is the asymptotic ratio C’/c?
Not that this is important, but the factor of 2.5 could probably be improved quite a bit. Currently _new_array performs null-pointer initialization (with memset), and then a second pass sets the pointers to the BigInt zero. Here the null-pointer initialization could be skipped (replaced by the second pass).
Ok, I can understand this.
Yes, absolutely, this seems much harder than arrays - which is why I stuck to arrays. I would think that “reducing undef” is useful even if undef remains in the language.
I’m not suggesting any change in Julia. My point was only that Julia does not (to the best of my knowledge) allow every conceivable way of placing objects in arrays.
But let me deviate briefly from the topic: why must that work? For example, maybe we could use mutable value struct A to denote that A is a mutable value type. Compare with C:
typedef struct { int el; } A;
int main()
{
    A a = {.el = 1};
    A b[1] = {a};
    b[0].el = 2;
    return a.el == 2 ? 0 : 1;
}
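A hypothetical Julia rendering of the same program, using the suggested (non-existent) mutable value struct syntax, purely for illustration:

# mutable value struct A
#     el::Int
# end
# a = A(1)
# b = [a]        # the array stores its own inline copy of a
# b[1].el = 2    # mutates the copy held by the array
# a.el == 1      # still true: a is unaffected, matching the C program (which exits with 1)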
Interesting numbers. How do you explain that tc/tu is so large for BigInt?
Since the tu version constructs about n^2 BigInts, and the tc version constructs about 2n^2 BigInts, I would have guessed the slowdown would be at most ~2.
Yes, I don’t fully agree with myself either.
I still maintain that the first part is true, that initialization is usually not the dominant part. But I admit that there are cases (such as your Pascal example) where very little computation is done, and for those cases (as you’ve shown in the benchmarks) the overhead can be quite large. However, I was mainly reacting to the extreme numbers in your earlier benchmarks (stating an 89x slowdown).
Clearly, you can end up paying a lot for the initialization if you are using a dense structure when you should have used a sparse one. But yes, the converse need not be true.
I like this suggested design, obviously.
Perhaps stating the obvious, I think there are benefits to this other than “eliminating a check and a predictable branch”. Just to take one example (I’ve already tried to tie it to “simplicity” of the language):
If the unthinkable happens and someone forgets to initialize their vector, this can result in an exception “far from” the cause (perhaps giving the appearance that something is wrong in an innocent, unrelated package). It makes for a better user experience if the error is detected earlier.
Array intentionally strikes a balance between performance and convenience. With the advent of finer-grained control over memory instantiation (https://github.com/JuliaLang/julia/pull/51319), it might be better to discuss lower-level ways of achieving some of these features/performance gains. Perhaps the lack of as many built-in interfaces for these new memory types leaves some wiggle room to explore what’s being discussed here independently.
Just to be clear, I am not (in this thread) trying to argue for some change in Julia. (Yes, personally I don’t particularly like that reading an element in a::Vector{Vector{Int}} involves a runtime check, and can throw – but that’s clearly not a reason for the language to change.)
Some answers gave me the vibe that initialization at creation was practically impossible, or went against the “Julian way”. I didn’t understand this, and wanted to better understand the obstacles.
Perhaps this is because of GC?
The tc version creates about 2n^2 objects, and GC may clean up to n^2 of them.
The tu version creates about n^2 objects, and GC cannot clean up a single one.
Thus, the tc version can be more than a factor of 2 slower than tu. Is this a reasonable explanation?
If we instead measure the cost of both constructing the data and GC’ing it, then tc/tu should be no more than ~2 (?). I tried to check this by adding a call to GC.gc(), but from that I got nonsensical results.
It is very possible and idiomatic: we already have ready-made ways for common initialization patterns like comprehensions, fill, zeros, ones, etc., and if those somehow can’t do what you want and thus waste precious runtime, the undef allocation step is exposed at the Julia level so that you can initialize however you want in your own methods. It’s not that there are obstacles to eliminating undefined references; rather, people don’t want to eliminate something they sometimes need to work with. The only meaningful downside I’ve seen so far is that the undefined-reference error code must be compiled and shows up in code reflection, but it takes up minuscule memory and has no runtime cost thanks to branch prediction (the branch is especially predictable because only one of its sides lets the program continue, while the other halts it to tell the user they need to fix their initialization code).
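The ready-made patterns mentioned above, side by side (a small sketch):

zeros(Int, 5)                  # five zeros
ones(Float64, 5)               # five ones
fill(BigInt(7), 5)             # five references to the same BigInt object
[BigInt(i)^2 for i in 1:5]     # comprehension: five independent BigInts
Vector{BigInt}(undef, 5)       # undef allocation: assign each element before reading it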
As already talked about, this latter user feature does come with the cost of user mistakes that throw errors or give undefined values, but unsafe features like eliding bounds-checks are present in any language. Just don’t make these mistakes, or avoid the features entirely if you don’t think you can avoid making such mistakes.
Yes, introducing a new kind of struct would solve the problem of storing mutable structs inline in arrays too - at the cost of having to know the exact type of an object to (reasonably) guess how the program will behave. I personally don’t think that’s worth it.
C, in contrast, only has value semantics and distinguishes the “is this a reference” case by actually having a pointer in the array exposed to the user. Notably, this also doesn’t get around the initialization problem; having a type in C that must never have an all-zero bit pattern and trying to initialize an array of such objects with calloc is unsafe just the same.
I originally thought of using an array of pointers initialized with calloc as the example of something that isn’t actually initialized, but as it turns out, the standard doesn’t actually guarantee that the all-zero bit pattern written by calloc is a null pointer. C sure is fun…
And the cost of unsafe or inefficient memory usage. Let’s say a function call makes a large array of inlined mutable instances, then returns 3 random elements. Here are a couple of unsavory possibilities.
The array is freed afterward when the garbage collector finds that the program no longer holds any reference to it. Now those 3 elements have lost their underlying data; they are dangling pointers.
The array is not freed. Those instances each have to track their origin array, like views do, until they are freed, so the garbage collector knows when to free the array too. Your RAM usage skyrockets because the garbage collector is incapable of freeing many such containers.
And here is the obvious impossibility:
We automatically copy the 3 elements upon return so they’re not affected by the array being freed. But we can’t specify if or when that happens. Some method call, possibly introduced by a macro, might’ve kept a reference to the array, e.g. by caching it. Only the garbage collector does the work of tracking referenced memory, and it doesn’t run at every return statement. So when the array isn’t freed, or hasn’t been freed yet, we have 2 independent copies of each mutable instance, defeating the purpose.
Memory management is not manual in Julia for a balance of safety and efficiency; we cannot go as far as arbitrarily declaring where we want something allocated. It’s easy to think that we have limited indirect control because immutables go on the stack or inline while mutables go on the heap, but that’s only a rule of thumb, though one useful enough to optimize allocations in practice. The compiler can decide otherwise; impractically large immutables cannot go on the stack, and some fixed-size mutables with determined lifetimes can go on the stack. If you want real control over memory, use C/C++. If you want some extra rules and compiler warnings to keep you safer, use Rust.
In Julia, forget about mutability for inline elements; it’s not even necessary. Instead of having several variables access the same mutable instance and share its changes, you can just index into the array when you need the value. In a way, the array and an index can serve the same purpose as a pointer, and indexing replaces dereferencing. To alter an element, just reassign it; Accessors.jl provides macros that let you use mutation-like syntax, which is easier than writing full constructor calls. Mutability is not simply about changing values but about sharing a distant chunk of memory, and reassignment is more often than not appropriate for altering values (and compiles to what would be considered mutation in languages with a variable-pointer memory model).
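A minimal sketch of that pattern, assuming a plain immutable element type (the Point struct here is just an example):

struct Point
    x::Float64
    y::Float64
end

pts = [Point(i, 0.0) for i in 1:3]   # Vector{Point}: isbits elements stored inline

# "Mutate" the second element by building a replacement and reassigning it;
# the assignment stores the new bits in place inside the array:
pts[2] = Point(pts[2].x, 42.0)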
The main perf problem of push! is that it’s a call into libjulia that can’t be inlined. Ideally we’d move to a model where libjulia is shipped with included bitcode (did someone say ThinLTO?) such that the JIT can cross-optimize that. But that’s a hell of a build change.
One can very easily write an optimized branch by pointer-slinging (cf. push!/pop! always does a ccall · Issue #24909 · JuliaLang/julia · GitHub). The ugliness here is that the hand-written code depends on the data layout of Vector, which is dependent on the C world, i.e. it would be super brittle to make that work on all systems (e.g. 32-bit). “Portable C” it is not.
Less ugly but more work would be to have a branch in the ccall intrinsics (like e.g. isdefined) that does the fast variant. Then one needs to write the C++ code to emit the correct llvm nodes. I think this is the most realistic fix; it just needs someone who is slightly knowledgeable about codegen to put in the time. I think push! is important enough for that, but it’s not my time that I’m proposing to put in.
Can we do that, realistically speaking? I.e. don’t branch in the code, instead emit a memory read, let the CPU trap, and then throw the exception? That could make the thing “mostly free”.
The three issues I see are:
We’d need to do the trap-based exception handling for this specific case (we currently only do setjmp/longjmp based exception handling). But llvm does support it, so there is a chance.
We’d need to properly tell llvm not to remove the unused memory read: we’d be emitting it for its side effect of maybe trapping. One can always go volatile, but that prevents almost all instruction reordering.
If we do a volatile read, then this is probably no big perf win compared to the branch (and possibly much worse), e.g. it still prevents SIMD.
I think keeping the null pointer and chugging on with it until something blows up somewhere is semantically unacceptable – then we would get an exception at an unpredictable place in the code, and everyone would need to check for null-ness everywhere, Java style.
It’s already free: the branch is perfectly predictable and the null check is dwarfed by memory loads. My comment is not a suggestion that we change UndefRefError to be literally implemented using a segfault; it is pointing out that if @tmpo wants it to be a segfault, they can already just think of UndefRefError as a kind of segfault.