Is accessing an `undef` array undefined behavior?

And after some digging by @jar1 on slack, this article came up:

Which I wholeheartedly agree with & is why I don’t want to call this kind of thing “undefined behavior”.

6 Likes

Unfortunately, that post is also sloppy and most standards distinguish between undefined, unspecified and implementation-dependent behaviour (see this Stackoverflow post for instance).
The Common Lisp Hyper Spec – to quote from a standard of a non C-like language – gives a similar definition:

  • Unspecified: This means that the consequences are unpredictable but harmless. Implementations are permitted to specify the consequences of this situation. No conforming code may depend on the results or effects of this situation [...]

  • Undefined: This means that the consequences are unpredictable. The consequences may range from harmless to fatal. No conforming code may depend on the results or effects. Conforming code must treat the consequences as unpredictable. [...] An implementation is permitted to signal an error in this case.

Thus, given that the following program is unpredictable

v = Vector{Int16}(undef, 3)
if sum(v) > 0; "y" else "n" end

I guess that one could call it undefined behaviour.

For most newer languages, which don’t have a standard and multiple more or less conforming implementations, the distinction between unspecified and undefined is less useful, but undefined in the sense that No conforming code may depend on the results or effects is certainly a valid terminology to warn and prevent usage of certain constructs in certain context. I.e., to be precise, in the above example, the undefined behaviour would not be in the construction of the array – which always gives a valid instance – but rather the access of its elements – as it cannot be relied on what properties those instances have.

Maybe adding to the confusion, the Rust documentation on binary_search_by states that If the slice is not sorted [...], the returned result is unspecified and meaningless., i.e., uses unspecified to refer to a valid, yet unpredictable (and most likely incorrect) result.
(Unfortunately, I’m currently unable to find in which thread and by whom the binary search example had been mentioned as requiring an invariance that cannot be checked by most compilers – except for some dependently typed languages which can enforce that a list must be sorted at compile time). In the end, it’s always a compromise on which programs slip through, either by the compiler as its unable to check this type of invariance or in dynamic languages by skipping a runtime check which might have raised an appropriate error.

1 Like

Quite right - I just thought it prudent to use the short version instead of the more extensive three parter this is ultimately based on, because most users reading this thread likely won’t read the full nuances :slight_smile:

Still, for completeness sake, here it is:

https://blog.regehr.org/archives/213

There is some thinking about creating a lower level Buffer type that would back an Array, and I suppose this would probably resolve these issues if one really wanted to construct an array via push!.

What I’m still confused about is if you know that the vector is going to be a certain size or least have some known upper bound, I’m still unsure why you would build an array using push!.

If you know you want to construct a vector of length N you could construct the array via

A = collect(f(i) for i in 1:N)
B = map(f, 1:N)

To me the push! case is really when the ultimate length is unknown. Even then, the strategy might be to allocate a large enough array and then return a view of the known elements. Importantly, the latter strategy also generalizes to N dimensions.

1 Like

push! is a very efficient way to represent and implement that. Just because the current implementation is mediocre does not mean the whole concept is needing replacement, just the implementation of it.

3 Likes

Let’s build an AbstractVector to handle this case then.

julia> begin
           mutable struct BufferedVector{T} <: AbstractVector{T}
               buffer::Vector{T}
               length::Int
               BufferedVector{T}(len = 0; capacity = len) where T = new{T}(Vector{T}(undef, capacity), len)
           end
           Base.size(A::BufferedVector) = (A.length,)
           Base.getindex(A::BufferedVector, i::Int) = (checkbounds(A, i); @inbounds A.buffer[i])
           Base.IndexStyle(::Type{<: BufferedVector}) = IndexLinear()
           Base.setindex!(A::BufferedVector, v, i::Int) = (checkbounds(A, i); @inbounds A.buffer[i] = v)
           Base.length(A::BufferedVector) = A.length
           Base.resize!(A::BufferedVector, i::Int) = begin
               i > length(A.buffer) && resize!(A.buffer, i)
               A.length = i
           end
           Base.push!(A::BufferedVector, v) = begin
               A.length += 1
               A.buffer[A.length] = v
           end
       end

I then benchmarked this against a few different methods of general array initialization.

julia> function benchmark_alloc()
           n = 2^8
           m = 2^14

           @info "sizehint!"
           @btime for _ in 1:$n
               v = Int64[]
               sizehint!(v, $m)

               for i in 1:$m
                   push!(v, i^3)
               end
           end

           @info "undef"
           @btime for _ in 1:$n
               v = Vector{Int64}(undef, $m)

               for i in 1:$m
                   v[i] = i^3
               end
           end

           @info "BufferedVector"
           @btime for _ in 1:$n
               v = BufferedVector{Int64}(; capacity = $m)
               for i in 1:$m
                   push!(v, i^3)
               end
           end

           @info "Array comprehension"
           @btime for _ in 1:$n
               v = [i^3 for i in 1:$m]
           end

           @info "Map"
           @btime for _ in 1:$n
               v = map(1:$m) do i
                   i^3
               end
           end

           @info "Collect Generator"
           @btime for _ in 1:$n
               v = collect(i^3 for i in 1:$m)
           end
       end
benchmark_alloc (generic function with 1 method)

Here are the results:

julia> benchmark_alloc()
[ Info: sizehint!
  21.395 ms (512 allocations: 32.03 MiB)
[ Info: undef
  3.053 ms (512 allocations: 32.01 MiB)
[ Info: BufferedVector
  2.991 ms (512 allocations: 32.01 MiB)
[ Info: Array comprehension
  2.169 ms (512 allocations: 32.01 MiB)
[ Info: Map
  2.048 ms (512 allocations: 32.01 MiB)
[ Info: Collect Generator
  2.033 ms (512 allocations: 32.01 MiB)
2 Likes

Nice!

Some notes:

  • You are missing a check in push! for the case that the length is equal to the capacity to increase it and maybe reallocate first.
  • The length should not be an argument for the constructor, not even an optional one.
  • You should not set the length equal to the new length in resize! if the new length is bigger than the old one.

Something like this should be used instead of the current arrays implementation to avoid ccalls.

2 Likes

It’s a nice proof of concept but it’d be a lot smoother to edit push! or :jl_array_grow_end, the former requiring exposure of the equivalents of length and capacity to the Julia side. I grabbed the Rust link from the other thread as an example of what it can look like. There is another function right below that one, push_within_capacity, that is only justified because capacity is exposed to the user, that’s about what your current version of push! does without the length < capacity check.

I’d say I wouldn’t like this limitation but since the buffer starts off as an undef array with length of capacity, there’s little point to starting this new type with nonzero length. It’s not like this new type currently leverages append! or broadcasting to fill in some elements.