Is accessing an `undef` array undefined behavior?

Sukera · July 22, 2023, 7:53am

And after some digging by @jar1 on Slack, this article came up:

Which I wholeheartedly agree with & is why I don’t want to call this kind of thing “undefined behavior”.

bertschi · July 22, 2023, 9:38am

Unfortunately, that post is also sloppy: most standards distinguish between undefined, unspecified and implementation-dependent behaviour (see this Stack Overflow post for instance).
The Common Lisp HyperSpec, to quote from the standard of a non-C-like language, gives a similar definition:

Unspecified: This means that the consequences are unpredictable but harmless. Implementations are permitted to specify the consequences of this situation. No conforming code may depend on the results or effects of this situation [...]
Undefined: This means that the consequences are unpredictable. The consequences may range from harmless to fatal. No conforming code may depend on the results or effects. Conforming code must treat the consequences as unpredictable. [...] An implementation is permitted to signal an error in this case.

Thus, given that the following program is unpredictable

v = Vector{Int16}(undef, 3)
if sum(v) > 0; "y" else "n" end

I guess that one could call it undefined behaviour.

For most newer languages, which have neither a standard nor multiple more-or-less conforming implementations, the distinction between unspecified and undefined is less useful. Still, undefined in the sense that “No conforming code may depend on the results or effects” is certainly valid terminology for warning against, and preventing, the use of certain constructs in certain contexts. To be precise, in the above example the undefined behaviour is not in the construction of the array, which always gives a valid instance, but in the access of its elements, since the properties of those elements cannot be relied on.
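To make that distinction concrete, here is a minimal sketch of what can and cannot be relied on with an `undef` array. It uses only Base functions; the variable names are just for illustration.

```julia
# Constructing an `undef` array is always valid, and its shape is reliable.
v = Vector{Int16}(undef, 3)
n = length(v)                 # 3, regardless of what the memory contains

# The element values are whatever happened to be in memory, so nothing
# should branch on them until every element has been written.
fill!(v, 0)
total = sum(v)                # predictable only because of the fill! above

# For non-bits element types, unwritten slots are tracked as unassigned,
# and reading one throws an UndefRefError instead of yielding garbage.
s = Vector{String}(undef, 2)
assigned = isassigned(s, 1)   # false until s[1] is set
```

In other words, code may depend on `length(v)` right after construction, but only on the element values after it has written them itself.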

Maybe adding to the confusion, the Rust documentation on binary_search_by states that “If the slice is not sorted [...], the returned result is unspecified and meaningless”, i.e., it uses unspecified to refer to a valid, yet unpredictable (and most likely incorrect) result.
(Unfortunately, I’m currently unable to find in which thread and by whom the binary search example was mentioned as requiring an invariant that cannot be checked by most compilers, except in some dependently typed languages, which can enforce at compile time that a list is sorted.) In the end, it’s always a compromise on which programs slip through: either past the compiler, because it’s unable to check this type of invariant, or in dynamic languages by skipping a runtime check which might have raised an appropriate error.
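As a rough illustration of that compromise in Julia, the sketch below wraps Base's `searchsorted` in a runtime check of the sortedness invariant. The helper name `checked_search` is hypothetical, and the check costs a full pass over the input, which is exactly why the Base search functions skip it.

```julia
# Hypothetical helper: binary search that verifies its precondition first.
function checked_search(v::AbstractVector, x)
    # O(n) runtime check of the invariant that binary search silently assumes.
    issorted(v) || throw(ArgumentError("input must be sorted"))
    r = searchsorted(v, x)            # range of indices whose value equals x
    return isempty(r) ? nothing : first(r)
end

found   = checked_search([1, 3, 5, 7], 5)   # index of the match
missing_ = checked_search([1, 3, 5, 7], 4)  # no match
```

Calling `searchsorted` directly on an unsorted vector still returns a valid range, just not a meaningful one, which matches the “unspecified” usage in the Rust documentation.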

Sukera · July 22, 2023, 12:05pm

Quite right. I just thought it prudent to use the short version instead of the more extensive three-parter it is ultimately based on, because most users reading this thread likely won’t read the full nuances.

Still, for completeness’ sake, here it is:

https://blog.regehr.org/archives/213

mkitti · July 22, 2023, 9:22pm

There is some thinking about creating a lower level Buffer type that would back an Array, and I suppose this would probably resolve these issues if one really wanted to construct an array via push!.

github.com/JuliaLang/julia

Buffer types for array backend

JuliaLang:master ← Tokazama:buffers

opened 11:21PM - 19 Feb 23 UTC

Tokazama

+2431 -45

## Motivation

After a conversation on Slack with @StefanKarpinski and @brenhi…nkeller it seemed like a lot of other PRs on improving arrays stalled out and the first fundamental step is having a generic buffer type. I still have crashes in the REPL with this current implementation and it appears I haven't fully implemented the GC side correctly, but I figured it was better to put this out there for someone else to start with than completely abandon this. The goal here is to provide the core functionality for bare-bones storage of multiple repeated elements.

## Implementation

There are three primary storage types this PR is intended to support:

* `Buffer{T}`: has a fixed size and mutable elements.
* `DynamicBuffer{T}`: has a dynamic size and mutable elements.
* `ImmutableBuffer{T}`: has a fixed size and immutable elements (not implemented here, but should be in a future PR with the `freeze`/`thaw` compiler optimizations worked on in prior `ImmutableArray` PRs).

The type layout is modeled after that of `jl_array_t` and is similar to the following native Julia structure:

```julia
struct MemoryChunk{T}
    length::Int
    data::Ptr{T}
end
```

Just like `Array`, the number of bytes stored determines if `data` points to inline allocations that extend the size of `MemoryChunk` via `jl_gc_alloc` or a separately stored chunk via `jl_gc_managed_malloc` and `jl_gc_track_malloced_buffer`. Storage of elements as bit unions, bits, or pointers to boxed types is identical to that of `Array` or `jl_array_t`. Note that there are no explicitly stored flags, offsets, additional dimensions, or element size. This means that:

* Unlike `jl_array_t`, all information about the element type must be derived again every time it's needed when running C code. Of course, this is not an issue once we are working with TBAA and that is all optimized away. However, I'm uncertain whether that will slow down things that don't often interact with TBAA directly (such as the garbage collector).
* We don't have `flags.how == 1` to mark a julia-allocated buffer that needs to be marked, or `flags.isshared`. I've tried to use the last two bits of the data pointer to mark these (see `jl_buffer_isshared` and `jl_buffer_isunmarked`). It may make more sense to have a type that explicitly wraps a shared buffer, preserving that bit for some other future use and allowing more explicit representation of data storage through the type system. I've yet to spend much time grokking the unmarked allocated data aspect of this, so there may be a better approach that I've yet to consider.
* The lack of an offset or maximum size means that resizing for `DynamicBuffer` will always result in a reallocation or new allocation. Feature parity with `Vector` is intended to be accomplished through Julia code with something like

```julia
mutable struct DynamicVector{T} <: DenseVector{T}
    buffer::DynamicBuffer{T}
    offset::Int
    length::Int
end
```

where the length of `DynamicBuffer` here is functionally equivalent to `Array`'s maxsize.

I'm in the process of moving the implementation to native Julia. This is a bit challenging since this is essentially an attempt to rewrite large portions of "array.c" where a lot of things aren't implemented in native Julia code (interacting with the GC, traversing `Union`s) and sometimes working efficiently with a pointer gets tricky (storing pointers to immutable types).

## Remaining Work

* Names may not be sufficiently self-documenting and may even be misleading. The term "buffer" is often used to refer to something more akin to what our current `Vector` is, with offsets and resizing. Perhaps something that more clearly describes a contiguous chunk of memory, such as `MemoryChunk`, `ResizableMemoryChunk`, and `ImmutableMemoryChunk`.
* What needs to be done here to support better allocation practices? There's a lot of interest in providing users with the ability to specify how things are allocated (allocating to the stack, bump allocators, smart pointers). I've assumed that most of that would be better addressed in future PRs by those who really understand how to optimize that sort of thing, but I'd also like to ensure that the implementation here is not prohibitive to future developments.
* Could we provide better support for `Bool` here so that we don't need an explicit `BitBuffer` type?
* More support for future implementation of `DynamicVector` with user-friendly resizing methods.
* The assumed effects for methods should change based on the buffer variant. This is currently buggy. For example, the effects for `length(::Buffer)` should not be the same as `length(::DynamicBuffer)`, but currently are.
* Would it make sense to implement `ImmutableBuffer` here without any of the `thaw`/`freeze` stuff that would complicate things?

gist.github.com

https://gist.github.com/JeffBezanson/a25dde3bebb5a734af87bb5ddcf31fb0

buffer.md

# Julep: Redesigning `Array` using a new lower-level container type

## History and motivation

From its inception, Julia has had an `Array` type intended to be used any time you need
"a bunch of things together in row". This is one of the most important and most unusual
types in the language. When `Array` was designed, it was believed (possibly by multiple people,
but Jeff is willing to take the blame) that the easiest-to-use interface would be a
single array type that does most of what users need. Many Julia programmers now feel that
the language and community have outgrown this trade-off, and problems are emerging:

What I’m still confused about is this: if you know that the vector is going to be a certain size, or at least has some known upper bound, why would you build the array using push!?

If you know you want to construct a vector of length N you could construct the array via

A = collect(f(i) for i in 1:N)
B = map(f, 1:N)

To me the push! case is really when the ultimate length is unknown. Even then, the strategy might be to allocate a large enough array and then return a view of the known elements. Importantly, the latter strategy also generalizes to N dimensions.
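A minimal sketch of that "allocate up to the bound, then return a view" strategy might look like the following. The helper name `collect_bounded`, its arguments, and the `Int` element type are assumptions made for the example.

```julia
# Hypothetical helper: keep f(i) for i = 1, 2, ... while `keep` holds,
# never storing more than `maxlen` values.
function collect_bounded(f, keep, maxlen::Integer)
    buf = Vector{Int}(undef, maxlen)   # one allocation of the upper bound
    n = 0                              # number of slots written so far
    for i in 1:maxlen
        x = f(i)
        keep(x) || break
        n += 1
        buf[n] = x
    end
    return view(buf, 1:n)              # exposes only the initialized prefix
end

squares = collect_bounded(i -> i^2, x -> x < 50, 100)
```

The uninitialized tail of `buf` is never reachable through the returned view, so no `undef` element is ever read.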

jameson · July 23, 2023, 12:03am

push! is a very efficient way to represent and implement that. Just because the current implementation is mediocre does not mean the whole concept needs replacing, just its implementation.

mkitti · July 23, 2023, 8:42pm

Let’s build an AbstractVector to handle this case then.

julia> begin
           mutable struct BufferedVector{T} <: AbstractVector{T}
               buffer::Vector{T}
               length::Int
               BufferedVector{T}(len = 0; capacity = len) where T = new{T}(Vector{T}(undef, capacity), len)
           end
           Base.size(A::BufferedVector) = (A.length,)
           Base.getindex(A::BufferedVector, i::Int) = (checkbounds(A, i); @inbounds A.buffer[i])
           Base.IndexStyle(::Type{<: BufferedVector}) = IndexLinear()
           Base.setindex!(A::BufferedVector, v, i::Int) = (checkbounds(A, i); @inbounds A.buffer[i] = v)
           Base.length(A::BufferedVector) = A.length
           Base.resize!(A::BufferedVector, i::Int) = begin
               i > length(A.buffer) && resize!(A.buffer, i)
               A.length = i
           end
           Base.push!(A::BufferedVector, v) = begin
               A.length += 1
               A.buffer[A.length] = v
           end
       end

I then benchmarked this against a few different methods of general array initialization.

julia> using BenchmarkTools  # provides @btime

julia> function benchmark_alloc()
           n = 2^8
           m = 2^14

           @info "sizehint!"
           @btime for _ in 1:$n
               v = Int64[]
               sizehint!(v, $m)

               for i in 1:$m
                   push!(v, i^3)
               end
           end

           @info "undef"
           @btime for _ in 1:$n
               v = Vector{Int64}(undef, $m)

               for i in 1:$m
                   v[i] = i^3
               end
           end

           @info "BufferedVector"
           @btime for _ in 1:$n
               v = BufferedVector{Int64}(; capacity = $m)
               for i in 1:$m
                   push!(v, i^3)
               end
           end

           @info "Array comprehension"
           @btime for _ in 1:$n
               v = [i^3 for i in 1:$m]
           end

           @info "Map"
           @btime for _ in 1:$n
               v = map(1:$m) do i
                   i^3
               end
           end

           @info "Collect Generator"
           @btime for _ in 1:$n
               v = collect(i^3 for i in 1:$m)
           end
       end
benchmark_alloc (generic function with 1 method)

Here are the results:

julia> benchmark_alloc()
[ Info: sizehint!
  21.395 ms (512 allocations: 32.03 MiB)
[ Info: undef
  3.053 ms (512 allocations: 32.01 MiB)
[ Info: BufferedVector
  2.991 ms (512 allocations: 32.01 MiB)
[ Info: Array comprehension
  2.169 ms (512 allocations: 32.01 MiB)
[ Info: Map
  2.048 ms (512 allocations: 32.01 MiB)
[ Info: Collect Generator
  2.033 ms (512 allocations: 32.01 MiB)

Mo8it · July 23, 2023, 9:58pm

Nice!

Some notes:

You are missing a check in push! for the case that the length is equal to the capacity to increase it and maybe reallocate first.
The length should not be an argument for the constructor, not even an optional one.
You should not set the length equal to the new length in resize! if the new length is bigger than the old one.

Something like this should be used instead of the current array implementation to avoid ccalls.
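For reference, the first two notes could be sketched roughly as follows. This is only an illustration: the name `GrowableVector`, the `capacity` helper, and the doubling growth policy are assumptions, not part of the code posted earlier in the thread.

```julia
# Sketch of a buffered vector whose length always starts at zero and whose
# push! grows the backing storage when it is full.
mutable struct GrowableVector{T} <: AbstractVector{T}
    buffer::Vector{T}
    length::Int
    # Only the capacity is configurable; the length is not a constructor argument.
    GrowableVector{T}(; capacity::Integer = 0) where {T} =
        new{T}(Vector{T}(undef, capacity), 0)
end

Base.size(A::GrowableVector) = (A.length,)
Base.IndexStyle(::Type{<:GrowableVector}) = IndexLinear()

function Base.getindex(A::GrowableVector, i::Int)
    checkbounds(A, i)                  # bounds are the logical length, not the capacity
    return @inbounds A.buffer[i]
end

function Base.setindex!(A::GrowableVector, v, i::Int)
    checkbounds(A, i)
    @inbounds A.buffer[i] = v
    return A
end

# Hypothetical accessor for the size of the backing storage.
capacity(A::GrowableVector) = length(A.buffer)

function Base.push!(A::GrowableVector, v)
    if A.length == capacity(A)
        # Full: grow before writing (doubling is an assumed policy).
        resize!(A.buffer, max(1, 2 * capacity(A)))
    end
    A.length += 1
    @inbounds A.buffer[A.length] = v
    return A
end

gv = GrowableVector{Int}(capacity = 2)
for i in 1:5
    push!(gv, i)
end
```

Because reads are bounds-checked against `A.length`, the uninitialized part of the buffer stays out of reach even after it has grown.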

Benny · July 23, 2023, 11:18pm

It’s a nice proof of concept, but it’d be a lot smoother to edit push! or :jl_array_grow_end, the former requiring exposure of the equivalents of length and capacity to the Julia side. I grabbed the Rust link from the other thread as an example of what it can look like. There is another function right below that one, push_within_capacity, that is only justified because capacity is exposed to the user; that’s about what your current version of push! does, minus the length < capacity check.
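To show what that check amounts to, here is a loose Julia sketch of a push that refuses to grow. The function name `push_within_capacity!` and its return convention (new length, or `nothing` when full) are assumptions for illustration and do not mirror the Rust signature.

```julia
# Hypothetical helper: write `v` after the first `len` used slots of `buffer`,
# but only if there is room; the caller keeps track of the logical length.
function push_within_capacity!(buffer::Vector, len::Int, v)
    len < length(buffer) || return nothing   # full: caller decides how to grow
    buffer[len + 1] = v
    return len + 1                           # new logical length
end

buf = Vector{Int}(undef, 2)                  # capacity 2, logical length 0
len1 = push_within_capacity!(buf, 0, 10)
len2 = push_within_capacity!(buf, len1, 20)
full = push_within_capacity!(buf, len2, 30)  # no room left
```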

I’d say I wouldn’t like this limitation, but since the buffer starts off as an undef array with length equal to the capacity, there’s little point in starting this new type with nonzero length. It’s not like this new type currently leverages append! or broadcasting to fill in some elements.
