Why does arrayref throw?

Well in that case we were talking about isbits types, the docs only mentions

The initial contents of a plain data type is undefined

For references, it said

While you are allowed to create objects with uninitialized fields, any access to an uninitialized reference is an immediate error… This avoids the need to continually check for null values.

Accessing #undef (the mutable pointer kind) is not not undefined behavior. It’s defined to throw. This ironically makes #undef more defined than the behavior for bits-type, but so is the linguistic irony.

5 Likes

Unlike the reference case, this would be a performance disaster. It would force every int/float/whatever array access to look in two places and potentially throw an error. Unlike the reference case where you have to do an indirect memory load anyway, that is a lot of overhead for something as lightweight as an Int load. Not to mention increasing the size of basically every object and making Julia’s struct layout no longer C-compatible. That’s an awful lot to give up for the sake of “they should behave the same.”

3 Likes
4 Likes

Yes, that’s basically what’s going on here. What is “undefined” is what content you’ll get, which is much more constrained than what C uses the term for. It’s just a very natural term to use though. I don’t think there’s any conflict between what I said and what Keno has said.

3 Likes

Thank you @StefanKarpinski for taking the time to explain the approach (and its rationale) you took with the language!

Perhaps I have an unwarranted fear of exceptions, but I feel they make the language more complicated. Not concerned about performance issues.

I promise to write on this board when I have figured it out :slight_smile:

But, since you asked …
For Vector{T} with value type T: allow undef initializer.
For Vector{T} with reference type T: don’t allow undef initializer.

Expanding the second case (reference type): I do believe that fill (with some default) and collect (from some generator) handles many cases. The other cases by Union{T,Nothing}.

Yes, I’m sure there are also cases where this is not good enough.

How do you implement fill and collect? I suppose in this design they would have to be primitives?

The other cases by Union{T,Nothing}.

This isn’t very satisfying since that forces Hoare’s “billion dollar mistake” on people taking that route.

Yes, I’m sure there are also cases where this is not good enough.

Right, if this weren’t a programming language then getting the majority of cases would be fine, but we have to allow people to implement literally everything, so no corner case is too small.

2 Likes

Yes, I was referring to the isbits portion of stefan’s comment, I should have been clearer about that - sorry! Throwing a UndefVarError for references is certainly well defined behavior (albeit maybe confusing terminology). I’d personally prefer it be called UninitVarError, but that’s a topic for another day.

Gotcha, then there must have been a misunderstanding in the other thread, because the stance that “we shouldn’t use undefined behavior for this” has been my stance there as well. Keno never clarified my question of what exactly he meant with “undefined behavior” :person_shrugging:

There is the idea that using calloc instead of malloc would “fix” this across the board, but as it turns out, that also doesn’t quite work; you can easily have a datatype where there is no zero and/or where its bitpattern needs to contain at least one 1, resulting in calloc producing invalid instances yet again. In general, you need to go through a constructor to create valid instances of any type.

I would guess with the help of something like Vector{T}(iterator) (for iterator supporting size(iterator)), or Matrix{T,k}(element_constructor, dims), where element_constructor is some function from indices - here element_constructor is a bad name, this function could supply existing objects, or create new.

With fill and collect I should also have listed vector/matrix comprehension.

I agree to some extent, but I think this is oversimplified.

In e.g. C you always refer to objects of type T through a pointer (T *obj), thus every function working with T objects must check for nullptrs. The type Union{T, Nothing} is not the same as T. So, we are in a much better place and get protection from function interfaces, and can “contain the spread” of the nullptr-tainted data: Functions operating on T objects can assume the input is valid objects, and data validation and error handling is done earlier. But, you know this.

Yes, you are right of course. I certainly don’t claim to know how to do it “better than Julia”, or that “Julia does a bad job”. Far from. I’m only trying to explain what I experienced as counterintuitive/confusing.

I do think that a language should be optimized for the “normal case”, not for edge cases. Performing run time checks to make sure that the programmer did not violate protocol when creating the data seems to go against this. To me it seems we should prefer to make sure data is correct at creation, over to check it every time it is accessed. However, I also understand that a design involves tradeoffs along multiple dimensions. A lot of hassle creating data leads to a cumbersome user experience, and perhaps a less flexible language. But, these things are not in focus in my current project :slight_smile:

Yes. Nothing I wrote is in contradiction with this, right?

Yes, not directly. I just meant that “some default” is doing quite a lot of heavy lifting there; if you want to eliminate the problem entirely, allocating things in bulk ought to mean requiring the user to give that default for their type. There is no default the language can give you a priori that works across the board, without then also repeating that “billion dollar mistake”.

2 Likes

This is not the implemented semantics. Using the result of the value e.g. in a branch is actually undefined behavior in the C sense, currently. I.e. in LLVM terms the value is undef, not freeze undef. Of course when this behavior was implemented LLVM didn’t have the distinction (LLVM’s undef behavior was not particularly well defined for many years), but we never changed this.

4 Likes

The issue is that allocating memory and then filling it with valid values is the normal case—it happens every time you create a new array or object. Of course many users use functions like fill, collect, comprehensions and so on that only return full initialized arrays, but it’s a pretty big hole in the language if implementing such functions isn’t possible.

1 Like

I’m not sure I understand what you mean by this. The thing I tried to suggest was to make the vector constructor take some generator, iterator, or map, so that it could initialize the vector at creation. This is pretty standard, no?
[EDIT: removed bad example, mixed this up with another discussion]

We already have comprehensions and other methods for this, and they all internally create uninitialized arrays at the start. When there are input arrays, this is handled by similar. Uninitialized arrays must be exposed at the language level for these to be implemented directly in Julia rather than wrap C. Adding redundant methods to array constructors won’t remove that need.

Yes, these methods (fill, similar, collect, comprehension, …) use an array constructor (well, at least I assume so - I have only checked a few). For example, in array.jl we have

fill(v, dims::NTuple{N, Integer}) where {N} = (a=Array{typeof(v),N}(undef, dims); fill!(a, v); a)

Thus, in this case the array is constructed with the undef initializer, and then filled with fill!.

The construction methods appears to be defined in boot.jl. For example, the 1d case looks like this.

Array{T,1}(::UndefInitializer, m::Int) where {T} =
    ccall(:jl_alloc_array_1d, Array{T,1}, (Any, Int), Array{T,1}, m)

Digging deeper, in array.c we see that jl_alloc_array_1d does its job via

static jl_array_t *_new_array_(jl_value_t *atype, uint32_t ndims, size_t *dims,
                               int8_t isunboxed, int8_t hasptr, int8_t isunion, int8_t zeroinit, size_t elsz)

This method allocates memory, and perform initialization if needed (nullptrs when storing reference).

Now, if jl_alloc_array_1d is viewed as a “language internal” function (don’t know if this is the case), then one could define

function Array{T,1}(::InitializerFunction, m::Int, initializer) where {T}
    a = ccall(:jl_alloc_array_1d, Array{T,1}, (Any, Int), Array{T,1}, m)
    for i = 1:m
        a[i] = initializer(i)
    end
    a
end

or something more though through in the same vein.

Then one could define fill without undef like

fill(v, dims::NTuple{N, Integer}) where {N} = Array{typeof(v),N}(InitializerFunction(), dims, () -> v)

If jl_alloc_array_1d is a user exposed function, then I guess the modifications would go into _new_array_.

Out of curiosity, do you have example for use cases that need uninitialized arrays storing reference type data?
As I said before, I think for most use cases the vectors can be initialized immediately,
and in other cases Union{T, Nothing} could be used. And, I’m curious to know if there a big gaps in my thinking here, and in that case what those are.

Another thing that probably would help, is to make vectors distinguish between capacity and size. This would make it possible to have a partially (in a controlled way) initialized vector - the uninitialized part would be out of bounds. Thus making it more convenient to initialize the vector.

There is

help?> sizehint!
search: sizehint!

  sizehint!(s, n) -> s


  Suggest that collection s reserve capacity for at least n elements. That is,
  if you expect that you're going to have to push a lot of values onto s, you
  can avoid the cost of incremental reallocation by doing it once up front;
  this can improve performance.

  See also resize!.

  Notes on the performance model
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  For types that support sizehint!,

    1. push! and append! methods generally may (but are not required to)
       preallocate extra storage. For types implemented in Base, they
       typically do, using a heuristic optimized for a general use case.

    2. sizehint! may control this preallocation. Again, it typically does
       this for types in Base.

    3. empty! is nearly costless (and O(1)) for types that support this
       kind of preallocation.

which does seem to suggest that there is a difference between capacity and size for some types, though I’m not sure which.

It makes a small difference for a normal Array at least, so probably works there.

julia> function f!(v)
           for i in 1:100000
               push!(v, i)
           end
       end
f! (generic function with 1 method)

julia> @benchmark f!(v) setup=(v=Int[])
BenchmarkTools.Trial: 4999 samples with 1 evaluation.
 Range (min … max):  713.700 μs …   2.259 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     806.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   937.147 μs ± 243.461 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇█▂▁
  ██████▆▅▅▅▅▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▆▅▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▁▁▁▁▁ ▂
  714 μs           Histogram: frequency by time         1.49 ms <

 Memory estimate: 1.83 MiB, allocs estimate: 10.

julia> @benchmark f!(v) setup=begin v=Int[]; sizehint!(v, 100000) end
BenchmarkTools.Trial: 5276 samples with 1 evaluation.
 Range (min … max):  622.100 μs …   3.599 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     884.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   893.094 μs ± 152.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    █▂▃
  ▄▅▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▆███▇▇▆▅▅▄▅▄▄▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  622 μs           Histogram: frequency by time         1.41 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
2 Likes

In other thread, push!ing to a sizehint!ed Array was found to be less performant than initializing an uninitialized Array. Most of the difference isn’t inherent, push! could have an optimized branch where it’s effectively initializing an element + incrementing the semantic size.

This would make an uninitialized array on the Julia level so the compiler still cannot assume all arrays are fully initialized. I get you’re trying to justify eliding that defined-reference check, but that is not going to go away even if you somehow guaranteed fully initialized arrays because checking pointers is part of Julia’s memory safety designs; a segfault isn’t even the worst thing that can happen, and errors are safer and more informative. That defined-or-error check gets handled by branch prediction so it’s not even a performance loss.

I wouldn’t like to be stuck with 1 initializer, which itself is stuck at 1 argument, which in turn is stuck in 1:m or whatever indices. Instead of writing straightforward loops, users have to shove the entire implementation into a boilerplate closure.

You’re not supposed to index uninitialized elements, so nothing beyond incremental array initialization in pure Julia. The error branch is just there to let you know what mistake you made and prevent more unsavory consequences like segfaults and data corruption.

Could you elaborate on what you mean?

I don’t agree that Union{T, Nothing} is as bad as “plain pointers”, as I also wrote in reply to @StefanKarpinski.

And, asking for a default constructor for T in order to support Vector{T}(n) does not appear completely crazy. For example, this is how C++ works, right?

For example, you cannot do

class A
{
private:
    A();
    int a;
};

int main()
{
    std::vector<A> a(5);
    return 0;
}

And in

std::vector<int> a(10);
std::vector<std::vector<int> > b(10);

everything is fully initialized. Not saying that C++ “does it right”, I’m just saying there are alternatives.

Sure - what I thought you meant was giving some default of type T, which doesn’t work. The “billion dollar mistake” is having such a default value that typechecks as such a T, for any T. Your example of Union{T, Nothing} is quite different though; Nothing doesn’t typecheck as a T. Additionally, I think it’d be quite confusing if I’d request a Vector{T} and get a Vector{Union{Nothing, T}} back instead (not to mention causing type instability across the board). Hence, the only safe way to get an actual object of type T means requiring the user to either provide a default, or require a function that constructs an object for every index.

This is of course on top of the difficulty in providing such an interface if you want to be able to implement functions like fill in the language itself, instead of getting them as builtins.

1 Like