Should Generators finally be given eltype

ndinsmore · February 6, 2019, 2:44pm

With the compiler improvements that went into v1 is it finally time to allow Generators to return an element type? With union splitting, it now appears that all the needed information can be inferred at compile-time.
So shouldn’t something as simple as:

function Base.IteratorEltype(::Type{Generator{I,T}}) where {I,T}
        Base.IteratorEltype(I)===Base.HasEltype() &&
                Length(ITA=Base.return_types(T,Tuple(I))) == 1 &&
                ITA[1] != Union{} &&
                return Base.HasEltype()
        return Base.EltypeUnknown()
end

function eltype(::Base.Generator{I,T}) where {I,T}
   promote_type( Base.return_types(T.instance,(eltype(I),))...)
end

This has certainly been covered before in:

This solution passes the tests that @yuyichao put forward in the first reference.

I apologize if this is beating a dead cat.

nalimilan · February 6, 2019, 2:59pm

I think the general objection is that inference should not be exposed to users. This means the return value of eltype shouldn’t depend on the result of Base.return_types.

yuyichao · February 6, 2019, 3:36pm

If you are talking about the “test” cases for @typed in that one, those are just showing that the correct answer cannot be derived from syntactic info, which was the point of @typed. It comes after the discussion why an inference based implementation is unacceptable, which is what you brought up here.

It’s not but it’s impossible to define this API without a formal definition of type inference or at least the inferenced used for this. It is not to say it’s impossible (extremely unlikely though) but this feature is the wrong thing to ask without such a definition.

How much can the inference do is completely irrelevant for implementation of such a feature. In fact, the absense of such a definition (i.e. constraint) is the very reason inferenced can be improved in the ways you’ve seen at all.

ndinsmore · February 6, 2019, 5:53pm

@yuyichao & @nalimilan thank you for your explanations both of which make sense to me… I think.
That said with the limitation/flexibility of the current system could you help me understand how to solve the problem that drove me to ask this:

cstjean · February 6, 2019, 6:45pm

As someone who works with industrial time series of type ::Vector{Union{Float64, Missing}}, I’ve come to the conclusion that inference-based eltypes (as in Julia 0.4) would be much easier to work with than the 4 data-dependent cases I have to handle when I use generators/comprehensions/broadcasting.

julia> x = [missing, 2]
2-element Array{Union{Missing, Int64},1}:
  missing
 2       

julia> typeof(x[1:0] .* 1)
Array{Union{Missing, Int64},1}

julia> typeof(x[1:1] .* 1)
Array{Missing,1}

julia> typeof(x[1:2] .* 1)
Array{Union{Missing, Int64},1}

It would be formally acceptable to say “the eltype of a comprehension/generator is guaranteed to be a supertype of the inferred eltype”, and it would not constrain the implementation at all. I know, Julia is not supposed to be that kind of language, but as you well know it’s already done for empty comprehensions. Has there been significant issues in practice? The only potential problem I’m aware of is if someone has two methods f(::Vector{Union{Float64, Missing}}) and f(::Vector{Missing}) which are semantically different. But as far as I’m concerned, that’s bad code.

nalimilan · February 6, 2019, 8:21pm

Have you encountered real situations where this type unstability is a problem in practice? I’ve never heard about issues with it (except in complex systems such as Query, due to the combination with named tuples).

jeff.bezanson · February 6, 2019, 10:48pm

Right; to some extent it’s ok to use the inferred type, as long as your code is fine with it being Any. It’s safe to use as a hint, like a size hint. For example I can imagine other implementations of Julia, or other execution environments, where return_type just always returns Any so that the type inference code doesn’t have to be included. But in my experience most people are not fine with Any; they have a more specific type in mind they’d like to get.

Inferred types do not have to get more specific over time, as the compiler improves, either. In fact at this very moment I am hoping to make many inferred types worse to speed up the compiler. Learn to love Any; it’s efficient to deduce!

The OP has an interesting point: since we have IteratorEltype anyway, we could potentially return HasEltype for generators when, say, return_type is a concrete type. Of course, then people would have to be ok with IteratorEltype being unpredictable the same way return_type is.

yuyichao · February 6, 2019, 10:59pm

That just completely destroy any non-breaking guarantee though… Especially given what people actually depends on in the past and how easy it is to depend on these behavior implicitly…

jeff.bezanson · February 7, 2019, 12:25am

On the whole I agree. We don’t want to deliberately expand unpredictable behaviors like that.

ndinsmore · February 7, 2019, 1:14am

Would it be possible to have a function like return_type that would be much more computationally expensive but more predictable? Called maybe deduced_return_type.
Then add a non-breaking addition to the iteration interface of IteratorDeducedEltype HasDeducedEltype and deduced_eltype all which would by default fall back to the non-deduced versions when not explicitly define? which it would be for Generators.

That way there would at-least be a base way to do:

myfunction(iter::IT)=_myfunction(deduced_eltype(IT),iter)
function _myfunction(::Type{T},iter) where T <: MyType 
      ...
end

yuyichao · February 7, 2019, 4:50am

Computationally expensive isn’t even the issue. The issue is,

AFAICT, this is very non-trivial to define, especially if you want it to get anywhere near what the inference is currently capable of doing. Strictly speaking, it also means that any change to such a definition is a breaking change and practically at least any change that can cause a wider type to be returned is breaking.

This no regression requirement might seem easier to maintain and reason about, but note that the inference is coupled with optimization and different transformations in optimization is known to interfere with (i.e. hide information from) each other, meaning improvements may in some cases cause regression as well. Again, certainly won’t happen for most users for most cases but with enough users someone will definitely hit such case…

I’m not sure if this is true since I’ve never learn much(any?) formal theories about it so sorry if any part of this is inaccurate, but IMHO one of the main issues is that the inference is currently defined/constraint by a set of valid rules and transformations. This is enough to guarantee correctness and as long as we implement enough of these rules it’ll be good enough for performance too. However, to get predictable result following this route seems to not only require defining a fixed set of rules that can be applied but also in which order/manner they are applied (any change to either can lead to observable regression as mentioned above). In another word, following what we are doing, we can only define type inference based on an implementation and I believe such a definition is usually not very useful, (e.g. for the purpose of deciding whether we can make a change to it without being breaking).

Tamas_Papp · February 7, 2019, 7:34am

I think the correct long-run idiomatic solution would be exposing the building blocks of what collect does, so that users can easily write code that assumes a certain type (as @jeff.bezanson said, like a size hint, possibly deduced from the first element obtained) yet seamlessly go to a different branch if that assumption is violated. When the compiler knows a lot, if can prune these branches and emit very efficient code, but even the worst case heuristic fallbacks (widen as necessary) are not that bad.

I have used one component of this in the example code here, and I am working on a strategy that does something similar for “pre-allocated” buffers for output types, but that needs quite a bit of polish.

yuyichao · February 7, 2019, 3:11pm

But Base.return_type is the building block? It should be the main function needed for that. I don’t feel like the exact code that is written around it to implement collect and other function is particularly reusable. The pattern/idea certainly is useful/reusable of course.

Or in another word, I feel like if you can formulate your problem into collecting an iterator, you can just use collect. Otherwise, you’ll probably need to make some non-trivial tweak to the code pattern in collect and I’m not sure how easy it is to split out more reusable part. Of course, this was my feel of it when I looked at it a while ago and I can easily be wrong on this since I don’t have a proof.

cstjean · February 7, 2019, 3:37pm

Just yesterday I had code that worked fine for several months

v::Vector{Union{Float64, Missing}}
t = f.(v)
t .= ifelse.(ismissing.(t), NaN, t)

crash because v was all-missing on some new data, so t::Vector{Missing}, which cannot store NaN. Thankfully, that didn’t happen in production, but it’s a serious concern whether we’ve tested all cases.

I agree with @Tamas_Papp that we’re missing some building blocks for generic code. For instance, here’s a short version of the basic loop for generic Kalman filter code:

function kalman_filter(initial_state, observations, theta, N)
   filtered = []
   likelihoods = []
   state = initial_state
   for i in 1:N
       state, ll = kernel(state, observations[i], theta)
       push!(filtered,  state)
       push!(likelihoods, ll)
   end
   return filtered, likelihoods
end

How should I pick the eltype of filtered? I could naively take the type of initial_state, but then if I use ForwardDiff to do gradient descent on theta, the state and likelihoods will contain DualNumbers, so that does not work. What works:

Use Base.return_type. Oh, but maybe we need to call it several times until convergence of the type of state
Use accumulate. Have the accumulator function return the tuple (state, ll), then unzip the sequences afterward.
Do one or two iterations of state = kernel(...) just to pick the eltype, and hope that observations doesn’t have any missings.
Do my own type widening.

I tried accumulate. The code was ugly, and performance was bad, but it’s a solution in theory. Type widening doesn’t sound like beautiful code either. As for Base.return_type, I had some bad experience using it in Julia 0.6, but if it’s officially The Julian Way, I could give it another try. Right now, we use option #3, and I’m not happy with it, but it generally works.

Tamas_Papp · February 7, 2019, 3:44pm

No, the building block is an interface for a collection that can

either store a new value (eg <: eltype),
or allocate a new collection that can.

The caller determines if widening is needed, and calls itself accordingly. Just review “take 2” here, it is really short.

Tamas_Papp · February 7, 2019, 3:45pm

This. But we need to abstract it to nice building blocks.

yuyichao · February 7, 2019, 3:55pm

As Jeff said, the issue is even wider type and no unless we remove eltype completely, writing functions that depends on it isn’t bad code. There are countless reason one might want to rely on eltype. The fact that these questions come up is the proof…

Yes, I know. But I mean that’s really just using the building block already…

I think your code clearly show that the majority of the code is written already, AFAICT most of the code in there is actually dealing with iterator, i.e. the customized and non-reusable part, rather than dealing with widening. As you said, that part is really short. My point is basically that there’s not much point to simplify 3 lines of code to 1 when it’ll add extra restriction and is expected to be used in another 10 lines of code…

And I’m just saying that AFAICT the fact you can write the code within so few line proofs that the building block is there. I don’t believe there’s a strong need to add any code as building blocks. Documenting the correct pattern or giving more examples is certainly always welcome.

nalimilan · February 7, 2019, 3:57pm

cstjean:

Just yesterday I had code that worked fine for several months
v::Vector{Union{Float64, Missing}}
t = f.(v)
t .= ifelse.(ismissing.(t), NaN, t)
crash because v was all-missing on some new data, so t::Vector{Missing} , which cannot store NaN . Thankfully, that didn’t happen in production, but it’s a serious concern whether we’ve tested all cases.

OK, indeed that’s annoying.

However in that particular case you’d probably better allocate a new vector since you know it won’t contain any missing values. Then performance will be higher down the line. And replace(t, missing=>NaN) or coalesce.(t, NaN) is quite compact to write.

yuyichao · February 7, 2019, 4:08pm

I’ll note that this crash is in general impossible to avoid given the way the code is written. Your complaint is that the eltype is too narrow (instead of too wide, which is what almost every other complaint is) and you want to use inference because it gives you a wider type. However, there’s absolutely no way we can guarantee any particular wider type is returned so using inference here won’t actually fix your problem in all cases.

In another word, you are relying on the inference not able to figure out the actual runtime return type of f, which might be true now but that’s not something we can guarantee. We can only guarentee that the inferenced type is a super type of the runtime type, if the runtime type is only Missing for a run, it’s entirely allowed for the inference/optimization of that particular run to return Missing as the inferred type.

cstjean · February 7, 2019, 4:41pm

I’m saying that the fewer branches to test, the better. f.(::Vector{Union{Missing, Float64}}) gives me four possible outcomes to worry about. When I don’t care about performance, I use purely functional, non-mutating code, and it’s a non-issue. But Julia is all about performance, so code like mine happens, and it’s a significant worry for real-time systems. @andyferris said the same:

As a programmer, this scares me somewhat because I might get subtly different program behavior (or program correctness) depending on my input data. My concerm is that I might perform tests with mixtures of Missing and Float64 input values but be caught out in rare cases in production when all (or none) of the inputs are missing.

I understand the appeal of the current rules, from a certain perspective. I don’t really have a solution to offer. I’m just giving my report from real-world usage. Maybe I just need to assume that I can’t rely on the type of f.(v) at all, and code accordingly.

Topic		Replies	Views
Can `eltype()` deduce the element type of a generator? Internals & Design data	30	2934	December 11, 2019
`eltype` could be smarter Internals & Design	8	630	May 4, 2022
Type instability in list comprehensions General Usage repl , type-stability , comprehension	46	779	August 9, 2024
Collect and type inference Internals & Design question	5	644	January 13, 2021
Inference bug when using nested generators? General Usage	21	873	June 23, 2022

Should Generators finally be given eltype

Related topics