Adding two Array{Union{Missing,Float64}} returns Array{Float64}, is this by design?

For starters, you can calculate a Matrix{<:Number}'s determinant with a fallback routine, right? Now throw in missings. Also, it can be a missing of anything, for example if you have a collection of pairs or a collection of vectors. It just doesn’t make much sense to see missing as strictly a scalar.

Yes, this behavior of broadcast can be annoying sometimes (other times it’s what you want, though). But it’s unavoidable, as the return type should not depend on type inference, only on the actual contents of the result (otherwise every change in the compiler could change the return type, which would often be Vector{Any} when inference fails). Preallocating is the right solution (if you know the return type in advance), and with similar it will be as efficient as if broadcast handled it itself.

Ideally convert(AbstractArray{Union{T, Missing}}, x::Vector{T}) would avoid making a copy, so that would also be an efficient solution, but currently it does copy, so the similar approach is better.
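To illustrate the preallocation approach described above, here is a minimal sketch. The variable names are made up for the example, and it assumes you already know that the result should allow missing values.

```julia
# Two vectors that allow missing but happen to contain none.
x = Union{Missing,Float64}[1.0, 2.0, 3.0]
y = Union{Missing,Float64}[4.0, 5.0, 6.0]

# Broadcast picks the element type from the actual contents,
# so the Missing part of the union is dropped here.
z = x .+ y
@show eltype(z)

# Preallocate with the element type you want, then broadcast in place.
out = similar(x)      # same size and element type as x, uninitialized
out .= x .+ y         # fused, writes directly into out
@show eltype(out)
```

Because `out .= ...` writes into the existing array, the element type chosen at allocation time is kept regardless of what the values turn out to be.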

Could you please help me understand the advantage of using similar compared to copy?
Using @time multiple times, or @btime, seems to favor copy:

using BenchmarkTools
y = 1:1_000_000
@btime similar($y)  # 29.08 μs (2 allocations: 7.63 MiB)
@btime copy($y)     # 0.001 ns (0 allocations: 0 bytes)

NB: using Julia 1.7.0-beta4 on Win10.

Thanks.

I believe BenchmarkTools advises not to trust results when they are at the 1 ns scale. My guess was theoretical: in theory, similar simply allocates memory for an Array of the same size and type, whereas copy needs to copy every single value of that Array.

Imagine an Array of 1 GB: would it be faster to copy the Array or simply allocate 1 GB of memory? All this in theory 🙂

That’s why I was asking about treating missing as a Number the same way NaN is a Number despite the fact it literally means Not-a-Number.

Missing values are ubiquitous in analytics, and nearly every time I need to resort to Union types it is to deal with missing values. Given the analytical goals of Julia, I thought it might make sense.

Thank you for your answer though @nalimilan

(a) I have found using the setup form of @btime to be helpful.
(b) 1:10 is a range, collect(1:10) is a Vector

  • copying a range does not allocate, but your use would take a vector
  • with larger vectors, similar is much more efficient

julia> using BenchmarkTools
julia> arange = 1:10_000;
julia> avector = collect(arange);

julia> @btime copy(z) setup=(z=arange;);
  14.715 ns (0 allocations: 0 bytes)
julia> @btime similar(z) setup=(z=arange;);
  701.546 ns (2 allocations: 78.17 KiB)

julia> @btime copy(z) setup=(z=avector;);
  4.050 μs (2 allocations: 78.17 KiB)
julia> @btime similar(z) setup=(z=avector;);
  714.721 ns (2 allocations: 78.17 KiB)
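
The difference between the two cases can also be checked without timing anything. This is only a small sketch (the names r and v are arbitrary) showing what each call returns for a range versus a Vector:

```julia
r = 1:10_000        # a UnitRange: stores only its endpoints
v = collect(r)      # a Vector: stores every element

# Copying a range gives back a range, so there is nothing to allocate.
@show typeof(copy(r))

# similar on a range has to produce a mutable container, so it allocates a Vector.
@show typeof(similar(r))

# For a Vector, copy allocates and fills a new array; similar only allocates.
c = copy(v)
s = similar(v)
@show c == v
@show c !== v
@show size(s) == size(v)
```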

@JeffreySarnoff, thanks for your time and very clear explanation. Sounds very good.

While we are at it, what would be the drawbacks in your workflows of using NaN instead of missing? In practical situations, it is not uncommon to have missing instrument readings labelled as NaN.

I would say the main drawbacks would be poor user experience and poor performance. Imagine you have a large amount of data with missing values and you want to use a package which forces you to relabel all missing values as NaN, and then yet another package uses negative values to identify missing values and you need to relabel again, and now the next package is using columns in a Matrix where you typically have rows, and then…

It is important to have some community agreement on how to format data and that is why, unless I have no other choice, I follow the Julia way as much as I can.
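
As a small illustration of how the two conventions differ in practice, here is a sketch using only Base functions. The data values are invented for the example.

```julia
# NaN is a Float64 value, so it only fits floating-point data.
readings = [1.5, NaN, 3.0]
@show sum(readings)                    # NaN propagates through arithmetic
@show sum(filter(!isnan, readings))    # has to be filtered out by hand

# missing works with any element type, here integers.
counts = [1, missing, 3]
@show eltype(counts)
@show sum(skipmissing(counts))         # skipmissing is the standard way to ignore missings

# The two also differ in how they behave under comparison and in the type hierarchy.
@show NaN == NaN
@show missing == missing
@show NaN isa Number
@show missing isa Number
```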

In most situations where I’m working with data, I want to fix any failing inference so the compiler and I can both understand what’s going on. Having Vector{Any} would be fine then since I’d have to fix my function anyway.

When would I want this behavior? I’m wondering how unavoidable this really is.

“Most” typically isn’t a strong enough guarantee when you design a programming language. 😉 People are not happy if their programs only work “most” of the time, so Julia has to behave exactly the same whether or not inference has been able to guess in advance what the return type will be.

Maybe you’d be fine with a Vector{Any} in many situations, but in others it can create problems. For example, many functions expect a vector with a particular element type. Also, in terms of performance, a Vector{Any} will make everything slow further down the line, which may have much more dramatic consequences than the original inference failure (which is sometimes perfectly fine).
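
A minimal sketch of the first problem, using a hypothetical function total that only accepts Vector{Float64}:

```julia
# A hypothetical function written for one specific element type.
total(v::Vector{Float64}) = sum(v)

good = [1.0, 2.0, 3.0]        # Vector{Float64}
bad  = Any[1.0, 2.0, 3.0]     # same values, but element type Any

@show total(good)

# There is no method of total for Vector{Any}, so calling total(bad)
# would throw a MethodError even though every element is a Float64.
@show applicable(total, bad)
```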

Maybe this is veering too far off topic but I wonder: would it be possible to create a macro for telling the compiler to use the inferred return type on a particular function?

Yes, a macro could probably get the Broadcasted object, see whether the return type of its function is inferred, and use that to allocate an array with that element type before applying the operation in-place.
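
A rough sketch of that idea, written as a plain function rather than a macro to keep it short. The name inferred_broadcast is made up for this example. It relies on Base.promote_op, which is not exported and whose result depends on what inference can figure out, so treat this as an illustration under those assumptions and not as a recommended implementation.

```julia
# Hypothetical helper: allocate the destination from the inferred element
# type instead of from the actual contents of the result.
function inferred_broadcast(f, args...)
    # Build the lazy Broadcasted object and compute its axes.
    bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(f, args...))
    # Ask inference what f returns for these element types.
    # This falls back to a wide type such as Any when inference fails.
    T = Base.promote_op(f, map(eltype, args)...)
    # Allocate the destination with that element type and fill it in place.
    dest = similar(bc, T)
    return copyto!(dest, bc)
end

x = Union{Missing,Float64}[1.0, 2.0]
y = Union{Missing,Float64}[4.0, 5.0]

@show eltype(x .+ y)                      # narrowed from the contents
@show eltype(inferred_broadcast(+, x, y)) # taken from inference instead
```

The trade-off is the one discussed above: the element type of the result now depends on what the compiler manages to infer, so it can change between Julia versions.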
