Is there light at the end of the DataFrames tunnel?

So are you implying that we should use columns that are Union{T,Missing,Void}? That seems extremely pedantic.

I think it’s great that by using Union columns users can do this sort of thing, but I have to think that in the overwhelming majority of cases users would only want a single “invalid” type? (Use of the word invalid should not be misconstrued as a suggestion :laughing:)

Why would that be extremely pedantic? It’s just different from R and pandas, but I for one really like it. (Of course, you could just define const NA = Union{Missing, Void}.)
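A minimal sketch of what such a column could look like, assuming a current Julia where Void has been renamed Nothing. The alias NA follows the suggestion above; the column contents and variable names are invented for illustration.

```julia
# `Void` in this thread is called `Nothing` in Julia 0.7 and later.
const NA = Union{Missing, Nothing}

# Hypothetical column: `missing` = unknown value, `nothing` = not applicable.
col = Union{Int, NA}[1, missing, nothing, 4]

n_unknown = count(ismissing, col)          # entries that exist but were not recorded
n_not_applicable = count(isnothing, col)   # entries where no value makes sense
valid = Int[v for v in col if v isa Int]   # plain Vector{Int} of the real values
```

Whether the two kinds of "invalid" are worth separating is exactly the judgment call being debated here; the sketch only shows that the type system permits it.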

As I think about it more, I guess it’s fair. It’s more a question of what the default behavior is of these various different types. Like I said I really like that this is possible, but in reality I’m having a hard time imagining anyone actually doing this except in extremely rare and esoteric cases. I am a little afraid that if someone opens up the DataFrames documentation and they see a huge page about this they will run away in terror.

That said, nobody will be able to claim that Julia made the same mistake pandas did with missing values. No, Julia takes missing values very seriously :laughing:.

Well, that’s just an issue of presenting it. The usage of missing values is hopefully very simple and explained in one paragraph - and details could be hidden deeper… After all, if you document a function like println you also don’t drop a page about writing asynchronously to input output streams via libuv :stuck_out_tongue:

I’m now thoroughly confused :slight_smile: I thought the idea was that nothing is the software engineer’s missing value and that it should never be used in the data science stack. I had assumed that, for example, nothing would never propagate, e.g. log(nothing) would throw an error, not return nothing. I had also thought that missing would be the sole missing value in the data stack.
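For reference, a small sketch of that distinction as it behaves in current Julia releases (an assumption relative to the development versions discussed in this thread): missing propagates through scalar math, while nothing has no such methods.

```julia
# `missing` propagates through scalar math functions.
propagated = log(missing)

# `log(nothing)` has no method, so it throws instead of returning `nothing`.
threw = try
    log(nothing)
    false
catch err
    err isa MethodError
end
```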

Is the new plan now that we have two different missing values in the data stack that have slightly different semantics and that we encourage folks to use both in the data stack? And that one of the missing values in the data stack doubles as the software engineer’s missing value?

That all seems very confusing to me, to be honest. I think the idea of having one thing for the software dev story and one for the data stack still seems like the right choice. I think the more clearly we can keep these two apart, the better. So in my mind, whatever ends up being the software engineer’s missing value should just never show up as a value in the data stack, and ideally the naming of these things should encourage that separation. Whether we then need two different missing values in the data stack seems a separate question. I would strongly argue against it; I think it all just gets way too confusing and over-engineered. Better to just have one clear and well-defined set of semantics in the data stack, and if folks need something else, they would need to hand-roll that for their own situation.

Honestly, the non-missing part of the story hasn’t been well thought through at this point. We have enough work on our plate with just one kind of missing value. What is clear is that missing represents a missing value, i.e. a value which could be here in theory but that we don’t know for some reason. Using it as a representation of “invalid”/“not applicable” will work with most operations, but comparison operators use three-valued logic, which is not completely correct for “invalid”.
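A short sketch of the three-valued logic in question, using only Base functions as they exist in current Julia (variable names are just labels for the example):

```julia
# Comparisons involving `missing` are themselves unknown.
a = (missing == 1)          # missing
b = (missing == missing)    # missing, not true

# Logical operators give a definite answer when the unknown cannot matter.
c = true | missing          # true
d = false & missing         # false
e = true & missing          # missing

# `isequal` and `ismissing` always return a Bool.
f = isequal(missing, missing)   # true
```

For a value that is "not applicable" rather than unknown, an answer such as b is arguably the wrong one, which is the mismatch described above.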

@StefanKarpinski suggested that we introduce a dedicated invalid type, but AFAIK that was just an idea, there’s no concrete plan at this point. Other languages happily conflate several meanings under NA/NULL/missing, so that’s certainly not the end of the world if people use missing for “invalid”/“not applicable”. It’s just that Julia naturally allows distinguishing them, which tends to make us greedy.

Something that’s convenient when working with a dataframe is taking a column out, performing a bunch of operations on it as a Vector, and then putting it back in the dataframe. This way you can work with functions that don’t allow a DataFrame column as an input.

If I have a column with many “missing” values, would those missing values become something else when I go from having the column in a dataframe to having it be its own vector?

I emphatically agree with this. As far as I’m concerned, this is the “killer” feature of Julia DataFrames. It makes them wonderfully simple to work with and is a really big advantage over R and pandas. I don’t think anyone was suggesting that we sacrifice this.

To build on this, handling missing values is crucial here.

Let’s say I have a hypothetical function

function f(x::Vector)
    return x .+ 1
end

If f() cannot handle missing values, one might be tempted to just drop them

function f(x::Vector)
    # comparing against the missing value propagates it, so test with ismissing
    y = x[.!ismissing.(x)]
    return y .+ 1
end

Now we have no way to merge our vector back into the dataframe, since we have lost all the missing values and can no longer keep track of which row in the vector corresponds to which observation in the dataframe.
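The alignment problem with the two versions of f above can be seen in a small sketch. The vector x is an invented stand-in for a column taken out of a data frame, and skipmissing is the Base function available in current Julia.

```julia
# Stand-in for a column extracted from a data frame (values are made up).
x = [10, missing, 30, missing, 50]

# Dropping first: the result is shorter and row positions are lost.
dropped = collect(skipmissing(x)) .+ 1

# Propagating: same length as x, so it can go straight back into the table.
propagated = x .+ 1
```

Here dropped has three elements while propagated still has five, with missing in rows 2 and 4.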

From the point of view of data analysis, there are operations that should propagate a missing value (in @pdeffebach’s example, f(x) would return x .+ 1 with missing wherever x had missing) and those that should ignore them (mean(x), for example, should filter out the missing values). But neither kind should ever error out on a missing value.

I’m sure this has been discussed already.

This has been discussed ad nauseam. My impression from these debates is that there are no universally right semantics. No, not even in a specific domain like “data analysis”. Sometimes you want to silently drop values, sometimes you want to drop incomplete cases, and sometimes you want the function to just fail.

Lacking universally accepted semantics, the next best thing is to have

  1. a consistent set of rules which are well-defined, easy to reason about, and useful “most of the time”, and
  2. a flexible mechanism for getting other behavior, either on an ad hoc basis, or by switching to another well-defined set of rules (“programmer’s null”, “missing value”, “impossible value”).

For me, having a good mental model of the semantics which I have to override occasionally is preferable to having a mechanism that is obscure but tries to do the “right thing” all the time.
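As a rough illustration of "a default rule plus an explicit override", here is a sketch of a hypothetical helper. The name total and its on_missing keyword are invented for this example and are not part of any package.

```julia
# Hypothetical helper: the caller picks the missing-value policy explicitly.
function total(x; on_missing::Symbol = :propagate)
    if on_missing === :propagate
        return sum(x)                 # result is missing if any element is
    elseif on_missing === :skip
        return sum(skipmissing(x))    # silently ignore missing elements
    elseif on_missing === :error
        any(ismissing, x) && throw(ArgumentError("input contains missing values"))
        return sum(x)
    else
        throw(ArgumentError("unknown policy: $on_missing"))
    end
end

x = [1, missing, 3]
default_result = total(x)                     # propagates
skipped_result = total(x; on_missing = :skip) # drops missing entries
```

The default is the predictable rule; the other two behaviors have to be asked for by name.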

Sounds good :slight_smile:

The behavior which is already implemented in Missings.jl is that standard operators propagate missing values. The package also implements methods so that many Base functions operating on scalars propagate missing values (but some are not covered yet). For functions taking arrays, it’s up to the author to decide whether there’s an appropriate behavior for missing values, but in general the idea is that you would drop missing values or replace them with another value before passing them to functions, e.g. via mean(Missings.skip(x)) (Missings.skip will probably be renamed).
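A minimal sketch of the drop-or-replace pattern, assuming a current Julia where the function referred to above as Missings.skip is available in Base as skipmissing; the data are invented.

```julia
using Statistics  # standard library; provides `mean`

x = [1.0, missing, 3.0]

m_all = mean(x)                  # missing: the reduction propagates
m_skip = mean(skipmissing(x))    # 2.0: missing entries are dropped first
filled = coalesce.(x, 0.0)       # [1.0, 0.0, 3.0]: missing replaced by a chosen value
```

Replacing with 0.0 is only a placeholder choice here; which replacement value makes sense depends on the analysis.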

@dmbates @nalimilan @dave.f.kleinschmidt @quinnj I am compiling a Julia 0.7 development environment for statistical models. It should be done soon. Basically, it uses forks of the master branches of the related packages from JuliaData/JuliaStats… JuliaData includes the I/O packages (CSV and Feather) and the tabular packages (Missings, CategoricalArrays, DataFrames); for modeling there are StatsBase and StatsModels. These forks will have all the latest syntax (no deprecations and dropped support) and the versions will reflect the latest PRs and compatibility (e.g., StatsBase now has a coefnames and thus StatsModels will import and expand it to ModelFrames). It should be under JuliaEconometrics… I am still working out a few kinks after the latest merges of CategoricalArrays, which broke Feather for string columns. The main purpose is to allow for developing and updating statistical packages, but it can hopefully help in updating the source projects.

For now I recommend using Julia 0.7.0-DEV.2279 (Commit 8798936) and the master for most packages. Reading files is currently broken for files with string variables, but should work fine for only numeric ones.

I don’t really like the idea of having forks of all these packages floating around. What’s the idea behind this? We already have a hard time maintaining them with a small team…

I’ve just made a PR to tag a new StatsBase version including coefnames. Feel free to ping us when you need features which are on master for some time to be released.

The forks include only the packages whose master hasn’t been updated yet. As soon as the source is updated, I remove the fork and clone from the source. The idea is for the source projects to have time to update their infrastructure (e.g., porting Nulls to Missings) with the necessary breaks while providing a last “stable” version for package developers to work on their packages. For example, I was able to make the port from Nulls to Nothing using the forks and test those. The solution is only temporary and oriented towards package developers, not general usage (which probably should be using the latest release). An alternative to having those forks would be to make a dedicated branch which developers can pull from. Those branches could be updated to include the latest compatible stages even if not as thoroughly tested (this could be a good way to test changes as well). Let me know if you think the branches would be a better solution and I could submit candidates for those.

The light has finally arrived: DataFrames 0.11 released.
