Announcement: An Update on DataFrames Future Plans

10 posts were split to a new topic: Representing Nullable Values

Potentially, but there are tradeoffs with performance. For now, it will be one extra byte per array element. Also note that the compiler improvements currently underway generalize to more than just Vector{Union{T, Null}}: they support Arrays of arbitrary dimensions, Unions of more than 2 types, and the modifying operations of Vector (e.g. push!, append!, deleteat!).

We could certainly explore the limited case where you only have a Union of 2 types and use a single bit per array element, but it would slow down indexing because of the additional byte-to-bit manipulations needed.
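To make the byte-versus-bit tradeoff concrete, here is a rough sketch comparing the two flag layouts. It is only an illustration, not the layout code in Base: the `ByteTagged` and `BitTagged` types and the `isnullat` helper are hypothetical names made up for this example.

```julia
# Hypothetical illustration only; these types are not part of Base or Nulls.jl.
struct ByteTagged{T}
    values::Vector{T}
    isnull::Vector{UInt8}   # one selector byte per element
end

struct BitTagged{T}
    values::Vector{T}
    isnull::BitVector       # one bit per element, packed into UInt64 chunks
end

# Byte layout: a plain load and compare.
isnullat(a::ByteTagged, i) = a.isnull[i] != 0x00
# Bit layout: BitVector indexing has to locate the chunk, then shift and mask.
isnullat(a::BitTagged, i) = a.isnull[i]

n = 1_000
a = ByteTagged(zeros(Int, n), zeros(UInt8, n))
b = BitTagged(zeros(Int, n), falses(n))
a.isnull[3] = 0x01
b.isnull[3] = true

sizeof(a.isnull)          # 1000 bytes of flags
sizeof(b.isnull.chunks)   # 128 bytes of flags (16 UInt64 chunks)
```

The bit layout needs roughly an eighth of the flag storage, at the price of the extra shift-and-mask work on every lookup.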

There are codegen Union optimizations that have already been merged to do the call-site method splitting @malmaud mentioned above; unfortunately, there have been issues with inlining after the method splitting, so the optimizations aren’t fully realized (I believe this is fully fixed in master though). The additional optimizations being worked on in Jameson's and my PRs allow Union{T, S, ...} to be “unboxed” when T and S (and ...) are all isbits types. This affects struct fields and array elements, allowing a more efficient memory layout and faster access and assignment.
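A small sketch of what this covers from the user's side. It assumes Julia 1.x, where `Void` became `Nothing` and Nulls.jl's `Null` became `Missing` in Base; on 0.6 the names differ and the type check is spelled `isbits(T)`. Nothing here inspects the actual memory layout. It only shows which element types qualify and that the modifying operations and higher-dimensional arrays mentioned earlier in the thread work on such arrays.

```julia
# Assumes Julia 1.x naming: `Missing` plays the role of Nulls.jl's `Null`.
isbitstype(Int)       # true
isbitstype(Missing)   # true
isbitstype(String)    # false, so Union{String, Missing} is not an isbits Union

# A vector whose elements are an isbits Union.
v = Union{Int, Missing}[1, missing, 3]
push!(v, missing)     # v is now [1, missing, 3, missing]
append!(v, [4, 5])    # v is now [1, missing, 3, missing, 4, 5]
deleteat!(v, 2)       # v is now [1, 3, missing, 4, 5]

# The same element type in a 2-dimensional array.
m = Matrix{Union{Float64, Missing}}(missing, 2, 2)
m[1, 1] = 0.5
```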

Oh great. Might you have an example, using @code_llvm or the like, that shows call-site method splitting in action?

NM, I think I have one:

f(x::Int) = x + 1
f(::Void) = nothing
g(x) = x > 0 ? f(x) : f(nothing)
@code_llvm g(0)
define { i8**, i8 } @julia_g_61178([8 x i8]* noalias nocapture, i64) #0 !dbg !5 {
top:
  %2 = icmp slt i64 %1, 1
  br i1 %2, label %L3, label %if

if:                                               ; preds = %top
  %3 = add i64 %1, 1
  %4 = bitcast [8 x i8]* %0 to i8**
  %5 = insertvalue { i8**, i8 } undef, i8** %4, 0
  %6 = insertvalue { i8**, i8 } %5, i8 1, 1
  %7 = bitcast [8 x i8]* %0 to i64*
  store i64 %3, i64* %7, align 1
  br label %L3

L3:                                               ; preds = %if, %top
  %merge = phi { i8**, i8 } [ { i8** null, i8 2 }, %top ], [ %6, %if ]
  ret { i8**, i8 } %merge
}

@nalimilan, is there an update on this? I’ve noticed that all of the DataTables commits have been merged into DataFrames and that DataFrames now uses Nulls instead of DataArrays. This is definitely a welcome update. I’d like to start migrating my packages from DataTables to DataFrames in anticipation of whatever the world will look like in 1.0, so I wanted to get some idea of how stable current plans are.

Also, @quinnj has overhauled DataStreams, and I’d like to get things up and running with the new version. Is there any willingness to start porting things over? (I can definitely help with this.)

Thanks to everyone for their ongoing efforts.

Yes, I wanted to post a short update but you beat me to it. You summarized the situation pretty well. Users and package authors are encouraged to test the master branch of DataFrames, which is now based on Nulls.jl and reflects the new API. The branch should be fully usable, but it will not be as fast on Julia 0.6 as on 0.7. Please report any issues on GitHub. Benchmarks on Julia 0.7 (comparing performance with the DataArrays-based DataFrames) would be particularly useful.

The main limitation right now is that most packages depending on DataFrames have not yet been updated (e.g. Query and RCall; there’s an open pull request for StatsModels). But DataStreams, Feather and CSV have already been ported. Of course, new releases of dependent packages cannot be tagged before the new DataFrames itself has been tagged, so until we’re ready to make a release that code will have to live in package branches.

Great. Do we know yet whether the promised improvements to union types have already been implemented in 0.7? I haven’t seen anything about it in NEWS.

Yes, the Union struct/array optimizations have landed in Base. There is still an open issue to improve inference (currently planned and in progress), as well as a SIMD isbits Union array issue that will hopefully be helped by the first issue and some other planned codegen cleanups. Both are on track for 1.0.

I’ve been letting the dust settle a bit on the isbits Union optimizations before writing a NEWS entry (a couple of bug reports have come in, and all have been fixed). I’m also planning a more detailed blog post on the optimizations once all the issues have been cleaned up or finished.

I just opened this PR, "Add support for Nulls.jl based DataFrames.jl" (Pull Request #60 in queryverse/IterableTables.jl), which should add support for the new Nulls.jl based DataFrames.jl to IterableTables.jl (and thereby to Query.jl and all the other packages in that ecosystem).

Though note that Query.jl and the rest of the ecosystem around it will not move to the Union{T,Null} approach, but will continue to use DataValues.jl for representing missing values. The whole Union{T,Null} design has some fundamental issues that make it more or less unusable within Query.jl, as far as I can tell at this point (I’d be happy to be corrected on this point). So the situation we’ll have is that Query.jl and the iterable tables universe will be able to interop with the Nulls.jl based DataFrames.jl, but will use a different missing value story both under the hood and within any query users write.

:( Do you have a writeup explaining the issues?

I would imagine the discussion here: https://github.com/JuliaData/Nulls.jl/issues/6

That is the one.

I think I more or less understand why this won’t work for the Query.jl design: to produce columns efficiently out of an iterator of named tuples, you need to infer the return type, which is harder with Union{T, Null}. However, I’m still a bit confused about the policy when collecting into a data structure. The following puzzles me:

julia> using DataFrames, IterableTables, IndexedTables

julia> df = DataFrame(x = [1, 2], y = rand(2))
2×2 DataFrames.DataFrame
│ Row │ x │ y         │
├─────┼───┼───────────┤
│ 1   │ 1 │ 0.0905943 │
│ 2   │ 2 │ 0.220108  │

julia> t = IndexedTable(df)
x │ y
──┼──────────
1 │ 0.0905943
2 │ 0.220108

julia> typeof(t[1])
NamedTuples._NT_y{DataValues.DataValue{Float64}}

Did we end up with DataValues because those are the standard way of representing missing data with IndexedTables, or is it an artefact of having converted via IterableTables?

Artifact of converting via IterableTables

Yep, that is right. IndexedTables.jl doesn’t have a default way to represent missing values as far as I can tell, so for now I’m just creating Vector{DataValue{T}} columns when you convert a source that has columns that can hold missing data. Of course for source columns that are just Arrays you will just get an Array in the sink. For all the other sinks I respect whatever that sink uses for missing data, i.e. for DataFrame you’ll get DataVectors, for DataTable you’ll get NullableVectors etc. At the end of the day this is a decision each sink needs to make, i.e. how that sink wants to represent missing data. The iterable tables interface in itself doesn’t mandate/prescribe anything on that front, essentially it just interops with whatever choice the sink makes.

If the maintainers of IndexedTables.jl at some point pick a standard way to represent missing data in that package, it should be super easy to adjust things so that when you convert from an iterable table with missing data, you’ll get that standard data structure.

I’m sure this is already in progress, but it would be really nice if the new versions of DataFrames, DataStreams, CSV, Feather, CategoricalArrays and whatever else got tagged. Right now they are blocking each other and the package manager is flipping its shit.

That’s on purpose. The new DataStreams/CSV/Feather and CategoricalArrays are incompatible with the current DataFrames, so upper bounds have been added. You need to choose either to stay on the released version of DataFrames, and the package manager will automatically use older releases of the dependencies; or to use DataFrames from git master, and update all dependencies to their most recent releases.

This situation will of course change as soon as we tag a new DataFrames release. But then it will be incompatible with all packages depending on the current DataFrames, so a similar problem will appear until they are all updated.

You can still support more than two types with a bitarray, and even with a more compressed form:

I was thinking: is a SparseArray worth it if, say, most values are not Null? It seems it could be slow, but a hybrid approach could work.

For example, a bitarray with one bit for, say, every 16 values in your main array (just an idea, something like a cache line's worth), used to check whether that whole group holds your likely type (no Nulls or other types).
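Here is one possible reading of that idea as a sketch. Everything in it is hypothetical (the `BlockFlagged` type, the `BLOCK` constant, and the `sumnonnull` helper are invented for illustration): per-element selector bytes are kept, and one summary bit per block of 16 elements records whether the block contains any null, so null-free blocks can skip the per-element check.

```julia
# Hypothetical sketch of a block-summary bitarray; not an existing package API.
const BLOCK = 16

struct BlockFlagged{T}
    values::Vector{T}
    isnull::Vector{UInt8}     # per-element selector byte
    blockhasnull::BitVector   # one bit per block of BLOCK elements
end

# Build the per-block summary bits from the per-element flags.
function BlockFlagged(values::Vector{T}, isnull::Vector{UInt8}) where {T}
    flags = falses(cld(length(values), BLOCK))
    for i in eachindex(isnull)
        if isnull[i] != 0x00
            flags[cld(i, BLOCK)] = true
        end
    end
    return BlockFlagged{T}(values, isnull, flags)
end

# Sum the non-null values, checking individual flags only in blocks that need it.
function sumnonnull(a::BlockFlagged{T}) where {T}
    total = zero(T)
    for b in eachindex(a.blockhasnull)
        lo = (b - 1) * BLOCK + 1
        hi = min(b * BLOCK, length(a.values))
        if a.blockhasnull[b]
            for i in lo:hi
                a.isnull[i] == 0x00 && (total += a.values[i])
            end
        else
            for i in lo:hi   # fast path: no null in this block
                total += a.values[i]
            end
        end
    end
    return total
end

flags = zeros(UInt8, 40)
flags[5] = 0x01
flags[33] = 0x01
a = BlockFlagged(collect(1:40), flags)
sumnonnull(a)   # 782, i.e. sum(1:40) minus 5 and 33
```

Whether the fast path pays off would depend on how rare nulls are and on how well the compiler vectorizes the null-free loop, so this would need benchmarking.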

Is there an update on when DataFrames v0.11 will be released? I suspect many packages would like to start migrating to Julia 0.7, but have DataFrames/StatsModels dependencies. It might take a bit of work in StatsModels, but we're waiting for a new DataFrames version to get tagged.