Various constructors and equality for DataFrame

question

#1

I ran into this problem while writing tests for a package function that builds data frames using the

DataFrame(columns::AbstractArray{T<:Any,1}, cnames::AbstractArray{Symbol,1})

constructor. Consider this MWE:

using DataFrames
a = collect(1:5)
b = string.(a)
df1 = DataFrame([a,b], [:a,:b])
df2 = DataFrame(a=a,b=b)

First, df1 prints funny:

julia> df1
5×2 DataFrames.DataFrame
│ Row │ a │ b   │
├─────┼───┼─────┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │
│ 4   │ 4 │ 4 │
│ 5   │ 5 │ 5 │

Second,

julia> df1 == df2
false

I think the issue is that df2 holds NullableArrays while df1 holds plain vanilla Arrays. So should I avoid the constructor above, always pass NullableArrays as arguments, or use a different function when I want to compare for equality?


#2

If you use the first form of the constructor, you’re responsible for choosing the appropriate column type. This is likely to change at some point though, in which case you’ll get a standard Array in both cases (see this issue).

The printing issue was just fixed.


#3

Thank you — I read the issues, but I am still not sure what the “appropriate column type” is until #1119 is resolved. Would using the constructor as

DataFrame(map(NullableArray, [a,b]), [:a,:b])

be the recommended solution for now?


#4

The most appropriate/default column type is DataArray for DataFrames 0.8.x and NullableArray for git master. If you just want to use this type, then use the keyword argument constructor. The other one is only useful when you really want to preserve the original type, which IIUC isn’t your case at all.

We also need to solve the question of whether == and isequal should consider NullableArray and Array equal when they have the same contents. See this issue and this one.


#5

Thanks! So if the column names and values are the result of some other computation, and neither the names nor their number is known in advance, is the recommended constructor something like

DataFrame(; [Pair(key_column...) for key_column in zip(keys,columns)]...)

?


#6

Or even just DataFrame(; zip(keys, columns)...).
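
For anyone wondering why the splat works without building Pairs first: keyword splatting accepts any iterable of (Symbol, value) tuples, so zip can be passed directly. Below is a minimal sketch in plain Julia that does not depend on DataFrames. The function collect_columns is a hypothetical stand-in for a keyword-argument constructor, and colnames is used instead of keys only to avoid shadowing Base.keys. It was written with Julia 1.x in mind; behaviour on older versions is an assumption.

```julia
# Hypothetical stand-in for a keyword-argument constructor such as
# DataFrame(; kwargs...). It only records the names and columns it
# receives, in the order they were passed.
function collect_columns(; kwargs...)
    ns = Symbol[]
    cs = Any[]
    for (name, col) in kwargs
        push!(ns, name)
        push!(cs, col)
    end
    return ns, cs
end

colnames = [:a, :b]                      # names computed at run time
columns = [collect(1:5), string.(1:5)]   # columns computed at run time

# Each (name, column) tuple produced by zip becomes one keyword argument.
got_names, got_cols = collect_columns(; zip(colnames, columns)...)
```

Here got_names and got_cols come back in the same order as colnames and columns, which is what the keyword constructor relies on to keep column order.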