Column types in DataFrames vs. InMemoryDatasets

monopolynomial · March 25, 2022, 12:28am

It’s very interesting, because it seems the following point is more about DataFrames.jl after all.

Because:

using InMemoryDatasets
using NamedArrays

initial = NamedArray([5748.61], ["AUT-A01"])
df_initial = Dataset(:sector => names(initial,1), :value => initial[:,1])
df_weights = Dataset(sector = ["AUT-A01","AUT-A01","AUT-A01","AUT-A01","AUT-A01"],
    new = ["a","b","c","d","e"],
    weights = [0.2, 0.2, 0.4, 0.1, 0.1]
)

test = leftjoin(df_weights, df_initial, on = :sector)
5×4 Dataset
 Row │ sector    new       weights   value    
     │ identity  identity  identity  identity 
     │ String?   String?   Float64?  Float64? 
─────┼────────────────────────────────────────
   1 │ AUT-A01   a              0.2   5748.61
   2 │ AUT-A01   b              0.2   5748.61
   3 │ AUT-A01   c              0.4   5748.61
   4 │ AUT-A01   d              0.1   5748.61
   5 │ AUT-A01   e              0.1   5748.61

bkamins · March 25, 2022, 6:59am

I am not sure what you mean by this. What I am saying there is the following (using your code above):

julia> typeof(df_initial.value)
DatasetColumn{Dataset, NamedVector{Union{Missing, Float64}, Vector{Union{Missing, Float64}}, Tuple{OrderedCollections.OrderedDict{String, Int64}}}}

julia> df_initial.value isa AbstractVector
false

So, as you can see, when you get a column from a Dataset it is a custom type that is not an AbstractVector. In DataFrames.jl, by contrast, if you get a column from a DataFrame you get the kind of vector you used to create the column, without any wrappers.
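For comparison, here is a minimal sketch of the DataFrames.jl side of this check. The column names and values are reused from the code above purely for illustration, and the snippet assumes a reasonably recent DataFrames.jl release.

```julia
using DataFrames

v = [5748.61]
df = DataFrame(:sector => ["AUT-A01"], :value => v)

# The column is a plain Vector{Float64}, with no wrapper type around it.
col_type = typeof(df.value)
is_vector = df.value isa AbstractVector

# By default the constructor copies its inputs (copycols = true), so the
# column has the same type as `v` but is a distinct object.
same_object_default = df.value === v

# With copycols = false the very same vector is stored in the data frame.
df_nocopy = DataFrame(:sector => ["AUT-A01"], :value => v; copycols = false)
same_object_nocopy = df_nocopy.value === v
```

Here `col_type` is `Vector{Float64}` and `is_vector` is `true`; the two `===` checks differ only because of the default copying behaviour of the constructor.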

monopolynomial · March 28, 2022, 10:58pm

What I meant is that you claimed DataFrames is designed to store anything, but your solution to the OP's problem is to change the column to a vector because DataFrames cannot handle a NamedArray here. Ironically, InMemoryDatasets works fine for the OP.

bkamins · March 29, 2022, 6:57am

"is to change the column to vector because dataframes cannot handle namedarray here"

I think of this the following way:

DataFrames.jl is built around the idea of respecting composability across Julia packages. This means that if a user passes a NamedArray as a column, DataFrames.jl will respect this choice.

In particular, using a NamedArray column implies that you do not want to duplicate rows in such a column (as names must be unique in a NamedArray). Therefore DataFrames.jl respects this choice and enforces this constraint.

This might seem strange at first but it has many benefits in the long run:

- if you use special vectors in DataFrames.jl, like CategoricalArrays.jl, PooledArrays.jl, MappedArrays.jl, IndexedArrays.jl, NamedArrays.jl, everything will work just fine out-of-the-box; DataFrames.jl will respect your choices and you will have all the benefits that these special columns give you (of course this means that you must understand what you put into a data frame and why you did this);
- DataFrames.jl behaves predictably - if you know Julia Base you can expect that DataFrames.jl will follow its rules; again - this is quite important when you build larger pipelines/projects, where DataFrames.jl is only a part of the whole project.
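The uniqueness constraint on NamedArray names mentioned above can be checked directly in NamedArrays.jl, independently of DataFrames.jl. The following is only a sketch: the values and names are made up for illustration, and it deliberately records just whether construction fails, without assuming a particular exception type or message.

```julia
using NamedArrays

# Unique names are accepted, and elements can be looked up by name.
ok = NamedArray([1.0, 2.0], ["a", "b"])
first_value = ok["a"]

# Repeated names are expected to be rejected by the constructor.
rejected = try
    NamedArray([1.0, 2.0], ["a", "a"])
    false
catch
    true
end
```

A join that repeats a row of such a column would need to repeat its name as well, which is why a NamedArray column and row duplication do not fit together.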

In short, as I have written in the title of one of my tutorials - DataFrames.jl is designed to be a support package. Most likely when you use Julia the core of your work is ML/optimization/simulation/you name it. DataFrames.jl wants to be as non-intrusive as possible to your core workflow, while allowing you to simply do data pre/post processing.

This is a different philosophy from e.g. R/Python, where for many users a data frame is the core of their work (i.e. everything revolves around processing data frames). And this approach (to be a lightweight wrapper around vectors with names) influenced the DataFrames.jl design a lot.

mostafa1342004 · March 29, 2022, 7:03am

This was a rash comment from @bkamins. Even if a column has NaN, DataFrames will complain:

ArgumentError: currently for numeric values NaN and -0.0 in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

bkamins · March 29, 2022, 7:09am

@mostafa1342004 - how would you expect -0.0 and NaN to be handled in joins? This behavior is intentional to safeguard users against unexpected and incorrect results.
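One likely reason this question has no obvious answer is that Julia Base itself offers two notions of equality that disagree on exactly these values. The sketch below uses only Base functions; whether a join should follow `==` or `isequal` semantics is the design question, and nothing here is taken from the DataFrames.jl implementation.

```julia
# `==` follows IEEE 754 comparison rules; `isequal` is the notion of
# equality used for hashing (e.g. Dict keys).
nan_eq       = NaN == NaN            # false: NaN is not == to itself
nan_isequal  = isequal(NaN, NaN)     # true
zero_eq      = -0.0 == 0.0           # true
zero_isequal = isequal(-0.0, 0.0)    # false: the sign bit is distinguished
```

Under `==`, a row keyed by NaN would never match anything, including itself; under `isequal`, rows keyed by 0.0 and -0.0 would silently fail to match each other.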

mbauman · March 29, 2022, 2:55pm

I’ve split this thread as it’s unrelated to the question that was asked in DataFrame: Cannot have duplicated names for indices. I understand that InMemoryDatasets uses a different philosophy than DataFrames here — and that’s great — but let’s please not make usage questions about either of these packages a battleground on their respective design philosophies.
