Column types in DataFrames vs. InMemoryDatasets

It’s very interesting, because it seems the following point is more about DataFrames.jl after all.


using InMemoryDatasets
using NamedArrays

initial = NamedArray([5748.61], ["AUT-A01"])
df_initial = Dataset(:sector => names(initial,1), :value => initial[:,1])
df_weights = Dataset(sector = ["AUT-A01","AUT-A01","AUT-A01","AUT-A01","AUT-A01"],
    new = ["a","b","c","d","e"],
    weights = [0.2, 0.2, 0.4, 0.1, 0.1])

test = leftjoin(df_weights, df_initial, on = :sector)
5×4 Dataset
 Row │ sector    new       weights   value    
     │ identity  identity  identity  identity 
     │ String?   String?   Float64?  Float64? 
   1 │ AUT-A01   a              0.2   5748.61
   2 │ AUT-A01   b              0.2   5748.61
   3 │ AUT-A01   c              0.4   5748.61
   4 │ AUT-A01   d              0.1   5748.61
   5 │ AUT-A01   e              0.1   5748.61

I am not sure what you mean by this. What I am saying there is the following (using your code above):

julia> typeof(df_initial.value)
DatasetColumn{Dataset, NamedVector{Union{Missing, Float64}, Vector{Union{Missing, Float64}}, Tuple{OrderedCollections.OrderedDict{String, Int64}}}}

julia> df_initial.value isa AbstractVector
false

So, as you can see, when you get a column from a Dataset, it is a custom type that is not an AbstractVector. In DataFrames.jl, by contrast, getting a column from a DataFrame returns the vector you used to create that column, without any wrappers.
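For comparison, here is a minimal sketch of the DataFrames.jl side. It uses a plain Vector rather than a NamedArray, and it assumes `copycols = false`, because by default the `DataFrame` constructor stores a copy of each column rather than the original object:

```julia
using DataFrames

v = [5748.61]

# copycols = false asks DataFrames.jl to store `v` itself instead of a copy
df = DataFrame(sector = ["AUT-A01"], value = v; copycols = false)

is_vector   = df.value isa AbstractVector   # the column is an ordinary vector
same_object = df.value === v                # no wrapper, the very same object
col_type    = typeof(df.value)              # Vector{Float64}
```

With the default `copycols = true`, the column has the same type and contents as `v` but is a separate copy.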


What I meant is that you claimed DataFrames is designed to store anything, but your solution to the OP's problem is to change the column to a vector because DataFrames cannot handle a NamedArray here. Ironically, InMemoryDatasets works fine for the OP.

is to change the column to vector because dataframes cannot handle namedarray here

I think of this the following way:

DataFrames.jl is built around the idea of respecting composability across Julia packages. This means that if a user creates a column as a NamedArray, DataFrames.jl will respect this choice.

In particular, using a NamedArray column implies that you do not want to duplicate rows in such a column (as this is a property of the NamedArray type). Therefore DataFrames.jl respects this choice and enforces this constraint.

This might seem strange at first but it has many benefits in the long run:

  • if you use special vectors in DataFrames.jl, like CategoricalArrays.jl, PooledArrays.jl, MappedArrays.jl, IndexedArrays.jl, NamedArrays.jl, everything will work just fine out-of-the-box; DataFrames.jl will respect your choices and you will have all the benefits that these special columns give you (of course this means that you must understand what you put into a data frame and why you did this);
  • DataFrames.jl behaves predictably - if you know Julia Base you can expect that DataFrames.jl will follow its rules; again - this is quite important when you build larger pipelines/projects, where DataFrames.jl is only a part of the whole project.
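As a small illustration of the first point, here is a sketch using PooledArrays.jl (the column name `x` and the values are made up for the example). The column keeps its special type after being stored in a data frame:

```julia
using DataFrames
using PooledArrays

# A pooled vector stores each distinct value once plus integer references
pooled = PooledArray(["a", "b", "a", "a"])

df = DataFrame(x = pooled)

# The column is still a PooledArray, not a plain Vector{String}
still_pooled = df.x isa PooledArray
```

The same idea applies to the other special vector types listed above, although each of them carries its own constraints.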

In short, as I have written in the title of one of my tutorials - DataFrames.jl is designed to be a support package. Most likely when you use Julia the core of your work is ML/optimization/simulation/you name it. DataFrames.jl wants to be as non-intrusive as possible to your core workflow, while allowing you to simply do data pre/post processing.

This is a different philosophy from, e.g., R/Python, where for many users a data frame is the core of their work (i.e. everything revolves around processing data frames). And this approach (to be a lightweight wrapper around vectors with names) influenced the DataFrames.jl design a lot.


This was a rash comment from @bkamins. Even if a column has NaN, DataFrames will complain:

ArgumentError: currently for numeric values NaN and -0.0 in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.

@mostafa1342004 - how would you expect -0.0 and NaN to be handled in joins? This behavior is intentional to safeguard users against unexpected and incorrect results.
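To make the ambiguity concrete, here is a sketch using only Julia Base (no DataFrames.jl calls; the variable names are just for illustration). The two standard notions of equality disagree on exactly these values, so a join would have to pick one silently:

```julia
# `==` follows IEEE 754 rules, while `isequal` is the notion used for hashing
nan_eq       = NaN == NaN            # false
nan_isequal  = isequal(NaN, NaN)     # true

zero_eq      = -0.0 == 0.0           # true
zero_isequal = isequal(-0.0, 0.0)    # false
```

Depending on which rule is used, rows keyed by NaN either match each other or never match, and rows keyed by -0.0 and 0.0 either match or do not.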


I’ve split this thread as it’s unrelated to the question that was asked in DataFrame: Cannot have duplicated names for indices. I understand that InMemoryDatasets uses a different philosophy than DataFrames here — and that’s great — but let’s please not make usage questions about either of these packages a battleground on their respective design philosophies.