DataFrame parses differently if data is passed in columns vs. as an array

Is it expected behavior that the DataFrame constructor returns different results depending on whether the data is passed in by column or as a matrix? In the former case the type of each column is inferred more precisely. In the latter case no type inference appears to happen.

I found an example from 2017 of creating a DataFrame from a matrix. The types were properly inferred in that example. When I execute the example on Julia 1.6.1 the column types are all Any.

Here’s a MWE. In the first case the column types are correctly inferred: String for the first column, Float64 for the second, and Float64? (that is, Union{Missing, Float64}) for the last.

In the second case, where the DataFrame is constructed from a matrix, the type of all columns is Any.

julia> a = [
           :Name :Radius :SemiDiameter
           "this" 1.0      4.0
           "that" 2.0      missing
       ]
3×3 Matrix{Any}:
 :Name    :Radius   :SemiDiameter
 "this"  1.0       4.0
 "that"  2.0        missing

julia> df1 = DataFrame(
              Name = ["this", "that"],
              Radius = [1.0, 2.0],
              SemiDiameter = [4.0, missing]
              )
2×3 DataFrame
 Row │ Name    Radius   SemiDiameter
     │ String  Float64  Float64?
─────┼───────────────────────────────
   1 │ this        1.0           4.0
   2 │ that        2.0     missing

julia> df2 = DataFrame(a[2:end,:],Symbol.(a[1,:]))
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼────────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing


julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7702P 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
  JULIA_NUM_THREADS = 64
  JULIA_EDITOR = code

In df1 the vectors you pass already have narrow element types. If you pass vectors with eltype Any instead (which is what you get from a Matrix{Any}), the columns also come out as Any:

julia> DataFrame(Name = Any["this", "that"],
                 Radius = Any[1.0, 2.0],
                 SemiDiameter = Any[4.0, missing])
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼────────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing

The reason is that the DataFrame constructor now uses the eltype of the source when constructing columns. I find this behavior natural, because narrowing the types automatically could cause problems (I will show one below). First, here is how to narrow the eltype. It is easy: broadcast the identity function over the data frame:

julia> df = DataFrame(Name = Any["this", "that"],
                 Radius = Any[1.0, 2.0],
                 SemiDiameter = Any[4.0, missing])
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼────────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing

julia> identity.(df)
2×3 DataFrame
 Row │ Name    Radius   SemiDiameter
     │ String  Float64  Float64?
─────┼───────────────────────────────
   1 │ this        1.0           4.0
   2 │ that        2.0     missing
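
Applied to the matrix from the original question, the same trick might look like the sketch below. It assumes DataFrames.jl 1.x is installed; the names df_any and df_narrow are only illustrative.

```julia
using DataFrames  # assumption: DataFrames.jl 1.x is available

# The matrix from the question; the first row holds the column names.
a = [
    :Name  :Radius :SemiDiameter
    "this" 1.0     4.0
    "that" 2.0     missing
]

# All columns get eltype Any, because the source is a Matrix{Any}.
df_any = DataFrame(a[2:end, :], Symbol.(a[1, :]))

# Broadcasting identity rebuilds each column with a narrower eltype.
df_narrow = identity.(df_any)

# Inspect the resulting column element types.
println(eltype.(eachcol(df_narrow)))
```

Note that this returns a new data frame; df_any itself keeps its Any columns.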

Now for an example of the problem that narrowing can cause:

julia> df = DataFrame([1 2; missing 3], :auto)
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64?
─────┼─────────────────
   1 │       1       2
   2 │ missing       3

julia> df2 = identity.(df) # narrow down eltype
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64
─────┼────────────────
   1 │       1      2
   2 │ missing      3

julia> df2[2, 2] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64

You get an error because the eltype of x2 is now too narrow, although you might have assumed that missing would be allowed, since the source container allowed it in every column. The same operation on df works:

julia> df[2, 2] = missing
missing

julia> df
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1        2
   2 │ missing  missing
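
The same trade-off can be reproduced with plain vectors, without DataFrames. The sketch below uses only Base Julia; the variable names are illustrative. It narrows a vector, shows that assigning missing then fails, and widens the eltype again with convert. (For a data frame column, DataFrames.jl also offers allowmissing! for this purpose.)

```julia
# Narrow a Vector{Any} of integers; the result has eltype Int.
v = identity.(Any[2, 3])

# Assigning missing into the narrowed vector throws a MethodError,
# just like df2[2, 2] = missing in the example above.
failed = try
    v[2] = missing
    false
catch err
    err isa MethodError
end

# Widening the eltype explicitly makes room for missing again.
w = convert(Vector{Union{Missing, Int}}, v)
w[2] = missing

println(failed)      # whether the assignment into v was rejected
println(eltype(w))   # the widened element type
```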

Thank you, that solves the problem.

I have never seen the identity function used in this way and it’s mysterious to me how it works. Could you explain the mechanism?

Broadcasted identity, comprehensions, and map dynamically determine the eltype of the resulting container by adjusting it to the result of the transformation:

julia> x = Any[1, 2, 3]
3-element Vector{Any}:
 1
 2
 3

julia> identity.(x)
3-element Vector{Int64}:
 1
 2
 3

julia> [v for v in x]
3-element Vector{Int64}:
 1
 2
 3

julia> map(identity, x)
3-element Vector{Int64}:
 1
 2
 3

Indexing behaves differently: it reuses the eltype of the source:

julia> x[1:2]
2-element Vector{Any}:
 1
 2
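
To see how far the narrowing goes, it may help to try inputs with mixed element types. The sketch below uses only Base Julia, with illustrative variable names. The resulting eltype is the narrowest common type of the values that are actually present, and the values themselves are not converted.

```julia
# All elements are Int, so the result is a Vector{Int}.
ints = identity.(Any[1, 2, 3])

# Int and Missing give a small Union eltype.
maybe = identity.(Any[1, missing])

# Int and Float64 only share the abstract supertype Real;
# the 1 stays an Int and is not promoted to 1.0.
reals = identity.(Any[1, 2.0])

# Int and String have nothing in common below Any.
mixed = identity.(Any[1, "a"])

for x in (ints, maybe, reals, mixed)
    println(typeof(x))
end
```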