Is it expected behavior that the DataFrame constructor returns different results depending on whether the data is passed in by column or as a matrix? In the former case the type of each column is inferred precisely. In the latter case no type inference appears to happen.
I found an example from 2017 of creating a DataFrame from a matrix. The types were properly inferred in that example. When I execute the example on Julia 1.6.1 the column types are all Any.
Here's an MWE. In the first case the type of the first column is correctly inferred to be String, the second Float64, and the last Float64?.
In the second case, where the DataFrame is constructed from a matrix, the type of all columns is Any.
julia> a = [
:Name :Radius :SemiDiameter
"this" 1.0 4.0
"that" 2.0 missing
]
3×3 Matrix{Any}:
:Name :Radius :SemiDiameter
"this" 1.0 4.0
"that" 2.0 missing
julia> df1 = DataFrame(
Name = ["this", "that"],
Radius = [1.0, 2.0],
SemiDiameter = [4.0, missing]
)
2×3 DataFrame
 Row │ Name    Radius   SemiDiameter
     │ String  Float64  Float64?
─────┼──────────────────────────────
   1 │ this        1.0           4.0
   2 │ that        2.0       missing
julia> df2 = DataFrame(a[2:end,:],Symbol.(a[1,:]))
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼───────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC 7702P 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
JULIA_NUM_THREADS = 64
JULIA_EDITOR = code
In df1 you are already providing vectors with concrete types. If you did the following (as you effectively do with a Matrix), you would also get Any as the eltype:
julia> DataFrame(Name = Any["this", "that"],
Radius = Any[1.0, 2.0],
SemiDiameter = Any[4.0, missing])
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼───────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing
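The eltype the constructor sees can be checked without DataFrames at all. Here is a minimal sketch using only Base Julia and the matrix a from the question; the names col_from_matrix and col_literal are made up for illustration:

```julia
# Matrix from the question: the mixed entries force the element type to Any.
a = [
    :Name  :Radius  :SemiDiameter
    "this" 1.0      4.0
    "that" 2.0      missing
]

# Slicing a column out of a Matrix{Any} keeps the Any element type.
col_from_matrix = a[2:end, 2]

# A vector literal of floats gets a concrete element type up front.
col_literal = [1.0, 2.0]

println(eltype(a))
println(typeof(col_from_matrix))
println(typeof(col_literal))
```

So the two constructor calls in the question start from sources with different eltypes, even though the values are the same.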
The reason is that the DataFrame constructor now uses the eltype of the source to construct the columns. I find this behavior natural, because narrowing the types automatically could cause problems (I will show them later). First let me show how to narrow down the eltype. It is quite easy: broadcast the identity function over the data frame:
julia> df = DataFrame(Name = Any["this", "that"],
Radius = Any[1.0, 2.0],
SemiDiameter = Any[4.0, missing])
2×3 DataFrame
 Row │ Name  Radius  SemiDiameter
     │ Any   Any     Any
─────┼───────────────────────────
   1 │ this  1.0     4.0
   2 │ that  2.0     missing
julia> identity.(df)
2×3 DataFrame
 Row │ Name    Radius   SemiDiameter
     │ String  Float64  Float64?
─────┼──────────────────────────────
   1 │ this        1.0           4.0
   2 │ that        2.0       missing
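The same narrowing can also be applied to the matrix columns before the data frame is built. The following is only a sketch in Base Julia, reusing the matrix a from the question; header, body, and columns are illustrative names, and the final DataFrame call is left as a comment because it needs the DataFrames package:

```julia
a = [
    :Name  :Radius  :SemiDiameter
    "this" 1.0      4.0
    "that" 2.0      missing
]

header = Symbol.(a[1, :])   # column names from the first row
body = a[2:end, :]          # data rows, still a Matrix{Any}

# Narrow each column on its own by broadcasting identity over it.
columns = [identity.(col) for col in eachcol(body)]

for (name, col) in zip(header, columns)
    println(name, " => ", eltype(col))
end

# With DataFrames loaded, DataFrame(columns, header) should keep these eltypes.
```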
Now an example of the problem that narrowing can cause:
julia> df = DataFrame([1 2; missing 3], :auto)
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64?
─────┼────────────────
   1 │       1       2
   2 │ missing       3
julia> df2 = identity.(df) # narrow down eltype
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64
─────┼───────────────
   1 │       1      2
   2 │ missing      3
julia> df2[2, 2] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
You get an error because the eltype of x2 is now too narrow, although you might have assumed that missing would be allowed, since the source container allowed it for all columns. The same operation on df works:
julia> df[2, 2] = missing
missing
julia> df
2×2 DataFrame
 Row │ x1       x2
     │ Int64?   Int64?
─────┼─────────────────
   1 │       1        2
   2 │ missing  missing
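The same trade-off can be reproduced with a plain vector, and a narrowed column can be widened again when missing values need to be stored. This is a sketch in Base Julia only; col, narrowed, and widened are illustrative names, and the explicit convert is one possible way to widen, not necessarily the recommended one for data frames:

```julia
col = Union{Missing, Int}[2, 3]   # source column allows missing
narrowed = identity.(col)         # eltype becomes Int, missing no longer fits

# Assigning missing into the narrowed vector fails, as in the df2 example.
err = try
    narrowed[2] = missing
    nothing
catch e
    e
end
println(err isa MethodError)

# Converting back to the wider eltype makes the assignment legal again.
widened = convert(Vector{Union{Missing, Int}}, narrowed)
widened[2] = missing
println(widened)
```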
Thank you, that solves the problem.
I have never seen the identity function used in this way and it's mysterious to me how it works. Could you explain the mechanism?
Broadcasted identity, comprehensions, and map dynamically determine the eltype of the resulting container by adjusting it to the results of the transformation:
julia> x = Any[1, 2, 3]
3-element Vector{Any}:
1
2
3
julia> identity.(x)
3-element Vector{Int64}:
1
2
3
julia> [v for v in x]
3-element Vector{Int64}:
1
2
3
julia> map(identity, x)
3-element Vector{Int64}:
1
2
3
Indexing behaves differently: it reuses the eltype of the source:
julia> x[1:2]
2-element Vector{Any}:
1
2
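To see how far the adjustment goes, it helps to try inputs whose values do not all share one concrete type. A small sketch in Base Julia (the variable names are illustrative):

```julia
all_ints     = identity.(Any[1, 2, 3])    # every value is an Int
with_missing = identity.(Any[1, missing]) # Int plus missing
int_and_real = identity.(Any[1, 2.0])     # Int and Float64 share the supertype Real
unrelated    = identity.(Any[1, "a"])     # no useful common type

for v in (all_ints, with_missing, int_and_real, unrelated)
    println(typeof(v))
end
```

The resulting eltype is only as narrow as the values allow: a concrete type, a small Union with Missing, an abstract supertype, or Any again when nothing narrower fits.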