DataFrames type assignment inconsistency


#1

I ran into this strange behavior and I’m not sure if it’s a problem with DataFrames or with my understanding of how it works.

When I generate my DataFrame from an integer array and a float array, I get two float columns. But if I generate it from an integer array, a float array and a string array, I get an integer, a float and a string column.

In the first case, DataFrames appears to change the element type of the data assigned to the first column!

I’m still investigating, but I’m wondering if someone understands this behavior.

Thanks,

julia> x=[1,2,3,4,5]
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> df = DataFrame([x,[1.0,2.0,3.0,4.0,5.0]], [:A, :B])
5×2 DataFrame
│ Row │ A       │ B       │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │
│ 2   │ 2.0     │ 2.0     │
│ 3   │ 3.0     │ 3.0     │
│ 4   │ 4.0     │ 4.0     │
│ 5   │ 5.0     │ 5.0     │

julia> df = DataFrame([x,[1.0,2.0,3.0,4.0,5.0],["1","2","3","4","5"]], [:A, :B, :C])
5×3 DataFrame
│ Row │ A     │ B       │ C      │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 1.0     │ 1      │
│ 2   │ 2     │ 2.0     │ 2      │
│ 3   │ 3     │ 3.0     │ 3      │
│ 4   │ 4     │ 4.0     │ 4      │
│ 5   │ 5     │ 5.0     │ 5      │

#2

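This is because your intermediate array [x,[1.0,2.0,3.0,4.0,5.0]] uses type promotion in its construction. Rather than Vector{Union{Vector{Int64}, Vector{Float64}}}, it just becomes Vector{Vector{Float64}}, which means x is converted to a Vector{Float64} before DataFrames ever sees it.

However, it can’t do that once a vector of strings is included: there is no common element type to promote Int64, Float64 and String to, so no conversion happens. The outer array just falls back to an abstract element type, and each inner vector keeps its own element type.

Any[x,[1.0,2.0,3.0,4.0,5.0]] works, as does Vector{<:Number}[x,[1.0,2.0,3.0,4.0,5.0]], because a typed array literal skips the promotion step.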
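
The promotion described above can be checked without DataFrames at all. The following is a minimal sketch in plain Julia; the variable names `y`, `promoted`, `mixed` and `untyped` are made up for illustration and do not come from the thread.

```julia
x = [1, 2, 3, 4, 5]                  # Vector{Int}
y = [1.0, 2.0, 3.0, 4.0, 5.0]        # Vector{Float64}

# Plain array literal: the elements are promoted to a common type,
# so x is converted (copied) into a Vector{Float64}.
promoted = [x, y]
@show typeof(promoted)
@show eltype(promoted[1])

# Adding a vector of strings leaves no common element type to promote to,
# so the inner vectors are stored unchanged.
mixed = [x, y, ["1", "2", "3", "4", "5"]]
@show eltype(mixed[1])

# Typed literal: no promotion, x is stored as-is.
untyped = Any[x, y]
@show eltype(untyped[1])
@show untyped[1] === x
```

Comparing `eltype(promoted[1])` with `eltype(untyped[1])` shows that the conversion happens while the array literal is built, before any constructor is called.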


#3

Thanks very much!

Is DataFrames attempting to avoid using Any for some reason, possibly related to efficiency?


#4

No, this is outside of DataFrames. Julia evaluates the argument [x,[1.0,2.0,3.0,4.0,5.0]] first, then DataFrames deals with it.

DataFrames has lots of constructors you can use, though. So if that one doesn’t suit your needs there are other ones that might be better suited.
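
As one example of such an alternative, the keyword-argument constructor takes each column as a separate argument, so no intermediate array of columns is built and no promotion across columns takes place. This is only a sketch, assuming DataFrames is installed; the column names `A` and `B` are reused from the original post.

```julia
using DataFrames

x = [1, 2, 3, 4, 5]

# Each column is passed on its own, so x is never placed in an
# array literal next to the float vector.
df = DataFrame(A = x, B = [1.0, 2.0, 3.0, 4.0, 5.0])

@show eltype(df.A)
@show eltype(df.B)
```

This form only fits when the column names are known when the code is written.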


#5

Thanks again, that’s very helpful. I’ve just started using DataFrames, and one of the things I found a bit tricky is setting up a DataFrame dynamically based on data I’m reading from a file, i.e. without knowing the number of columns or the column types ahead of time.


#6

I would suggest making an empty DataFrame, df = DataFrame(), and then adding columns iteratively with

df[var] = x
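
A minimal sketch of that loop follows. The `names_from_file` and `cols_from_file` values are hypothetical stand-ins for whatever the file reader returns. It also assumes a recent DataFrames release, where, as far as I know, the `df[var] = x` form shown above has been replaced by `df[!, var] = x`.

```julia
using DataFrames

# Hypothetical stand-ins for column names and data parsed from a file.
# Any[...] keeps each column's element type intact (no promotion).
names_from_file = ["id", "score", "label"]
cols_from_file = Any[[1, 2, 3], [0.5, 1.5, 2.5], ["a", "b", "c"]]

df = DataFrame()
for (name, col) in zip(names_from_file, cols_from_file)
    # Newer-syntax equivalent of df[var] = x (assumption: recent DataFrames)
    df[!, Symbol(name)] = col
end

@show size(df)
@show eltype.(eachcol(df))
```

Because each column is assigned separately, neither the number of columns nor their types has to be known in advance.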