Converting a Matrix to a DataFrame

I am struggling to convert a Matrix where I have the first row as header and possible NA values in a DataFrame.

Actually I did find at least two ways, and the resulting df seems ok, but in reality the individual columns are either Vector or SubArray, instead of DataArray, and this then cause me problems when I try to operate on them (e.g. to remove the NA with dropna() )

So, this is my code:

using DataFrames, DataFramesMeta

# Original data in Matrix format..
m = ["A"       "B"        "C"
     1         10         100;
     nothing   nothing    nothing;
     3         30         300;
     4         40         400]
# Converting to Dataframe..
h = [Symbol(c) for c in m[1,:]]
vals = m[2:end, :]
vals2 = [vals[:,c] for c in 1:size(vals)[2]]
df1 = DataFrame(vals2, h)
# alternative that produce subarrays:  df =  DataFrame(Any[@view m[2:end, i] for i::Int64 in 1:size(m, 2)], Symbol.(m[1, :]))
# Cleaning to get NA values..
for row in eachrow(df1)
  for name in names(df1)
    if row[name] == nothing
        row[name] = NA
    end
  end
end

Now, the problem is the type of dataframe columns, as they are Vectors:

typeof(df1[:A])
Array{Any,1}

And some functions don’t work with them:

dropna(df1[:A])
4-element Array{Any,1}:
 1  
  NA
 3  
 4

For comparison, when I create a df by scratch:

df2 = DataFrame(
    A = [1, 2, 3, 4],
    B = [10, 20, 30, 40]
)
df2[2,:A] = NA 
df2[2,:B] = NA

The type is a DataArray and dropna() works without problems:

typeof(df2[:A])
DataArrays.DataArray{Int64,1}
dropna(df2[:A])
3-element Array{Int64,1}:
 1
 3
 4

I solved with:

matnms2df(mat, headerstrs) = convert(DataFrame,
     Dict(zip(headerstrs,[mat[:,i] for i in 1:size(mat,2)])))
df4 = matnms2df(vals,h)
for row in eachrow(df4)
  for name in names(df4)
    if row[name] == nothing
        row[name] = NA
    end
  end
end
typeof(df4[:A])
DataArrays.DataArray{Any,1}

(from Convert Julia array to dataframe - Stack Overflow)

Still I don’t understand why if you construct the DataFrame in a different way than using a Dict the type of the columns change.

1 Like

Thanks for sharing your solution.

Can I ask how you ended up in a situation where you have a string/float mixed type array in the first place?

Typically you don’t want to do that, having strings in an otherwise numerical array like that breaks a lot of functionality. Here’s my work around for this…

using DataFrames
m = ["A"       "B"        "C"
     1         10         100;
     nothing   nothing    nothing;
     3         30         300;
     4         40         400]
m[ m .== nothing ] .= missing
Names = m[1,:]
Numbers = m[2:end,:]

newdf = convert(DataFrame, Numbers);
rename!(newdf, names(newdf) .=> Symbol.(Names[:]))

dropmissing(newdf)

Note that your code will create a very inefficient data frame since its columns will have element type Any. Here’s an alternative:

DataFrame([something.(col, missing) for col in eachcol(m[2:end, :])], Symbol.(m[1, :]))

Or slightly more efficient (by avoiding an intermediate copy of the data):

DataFrame([something.(@view(m[2:end, i]), missing) for i in 1:size(m, 2)], Symbol.(m[1, :]))
1 Like