Converting columns in Dataframe from Int to Float type

question

#1

Hello,

Below is the dataframe which i am working on and i would like to know if there is any alternative way to convert [:Age, :Salary] columns which is Int64 to Float64

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

Currently i am using the above method to convert from Int to Float while reading. And as you can see if i work on the other datasets with lots of columns, it will get tedious for sure. Kindly let me know if there is any efficient way to achieve this.

Thank You.


#2

If the data file being read in by readtable has decimal points in the float column, then the column in the resulting DataFrame will automatically be constructed as a DataVector{Float64}.

e.g.
test.csv:

X,Y
1,1.0
2,2.0
julia> x = readtable("/tmp/test.csv")
2×2 DataFrames.DataFrame
│ Row │ X │ Y   │
│ 1   │ 1 │ 1.0 │
│ 2   │ 2 │ 2.0 │

julia> eltypes(x)
2-element Array{Type,1}:
 Int64
 Float64

#3

Dataframe which i have been working on has two Int64 Columns [:Age, :Salary] which i want to be read as float hence i use
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

So if i have large set of columns with Int64 datatype i would like that to be read in as Float64 in efficient way instead of specifying all the datatype in the order in eltypes while reading data using readtable


#4

Why would you want them to be read as floats?


#6

Here is my solution:

using DataFrames

c1 = ["France", "Spain", "Germany", "Spain", "Germany", "France", "Spain"];
c2 = @data [44, 27, 30, 38, 40, 35, NA];
c3 = @data [72000, 48000, 54000, 61000, NA, 58000, 52000];
c4 = ["No", "Yes", "No", "No", "Yes", "Yes", "No"];

data1 = DataFrame(Country = c1, Age = c2, Salary = c3, Purchased = c4)

# Convert to Float64
data1[:Salary] = data1[:Salary] * 1.
data1[:Age] = data1[:Age] * 1.

# alternatively
data1[:Age] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Salary])

# View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age  │ Salary  │ Purchased │
# ├─────┼───────────┼──────┼─────────┼───────────┤
# │ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
# │ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
# │ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
# │ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
# │ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
# │ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
# │ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │

# Convert back to Int64
data1[:Age] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Salary])

#View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age │ Salary │ Purchased │
# ├─────┼───────────┼─────┼────────┼───────────┤
# │ 1   │ "France"  │ 44  │ 72000  │ "No"      │
# │ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
# │ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
# │ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
# │ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
# │ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
# │ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │

#7

@alasaadstat Thank You for the alternative solution.

I wondering if we could read it as float from readtable function. Instead of

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

OR even the below solution

> data1 = readtable("Data1.csv")

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │
>for c = eachcol(data1)
  if eltype(c[2]) <: Integer
    data1[c[1]] = data1[c[1]] .* 1.0
  end
end
10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

#8

@dpsanders
The reason on why i want to convert is to remove the NA type and replace them with mean of the column.

function nameanfunc!{DataFrame}(dtfrm::DataFrame)
  for i = 1:size(dtfrm,2)
      if eltype(dtfrm[:,i]) <: AbstractFloat
        fillnamean(dtfrm, i)
      end
  end
end

function fillnamean{DataFrame}(dtfr::DataFrame, ind::Int64)
  @inbounds dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))
end

If i use the above code it will not replace the NA type with mean as mean is of Float64 and and column is of DataArray{Int64,1} type.
One alternative method would be instead of

dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))

in fillnamean function i wil have to use this

eltype((dtfr[:,ind])) <:AbstractFloat ? mean(dropna(dtfr[:,ind])): trunc(Int64, mean(dropna(dtfr[:,ind])))


#9

How about this? Read everything as-is and just convert the types that need to be:

using DataArrays
using DataFrames

data1 = readtable("Data1.csv")
data1[:Salary] = convert(DataVector{Float64}, data1[:Salary])

#10

@joshbode. Thank you. Its Alternative method to multiplying by 1.0.