Converting columns in Dataframe from Int to Float type

Hello,

Below is the dataframe which i am working on and i would like to know if there is any alternative way to convert [:Age, :Salary] columns which is Int64 to Float64

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

Currently i am using the above method to convert from Int to Float while reading. And as you can see if i work on the other datasets with lots of columns, it will get tedious for sure. Kindly let me know if there is any efficient way to achieve this.

Thank You.

If the data file being read in by readtable has decimal points in the float column, then the column in the resulting DataFrame will automatically be constructed as a DataVector{Float64}.

e.g.
test.csv:

X,Y
1,1.0
2,2.0
julia> x = readtable("/tmp/test.csv")
2×2 DataFrames.DataFrame
│ Row │ X │ Y   │
│ 1   │ 1 │ 1.0 │
│ 2   │ 2 │ 2.0 │

julia> eltypes(x)
2-element Array{Type,1}:
 Int64
 Float64

Dataframe which i have been working on has two Int64 Columns [:Age, :Salary] which i want to be read as float hence i use
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

So if i have large set of columns with Int64 datatype i would like that to be read in as Float64 in efficient way instead of specifying all the datatype in the order in eltypes while reading data using readtable

Why would you want them to be read as floats?

Here is my solution:

using DataFrames

c1 = ["France", "Spain", "Germany", "Spain", "Germany", "France", "Spain"];
c2 = @data [44, 27, 30, 38, 40, 35, NA];
c3 = @data [72000, 48000, 54000, 61000, NA, 58000, 52000];
c4 = ["No", "Yes", "No", "No", "Yes", "Yes", "No"];

data1 = DataFrame(Country = c1, Age = c2, Salary = c3, Purchased = c4)

# Convert to Float64
data1[:Salary] = data1[:Salary] * 1.
data1[:Age] = data1[:Age] * 1.

# alternatively
data1[:Age] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Salary])

# View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age  │ Salary  │ Purchased │
# ├─────┼───────────┼──────┼─────────┼───────────┤
# │ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
# │ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
# │ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
# │ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
# │ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
# │ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
# │ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │

# Convert back to Int64
data1[:Age] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Salary])

#View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age │ Salary │ Purchased │
# ├─────┼───────────┼─────┼────────┼───────────┤
# │ 1   │ "France"  │ 44  │ 72000  │ "No"      │
# │ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
# │ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
# │ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
# │ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
# │ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
# │ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │

@alasaadstat Thank You for the alternative solution.

I wondering if we could read it as float from readtable function. Instead of

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

OR even the below solution

> data1 = readtable("Data1.csv")

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │
>for c = eachcol(data1)
  if eltype(c[2]) <: Integer
    data1[c[1]] = data1[c[1]] .* 1.0
  end
end
10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

@dpsanders
The reason on why i want to convert is to remove the NA type and replace them with mean of the column.

function nameanfunc!{DataFrame}(dtfrm::DataFrame)
  for i = 1:size(dtfrm,2)
      if eltype(dtfrm[:,i]) <: AbstractFloat
        fillnamean(dtfrm, i)
      end
  end
end

function fillnamean{DataFrame}(dtfr::DataFrame, ind::Int64)
  @inbounds dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))
end

If i use the above code it will not replace the NA type with mean as mean is of Float64 and and column is of DataArray{Int64,1} type.
One alternative method would be instead of

dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))

in fillnamean function i wil have to use this

eltype((dtfr[:,ind])) <:AbstractFloat ? mean(dropna(dtfr[:,ind])): trunc(Int64, mean(dropna(dtfr[:,ind])))

How about this? Read everything as-is and just convert the types that need to be:

using DataArrays
using DataFrames

data1 = readtable("Data1.csv")
data1[:Salary] = convert(DataVector{Float64}, data1[:Salary])

@joshbode. Thank you. Its Alternative method to multiplying by 1.0.