Converting columns in Dataframe from Int to Float type

Saran_S · June 11, 2017, 11:46am

Hello,

Below is the dataframe which i am working on and i would like to know if there is any alternative way to convert [:Age, :Salary] columns which is Int64 to Float64

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

Currently i am using the above method to convert from Int to Float while reading. And as you can see if i work on the other datasets with lots of columns, it will get tedious for sure. Kindly let me know if there is any efficient way to achieve this.

Thank You.

joshbode · June 11, 2017, 3:05pm

If the data file being read in by readtable has decimal points in the float column, then the column in the resulting DataFrame will automatically be constructed as a DataVector{Float64}.

e.g.
test.csv:

X,Y
1,1.0
2,2.0

julia> x = readtable("/tmp/test.csv")
2×2 DataFrames.DataFrame
│ Row │ X │ Y   │
│ 1   │ 1 │ 1.0 │
│ 2   │ 2 │ 2.0 │

julia> eltypes(x)
2-element Array{Type,1}:
 Int64
 Float64

Saran_S · June 11, 2017, 3:40pm

Dataframe which i have been working on has two Int64 Columns [:Age, :Salary] which i want to be read as float hence i use
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

So if i have large set of columns with Int64 datatype i would like that to be read in as Float64 in efficient way instead of specifying all the datatype in the order in eltypes while reading data using readtable

dpsanders · June 11, 2017, 4:15pm

Why would you want them to be read as floats?

alasaadstat · June 11, 2017, 4:38pm

Here is my solution:

using DataFrames

c1 = ["France", "Spain", "Germany", "Spain", "Germany", "France", "Spain"];
c2 = @data [44, 27, 30, 38, 40, 35, NA];
c3 = @data [72000, 48000, 54000, 61000, NA, 58000, 52000];
c4 = ["No", "Yes", "No", "No", "Yes", "Yes", "No"];

data1 = DataFrame(Country = c1, Age = c2, Salary = c3, Purchased = c4)

# Convert to Float64
data1[:Salary] = data1[:Salary] * 1.
data1[:Age] = data1[:Age] * 1.

# alternatively
data1[:Age] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Salary])

# View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age  │ Salary  │ Purchased │
# ├─────┼───────────┼──────┼─────────┼───────────┤
# │ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
# │ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
# │ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
# │ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
# │ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
# │ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
# │ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │

# Convert back to Int64
data1[:Age] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Salary])

#View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country   │ Age │ Salary │ Purchased │
# ├─────┼───────────┼─────┼────────┼───────────┤
# │ 1   │ "France"  │ 44  │ 72000  │ "No"      │
# │ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
# │ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
# │ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
# │ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
# │ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
# │ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │

Saran_S · June 11, 2017, 4:50pm

@alasaadstat Thank You for the alternative solution.

I wondering if we could read it as float from readtable function. Instead of

>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);

OR even the below solution

> data1 = readtable("Data1.csv")

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1   │ "France"  │ 44  │ 72000  │ "No"      │
│ 2   │ "Spain"   │ 27  │ 48000  │ "Yes"     │
│ 3   │ "Germany" │ 30  │ 54000  │ "No"      │
│ 4   │ "Spain"   │ 38  │ 61000  │ "No"      │
│ 5   │ "Germany" │ 40  │ NA     │ "Yes"     │
│ 6   │ "France"  │ 35  │ 58000  │ "Yes"     │
│ 7   │ "Spain"   │ NA  │ 52000  │ "No"      │
│ 8   │ "France"  │ 48  │ 79000  │ "Yes"     │
│ 9   │ "Germany" │ 50  │ 83000  │ "No"      │
│ 10  │ "France"  │ 37  │ 67000  │ "Yes"     │

>for c = eachcol(data1)
  if eltype(c[2]) <: Integer
    data1[c[1]] = data1[c[1]] .* 1.0
  end
end

10×4 DataFrames.DataFrame
│ Row │ Country   │ Age  │ Salary  │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1   │ "France"  │ 44.0 │ 72000.0 │ "No"      │
│ 2   │ "Spain"   │ 27.0 │ 48000.0 │ "Yes"     │
│ 3   │ "Germany" │ 30.0 │ 54000.0 │ "No"      │
│ 4   │ "Spain"   │ 38.0 │ 61000.0 │ "No"      │
│ 5   │ "Germany" │ 40.0 │ NA      │ "Yes"     │
│ 6   │ "France"  │ 35.0 │ 58000.0 │ "Yes"     │
│ 7   │ "Spain"   │ NA   │ 52000.0 │ "No"      │
│ 8   │ "France"  │ 48.0 │ 79000.0 │ "Yes"     │
│ 9   │ "Germany" │ 50.0 │ 83000.0 │ "No"      │
│ 10  │ "France"  │ 37.0 │ 67000.0 │ "Yes"     │

Saran_S · June 11, 2017, 5:01pm

@dpsanders
The reason on why i want to convert is to remove the NA type and replace them with mean of the column.

function nameanfunc!{DataFrame}(dtfrm::DataFrame)
  for i = 1:size(dtfrm,2)
      if eltype(dtfrm[:,i]) <: AbstractFloat
        fillnamean(dtfrm, i)
      end
  end
end

function fillnamean{DataFrame}(dtfr::DataFrame, ind::Int64)
  @inbounds dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))
end

If i use the above code it will not replace the NA type with mean as mean is of Float64 and and column is of DataArray{Int64,1} type.
One alternative method would be instead of

dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))

in fillnamean function i wil have to use this

eltype((dtfr[:,ind])) <:AbstractFloat ? mean(dropna(dtfr[:,ind])): trunc(Int64, mean(dropna(dtfr[:,ind])))

joshbode · June 11, 2017, 8:46pm

How about this? Read everything as-is and just convert the types that need to be:

using DataArrays
using DataFrames

data1 = readtable("Data1.csv")
data1[:Salary] = convert(DataVector{Float64}, data1[:Salary])

Saran_S · June 12, 2017, 1:57am

@joshbode. Thank you. Its Alternative method to multiplying by 1.0.

Topic		Replies	Views
How to convert the String15 datatype in DataFrames.jl to Float64 General Usage type , dataframes , convert	9	1044	March 12, 2023
How to parse/convert integers in DataFrame to float numbers New to Julia dataframes	30	1954	March 19, 2021
Foolproof method for converting to Float64 New to Julia dataframes , convert	10	2005	May 28, 2021
Type conversion driving me crazy Data dataframes	10	1157	September 14, 2022
DataFrames: convert column data type Data type , dataframes , convert	65	28242	May 11, 2023

Converting columns in Dataframe from Int to Float type

Related topics