Hello,
Below is the dataframe which i am working on and i would like to know if there is any alternative way to convert [:Age, :Salary]
columns which is Int64 to Float64
10×4 DataFrames.DataFrame
│ Row │ Country │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1 │ "France" │ 44 │ 72000 │ "No" │
│ 2 │ "Spain" │ 27 │ 48000 │ "Yes" │
│ 3 │ "Germany" │ 30 │ 54000 │ "No" │
│ 4 │ "Spain" │ 38 │ 61000 │ "No" │
│ 5 │ "Germany" │ 40 │ NA │ "Yes" │
│ 6 │ "France" │ 35 │ 58000 │ "Yes" │
│ 7 │ "Spain" │ NA │ 52000 │ "No" │
│ 8 │ "France" │ 48 │ 79000 │ "Yes" │
│ 9 │ "Germany" │ 50 │ 83000 │ "No" │
│ 10 │ "France" │ 37 │ 67000 │ "Yes" │
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);
10×4 DataFrames.DataFrame
│ Row │ Country │ Age │ Salary │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1 │ "France" │ 44.0 │ 72000.0 │ "No" │
│ 2 │ "Spain" │ 27.0 │ 48000.0 │ "Yes" │
│ 3 │ "Germany" │ 30.0 │ 54000.0 │ "No" │
│ 4 │ "Spain" │ 38.0 │ 61000.0 │ "No" │
│ 5 │ "Germany" │ 40.0 │ NA │ "Yes" │
│ 6 │ "France" │ 35.0 │ 58000.0 │ "Yes" │
│ 7 │ "Spain" │ NA │ 52000.0 │ "No" │
│ 8 │ "France" │ 48.0 │ 79000.0 │ "Yes" │
│ 9 │ "Germany" │ 50.0 │ 83000.0 │ "No" │
│ 10 │ "France" │ 37.0 │ 67000.0 │ "Yes" │
Currently i am using the above method to convert from Int to Float while reading. And as you can see if i work on the other datasets with lots of columns, it will get tedious for sure. Kindly let me know if there is any efficient way to achieve this.
Thank You.
If the data file being read in by readtable
has decimal points in the float column, then the column in the resulting DataFrame
will automatically be constructed as a DataVector{Float64}
.
e.g.
test.csv
:
X,Y
1,1.0
2,2.0
julia> x = readtable("/tmp/test.csv")
2×2 DataFrames.DataFrame
│ Row │ X │ Y │
│ 1 │ 1 │ 1.0 │
│ 2 │ 2 │ 2.0 │
julia> eltypes(x)
2-element Array{Type,1}:
Int64
Float64
Dataframe which i have been working on has two Int64 Columns [:Age, :Salary]
which i want to be read as float hence i use
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);
So if i have large set of columns with Int64 datatype i would like that to be read in as Float64 in efficient way instead of specifying all the datatype in the order in eltypes
while reading data using readtable
Why would you want them to be read as floats?
Here is my solution:
using DataFrames
c1 = ["France", "Spain", "Germany", "Spain", "Germany", "France", "Spain"];
c2 = @data [44, 27, 30, 38, 40, 35, NA];
c3 = @data [72000, 48000, 54000, 61000, NA, 58000, 52000];
c4 = ["No", "Yes", "No", "No", "Yes", "Yes", "No"];
data1 = DataFrame(Country = c1, Age = c2, Salary = c3, Purchased = c4)
# Convert to Float64
data1[:Salary] = data1[:Salary] * 1.
data1[:Age] = data1[:Age] * 1.
# alternatively
data1[:Age] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Float64, x), data1[:Salary])
# View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country │ Age │ Salary │ Purchased │
# ├─────┼───────────┼──────┼─────────┼───────────┤
# │ 1 │ "France" │ 44.0 │ 72000.0 │ "No" │
# │ 2 │ "Spain" │ 27.0 │ 48000.0 │ "Yes" │
# │ 3 │ "Germany" │ 30.0 │ 54000.0 │ "No" │
# │ 4 │ "Spain" │ 38.0 │ 61000.0 │ "No" │
# │ 5 │ "Germany" │ 40.0 │ NA │ "Yes" │
# │ 6 │ "France" │ 35.0 │ 58000.0 │ "Yes" │
# │ 7 │ "Spain" │ NA │ 52000.0 │ "No" │
# Convert back to Int64
data1[:Age] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Age])
data1[:Salary] = map(x -> isna(x) ? NA : convert(Int64, x), data1[:Salary])
#View Result
data1
# 7×4 DataFrames.DataFrame
# │ Row │ Country │ Age │ Salary │ Purchased │
# ├─────┼───────────┼─────┼────────┼───────────┤
# │ 1 │ "France" │ 44 │ 72000 │ "No" │
# │ 2 │ "Spain" │ 27 │ 48000 │ "Yes" │
# │ 3 │ "Germany" │ 30 │ 54000 │ "No" │
# │ 4 │ "Spain" │ 38 │ 61000 │ "No" │
# │ 5 │ "Germany" │ 40 │ NA │ "Yes" │
# │ 6 │ "France" │ 35 │ 58000 │ "Yes" │
# │ 7 │ "Spain" │ NA │ 52000 │ "No" │
@alasaadstat Thank You for the alternative solution.
I wondering if we could read it as float from readtable
function. Instead of
>data1 = readtable("Data1.csv", eltypes = [String,Float64,Float64,String]);
OR even the below solution
> data1 = readtable("Data1.csv")
10×4 DataFrames.DataFrame
│ Row │ Country │ Age │ Salary │ Purchased │
├─────┼───────────┼─────┼────────┼───────────┤
│ 1 │ "France" │ 44 │ 72000 │ "No" │
│ 2 │ "Spain" │ 27 │ 48000 │ "Yes" │
│ 3 │ "Germany" │ 30 │ 54000 │ "No" │
│ 4 │ "Spain" │ 38 │ 61000 │ "No" │
│ 5 │ "Germany" │ 40 │ NA │ "Yes" │
│ 6 │ "France" │ 35 │ 58000 │ "Yes" │
│ 7 │ "Spain" │ NA │ 52000 │ "No" │
│ 8 │ "France" │ 48 │ 79000 │ "Yes" │
│ 9 │ "Germany" │ 50 │ 83000 │ "No" │
│ 10 │ "France" │ 37 │ 67000 │ "Yes" │
>for c = eachcol(data1)
if eltype(c[2]) <: Integer
data1[c[1]] = data1[c[1]] .* 1.0
end
end
10×4 DataFrames.DataFrame
│ Row │ Country │ Age │ Salary │ Purchased │
├─────┼───────────┼──────┼─────────┼───────────┤
│ 1 │ "France" │ 44.0 │ 72000.0 │ "No" │
│ 2 │ "Spain" │ 27.0 │ 48000.0 │ "Yes" │
│ 3 │ "Germany" │ 30.0 │ 54000.0 │ "No" │
│ 4 │ "Spain" │ 38.0 │ 61000.0 │ "No" │
│ 5 │ "Germany" │ 40.0 │ NA │ "Yes" │
│ 6 │ "France" │ 35.0 │ 58000.0 │ "Yes" │
│ 7 │ "Spain" │ NA │ 52000.0 │ "No" │
│ 8 │ "France" │ 48.0 │ 79000.0 │ "Yes" │
│ 9 │ "Germany" │ 50.0 │ 83000.0 │ "No" │
│ 10 │ "France" │ 37.0 │ 67000.0 │ "Yes" │
@dpsanders
The reason on why i want to convert is to remove the NA
type and replace them with mean of the column.
function nameanfunc!{DataFrame}(dtfrm::DataFrame)
for i = 1:size(dtfrm,2)
if eltype(dtfrm[:,i]) <: AbstractFloat
fillnamean(dtfrm, i)
end
end
end
function fillnamean{DataFrame}(dtfr::DataFrame, ind::Int64)
@inbounds dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))
end
If i use the above code it will not replace the NA
type with mean as mean is of Float64 and and column is of DataArray{Int64,1} type.
One alternative method would be instead of
dtfr[dtfr[:,ind].na, ind] = mean(dropna(dtfr[:,ind]))
in fillnamean function i wil have to use this
eltype((dtfr[:,ind])) <:AbstractFloat ? mean(dropna(dtfr[:,ind])): trunc(Int64, mean(dropna(dtfr[:,ind])))
How about this? Read everything as-is and just convert the types that need to be:
using DataArrays
using DataFrames
data1 = readtable("Data1.csv")
data1[:Salary] = convert(DataVector{Float64}, data1[:Salary])
@joshbode. Thank you. Its Alternative method to multiplying by 1.0.