Problem changing Variables from Any to Real

I tried running this

using XLSX
df = DataFrame(XLSX.readtable("C:/Users/brett/OneDrive/Documents/FemData.xlsx", "Sheet1")...)

for name in names(df)
       df[!, name] = identity.(df[:, name])
end

print(df)

It changed all the values from ‘Any’ to ‘Real’, except that one was changed to ‘Floating’, so I can’t use it for regression models. How do I change this last value to real?

43×4 DataFrame
│ Row │ Female_Employment │ D_Fem_Employment │ Crude_Birth_Rate │ D_Crude_Birth_Rate │
│     │ Any               │ Any              │ Any              │ Any                │
├─────┼───────────────────┼──────────────────┼──────────────────┼────────────────────┤
│ 1   │ 49.95             │ 15.7             │ 0.0195195        │ -0.0127389         │
│ 2   │ 50.925            │ 15.5             │ 0.0363279        │ -0.0129032         │
│ 3   │ 52.775            │ 15.3             │ 0.0341071        │ 0.0130719          │
│ 4   │ 54.575            │ 15.5             │ 0.0306917        │ 0                  │
│ 5   │ 56.25             │ 15.5             │ 0.0293333        │ -0.0129032         │
│ 6   │ 57.9              │ 15.3             │ -0.00863558      │ -0.0130719         │
│ 7   │ 57.4              │ 15.1             │ 0.010453         │ -0.00662252        │
│ 8   │ 58                │ 15               │ 0.0176724        │ 0                  │
│ 9   │ 59.025            │ 15               │ 0.0245659        │ -0.0133333         │
│ 10  │ 60.475            │ 14.8             │ 0.0301778        │ -0.00675676        │
│ 11  │ 62.3              │ 14.7             │ 0.0156501        │ -0.0204082         │
│ 12  │ 63.275            │ 14.4             │ 0.0209403        │ 0.00694444         │
│ 13  │ 64.6              │ 14.5             │ 0.0116099        │ 0.0344828          │
│ 14  │ 65.35             │ 15               │ 0.00841622       │ 0.02               │
│ 15  │ 65.9              │ 15.3             │ -0.0121396       │ -0.0261438         │
│ 16  │ 65.1              │ 14.9             │ -0.0134409       │ -0.0134228         │
│ 17  │ 64.225            │ 14.7             │ -0.00544959      │ -0.0748299         │
│ 18  │ 63.875            │ 13.6             │ 0.00587084       │ -0.0367647         │
│ 19  │ 64.25             │ 13.1             │ 0.00933852       │ -0.0381679         │
│ 20  │ 64.85             │ 12.6             │ 0.00578258       │ -0.047619          │
│ 21  │ 65.225            │ 12               │ 0.014565         │ -0.0333333         │
│ 22  │ 66.175            │ 11.6             │ 0.0162448        │ -0.0172414         │
│ 23  │ 67.25             │ 11.4             │ -0.0175439       │ -0.0175439         │
│ 24  │ 68.025            │ 11.2             │ 0.00955531       │ -0.0446429         │
│ 25  │ 68.675            │ 10.7             │ 0.000728067      │ 0                  │
│ 26  │ 68.725            │ 10.7             │ 0.00618407       │ -0.0186916         │
│ 27  │ 69.15             │ 10.5             │ 0.00686913       │ 0.00952381         │
│ 28  │ 69.625            │ 10.6             │ 0.00861759       │ -0.00943396        │
│ 29  │ 70.225            │ 10.5             │ -0.00427198      │ 0.00952381         │
│ 30  │ 69.925            │ 10.6             │ 0.00286021       │ 0.0283019          │
│ 31  │ 70.125            │ 10.9             │ 0.0142602        │ 0.0275229          │
│ 32  │ 71.125            │ 11.2             │ -0.00667838      │ 0.00892857         │
│ 33  │ 70.65             │ 11.3             │ -0.00920028      │ 0                  │
│ 34  │ 70                │ 11.3             │ -0.00392857      │ -0.0176991         │
│ 35  │ 69.725            │ 11.1             │ 0.000358551      │ -0.00900901        │
│ 36  │ 69.75             │ 11               │ 0.00681004       │ 0                  │
│ 37  │ 70.225            │ 11               │ 0.00569598       │ -0.0181818         │
│ 38  │ 70.625            │ 10.8             │ -0.00176991      │ 0                  │
│ 39  │ 70.5              │ 10.8             │ -0.00744681      │ -0.00925926        │
│ 40  │ 69.975            │ 10.7             │ 0.0096463        │ -0.00934579        │
│ 41  │ 70.65             │ 10.6             │ 0.009908         │ -0.0283019         │
│ 42  │ 71.35             │ 10.3             │ 0.00175193       │ -0.0194175         │
│ 43  │ 71.475            │ 10.1             │ -0.0175439       │ -0.0748299         │
43×4 DataFrame
│ Row │ Female_Employment │ D_Fem_Employment │ Crude_Birth_Rate │ D_Crude_Birth_Rate │
│     │ Real              │ Real             │ Float64          │ Real               │
├─────┼───────────────────┼──────────────────┼──────────────────┼────────────────────┤
│ 1   │ 49.95             │ 15.7             │ 0.0195195        │ -0.0127389         │
│ 2   │ 50.925            │ 15.5             │ 0.0363279        │ -0.0129032         │
│ 3   │ 52.775            │ 15.3             │ 0.0341071        │ 0.0130719          │
│ 4   │ 54.575            │ 15.5             │ 0.0306917        │ 0                  │
│ 5   │ 56.25             │ 15.5             │ 0.0293333        │ -0.0129032         │
│ 6   │ 57.9              │ 15.3             │ -0.00863558      │ -0.0130719         │
│ 7   │ 57.4              │ 15.1             │ 0.010453         │ -0.00662252        │
│ 8   │ 58                │ 15               │ 0.0176724        │ 0                  │
│ 9   │ 59.025            │ 15               │ 0.0245659        │ -0.0133333         │
│ 10  │ 60.475            │ 14.8             │ 0.0301778        │ -0.00675676        │
│ 11  │ 62.3              │ 14.7             │ 0.0156501        │ -0.0204082         │
│ 12  │ 63.275            │ 14.4             │ 0.0209403        │ 0.00694444         │
│ 13  │ 64.6              │ 14.5             │ 0.0116099        │ 0.0344828          │
│ 14  │ 65.35             │ 15               │ 0.00841622       │ 0.02               │
│ 15  │ 65.9              │ 15.3             │ -0.0121396       │ -0.0261438         │
│ 16  │ 65.1              │ 14.9             │ -0.0134409       │ -0.0134228         │
│ 17  │ 64.225            │ 14.7             │ -0.00544959      │ -0.0748299         │
│ 18  │ 63.875            │ 13.6             │ 0.00587084       │ -0.0367647         │
│ 19  │ 64.25             │ 13.1             │ 0.00933852       │ -0.0381679         │
│ 20  │ 64.85             │ 12.6             │ 0.00578258       │ -0.047619          │
│ 21  │ 65.225            │ 12               │ 0.014565         │ -0.0333333         │
│ 22  │ 66.175            │ 11.6             │ 0.0162448        │ -0.0172414         │
│ 23  │ 67.25             │ 11.4             │ -0.0175439       │ -0.0175439         │
│ 24  │ 68.025            │ 11.2             │ 0.00955531       │ -0.0446429         │
│ 25  │ 68.675            │ 10.7             │ 0.000728067      │ 0                  │
│ 26  │ 68.725            │ 10.7             │ 0.00618407       │ -0.0186916         │
│ 27  │ 69.15             │ 10.5             │ 0.00686913       │ 0.00952381         │
│ 28  │ 69.625            │ 10.6             │ 0.00861759       │ -0.00943396        │
│ 29  │ 70.225            │ 10.5             │ -0.00427198      │ 0.00952381         │
│ 30  │ 69.925            │ 10.6             │ 0.00286021       │ 0.0283019          │
│ 31  │ 70.125            │ 10.9             │ 0.0142602        │ 0.0275229          │
│ 32  │ 71.125            │ 11.2             │ -0.00667838      │ 0.00892857         │
│ 33  │ 70.65             │ 11.3             │ -0.00920028      │ 0                  │
│ 34  │ 70                │ 11.3             │ -0.00392857      │ -0.0176991         │
│ 35  │ 69.725            │ 11.1             │ 0.000358551      │ -0.00900901        │
│ 36  │ 69.75             │ 11               │ 0.00681004       │ 0                  │
│ 37  │ 70.225            │ 11               │ 0.00569598       │ -0.0181818         │
│ 38  │ 70.625            │ 10.8             │ -0.00176991      │ 0                  │
│ 39  │ 70.5              │ 10.8             │ -0.00744681      │ -0.00925926        │
│ 40  │ 69.975            │ 10.7             │ 0.0096463        │ -0.00934579        │
│ 41  │ 70.65             │ 10.6             │ 0.009908         │ -0.0283019         │
│ 42  │ 71.35             │ 10.3             │ 0.00175193       │ -0.0194175         │
│ 43  │ 71.475            │ 10.1             │ -0.0175439       │ -0.0748299         │

If there isn’t a way to pass types to the readtable function then one way to solve your problem is with convert. For example:

df = DataFrame(a=Any[])
for i = 1:10
    push!(df.a, i)
end

julia> df
10×1 DataFrame
│ Row │ a   │
│     │ Any │
├─────┼─────┤
│ 1   │ 1   │
│ 2   │ 2   │
│ 3   │ 3   │
│ 4   │ 4   │
│ 5   │ 5   │
│ 6   │ 6   │
│ 7   │ 7   │
│ 8   │ 8   │
│ 9   │ 9   │
│ 10  │ 10  │

df.a = convert.(Int, df.a)

julia> df
10×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
│ 4   │ 4     │
│ 5   │ 5     │
│ 6   │ 6     │
│ 7   │ 7     │
│ 8   │ 8     │
│ 9   │ 9     │
│ 10  │ 10    │

This is really not a good option though in my opinion. I would rather save the file as a .csv and then use CSV.jl with DataFrames.jl as it is generally pretty good about parsing the columns. Otherwise, I would try a different package for working with .xlsx files. Although I’ve not personally worked with it, I would probably try ExcelFiles.jl which is part of the Queryverse family of packages and my bet would be that it handles parsing well. I used Querverse quite a bit when I first started working with Julia but I’ve gotten away from it as my command of the language has increased (I’m assuming you are fairly new to Julia so this might be a route that will serve you well, as it did me).

I deleted the last 2 columns because they weren’t important, and changed the names so that the second column is Crude_Birth_Rate, but it’s still saying that I have a floating variable.

Here’s the problem. I tried to save and change the file, then I tried restarting atom.


using XLSX
df = DataFrame(XLSX.readtable("C:/Users/brett/Downloads/FemData.xlsx", "Sheet1")...)

for name in names(df)
       df[!, name] = identity.(df[:, name])
end

print(df)

A= df.Female_Employment
B= df.Crude_Birth_Rate

df1=  DataFrame(A= df.Female_Employment, B= df.Crude_Birth_Rate)

using GLM
ols = lm(@formula(Female_Employment ~ Crude_Birth_Rate), df)

r2(ols)

After

using XLSX
df = DataFrame(XLSX.readtable("C:/Users/brett/Downloads/FemData.xlsx", "Sheet1")...)

I get the error message

UndefVarError: DataFrame not defined

You need to remember to have using DataFrames again.

Also, Real is an abstract type. You want something to be of type Float. that’s a good thing.

OK, the dataframe part is sorted out. So I need to change the variables from Any to Float64, how do I change them to Float64 instead of Real?

df = DataFrame(XLSX.readtable("C:/Users/brett/Downloads/FemData.xlsx", "Sheet1")...)

for name in names(df)
       df[!, name] = identity.(df[:, name])
end

I see. Yeah this is less than ideal for a user. It’s a bummer that identity.(df[!, name]) doesn’t get the array to be a concrete type.

I would take the columns that are of type Real and convert them to Float64 via df[!, name] = Float64.(df[!, name]). However obviously that will throw errors for Strings and missing values.

Realistically this should be handled by XLSX.jl handling the types better and not returning arrays of type Any.

If anyone knows a function that does

x = Any[1.0, 2.0, 3.0, 4]
new_x = foo(x)

where new_x is the correct type, maybe they can chime in.

In the meantime, OP, you can use this instead of identity.(x) above.

function new_vector_correct_type(x)
         T = reduce(promote_type, typeof.(x))
         if T == Any
            return identity.(x)
        else
            return T.(x)
        end
    end
end
1 Like

I think what is “correct” is case-dependent. But, if you want promote_type instead of promote_typejoin as used in broadcasting etc., you can use BangBang.push!!

julia> using BangBang

julia> foldl(push!!, Any[1.0, 2.0, 3.0, 4]; init=Union{}[])
4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0

Does name include variable names, because otherwise I’m not working with strings, and there aren’t any missing values, although there are values coded as 9 or 999 in the next thing I’m doing.

That’s the problem. you don’t want to use Float64.(df[!, name]) where df[!, name] is a vector that includes strings or missing values.

Unfortunately right now there isn’t a great solution other than what I posted above. We can try and make a contribution to XLSX or some other data-oriented package to make this more convenient.

I think a bunch of your problems would be solved by just opening up Excel and saving your xlsx files as csv instead - CSV.jl has very good support for type inference and is well integrated with DataFrames.

1 Like