DataFrames: convert column data type

Was looking around, but didn’t find an answer.

Specifically, I have a DataFrame and one column has data type Int64; it has the unique values 0 and 1 (obviously meaning false and true). What is the easiest, quickest way to convert the column, including its entries, to the boolean data type?

I thought about a kind of loop, but are there already existing functions available that I should know of?

More generally, please explain how to best tackle this (especially conceptually), if using a self-made solution.

My thought then might be to take the whole array/column, check every value, make a new array based on set conditions (if 0, make false; if 1, make true, etc.), mutate or add the new array into the dataframe.

df.int_col .== 1 will return a BitArray column

Very useful. Thanks.

A more general way to do this is (assuming the column is called x)

df[!,:x] = convert.(Bool,df[!,:x])
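
To see how the two suggestions differ, here is a minimal sketch on a plain vector standing in for the column (the data is made up; with a DataFrame the right-hand side would be the column itself). The comparison maps anything that is not 1 to false, while convert refuses values other than 0 and 1.

```julia
# Plain vector standing in for the Int64 column (hypothetical data).
x = [0, 1, 1, 0]

# Elementwise comparison: any value other than 1 silently becomes false.
by_comparison = x .== 1

# Elementwise conversion: only 0 and 1 are accepted.
by_convert = convert.(Bool, x)

# A value outside {0, 1} makes convert throw an InexactError.
bad_value_rejected = try
    convert.(Bool, [0, 2])
    false
catch err
    err isa InexactError
end
```

If the column might contain stray values, the convert version is probably the safer choice, since it fails loudly instead of quietly turning them into false.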

Thanks for the reply.

I suppose I can post this here, since it concerns a similar issue.

There is a dataframe. It has a String column with missing values. Its values are actually integers.

What is the most direct and easiest way to convert this whole column of String (with missing) to one of Int64 (with missing)?

I thought of your generic way, @tbeason, but it seems it requires more in this case. I thought there was a function to convert such strings to int, but I could be mistaken.

I’m getting some progress. I found the function parse().

Unfortunately parse doesn’t work with missings. You are looking for passmissing from Missings.jl.

julia> df.col = passmissing(parse).(Int, df.col)
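
For intuition, roughly what passmissing(parse) does can be sketched with Base alone. The helper name parse_or_missing and the sample data are made up for illustration; in real code passmissing from Missings.jl is the ready-made tool.

```julia
# Hypothetical column: integer-like strings with missing entries.
col = ["1", missing, "42"]

# Hand-written stand-in for passmissing(parse): a missing input stays
# missing, everything else goes through parse.
parse_or_missing(T, s) = ismissing(s) ? missing : parse(T, s)

# Broadcast over the column; Int is treated as a scalar argument.
ints = parse_or_missing.(Int, col)
```

The result holds integers and missing side by side, so the missing entries survive the conversion.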

Thanks.

I believe that you could just have

df[:,:x] = convert.(Bool,df[!,:x])

(notice that : instead of !) to avoid making two copies.

My mistake, it has to be the other way around:

df[!,:x] = convert.(Bool,df[:,:x])

When I do
df[!,:x] = convert.(Int64,df[:,:x])
I get

ERROR: MethodError: Cannot 'convert' an object of type String to an object of type Int64

How do I resolve this?

You want parse.(Int64, df[:, :x])

My column is called Id. It is set as String and I need to change it to Int64.
Its entries are 10-digit codes.

df[!,:Id] = parse.(Int64,df[:,:Id])

I get
ERROR: ArgumentError: invalid base 10 digit 'I' in "Id"

Could you copy and paste part of your column into this thread? You probably want tryparse, which will return nothing if it finds a value like "Id209.4", which can’t be parsed as a number.
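
A small sketch of how tryparse behaves, using a made-up vector where a stray header string sits among the values (an assumption about what the column might look like):

```julia
# Hypothetical column contents: a stray header string followed by ids.
raw = ["Id_internal", "7483947898", "7475629104"]

# tryparse returns nothing instead of throwing when a string is not a number.
parsed = tryparse.(Int64, raw)

# Keep only the entries that parsed successfully.
ids = Int64[v for v in parsed if v !== nothing]
```

Looking at which entries came back as nothing is also a quick way to find the offending rows.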

My columns are:

Id_internal   |     Date
7483947898    |     2020-11-28
7475629104    |     2021-01-23
7384881913    |     2020-12-28

Both columns are set as integers, but I want to convert the first to Int64 and the second to Date.

Your problem is that the first row of your data frame is "Id_internal" and "Date".

How are you reading in your data? Perhaps you can change it so that your data doesn’t accidentally include the names of your variables.

Did you try my tryparse idea? That should fix it in the meantime. You can also do

df = df[2:end, :]

to get rid of the first row.
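
Once the stray first row is gone, the Date column mentioned above could be converted along these lines. This is only a sketch: it assumes the dates are stored as strings in the yyyy-mm-dd form shown in the table, and it uses a plain vector in place of the column.

```julia
using Dates

# Hypothetical column contents, copied from the sample rows above.
date_strings = ["2020-11-28", "2021-01-23", "2020-12-28"]

# Parse each string with an explicit format.
fmt = dateformat"yyyy-mm-dd"
dates = [Date(s, fmt) for s in date_strings]
```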

I tried
df[:2,:Id_internal ] = tryparse.(Int64,df[:2,:Id_internal ])
and
df[:2,:Id_internal ] = parse.(Int64,df[:2,:Id_internal ])

both gave me
ERROR: setindex! not defined for WeakRefStrings.StringArray{String,1}

I read my data as follows:
df_all = CSV.File("file.csv", delim = '\t') |> DataFrame
I then create a df with what I need
df = df_all[[:Id_internal, :Date]]

What is the version of CSV.jl and DataFrames.jl you are using?

That is indeed a very odd error message. To be honest I don’t know exactly why you are getting it. But note that you should be writing df[:, :Id_internal], not df[:2, :Id_internal]

cc @quinnj for why the user might have gotten such an odd error. I can’t replicate it.

This is old, deprecated syntax. It’s a concern that people are still finding this syntax in tutorials. Can you please post a link to the guide you are using to learn DataFrames?