Type conversion driving me crazy

I’ve wasted half a day in frustration just trying to clean up some simple data. Basically, it’s a bunch of numbers as strings that I’ve bought into a data frame and now just want to remove missing values, spaces, remove commas in the numbers, and convert all these columns to integers.

By the time I’ve received these columns into the dataframe Julia has created the following column types:

Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
Vector{Union{Missing, String15}}
PooledArrays.PooledVector{Union{Missing, String7}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String7}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
PooledArrays.PooledVector{Union{Missing, String15}, UInt32, Vector{UInt32}}
Vector{Union{Missing, Int64}}

But they’re all just strings of numbers, so it should not be hard.

I have the following code:

for (name, col) in pairs(eachcol(df))
     coalesce.(df[:,name],0)
     replace.(df[:,name], "," => "")
end

but it’s generating the following error when it hits the replace line:

MethodError: no method matching similar(::Int64, ::Type{Any})

Can anyone help me?

julia> replace(0, ","=>"")
ERROR: MethodError: no method matching similar(::Int64, ::Type{Any})

julia> replace("0", ","=>"")
"0"

For sake of example, suppose we have the dataframe

julia> df = DataFrame(fullname = ["Jones, Alice", "Smith, Bob", missing], id = [1,2,3])
3×2 DataFrame
 Row │ fullname      id    
     │ String?       Int64 
─────┼─────────────────────
   1 │ Jones, Alice      1
   2 │ Smith, Bob        2
   3 │ missing           3

Then I’m now assuming the first line of the for-loop body is trying to do this:

julia> for name in names(df)
           df[!,name] = coalesce.(df[:,name],0)
       end

julia> df
3×2 DataFrame
 Row │ fullname      id    
     │ Any           Int64 
─────┼─────────────────────
   1 │ Jones, Alice      1
   2 │ Smith, Bob        2
   3 │ 0                 3

Note how the missing is now an Int64 but the rest of the “fullname” column are Strings, for the column type is now Any to hold both strings and integers.

But now this is an error (again, interpreting your 2nd loop-body code):

julia> for name in names(df)
           df[!,name] = replace.(df[:,name], "," => "")
       end
ERROR: MethodError: no method matching similar(::Int64, ::Type{Any})

since you’re effectively trying to do, in the third row,

replace(0, "," => "")

and replace doesn’t have a method for doing string replacement on integers (which kinda makes sense). Hope that helps.

2 Likes

So, this:

should probably be:

coalesce.(df[:,name], "0")
1 Like

I think this is the issue as the non {int} columns do the replace fine. Changing the coalesce substitution to a “0” string doesn’t help in this case, as the Vector{Union{Missing, Int64}} column actually didn’t have any missing values to coalesce. So I suppose I need to either skip that column or convert it to string first… Hrmmm

The earlier comments are correct, but do not solve your issue fully. The reason is that coalesce. and replace. parts will allocate new vectors and will not change the data in a data frame.
If you want to change the data frame do e.g.:

for (name, col) in pairs(eachcol(df))
     df[!, name] = replace.(coalesce.(col, "0"), "," => "")
end

or e.g.

transform!(df, All() .=> ByRow(x -> replace(coalesce(x, "0"), "," => "")), renamecols=false)
2 Likes

Thanks. I was wrestling with this a while, but this seems to work. I’ve slightly added to it, to parse the column to an integer:

for (name, col) in pairs(eachcol(df))
    df[!, name] = parse.(Int,replace.(coalesce.(col, 0), "," => ""))
end

Quick follow up question in case anyone’s still reading this…

If I only want to do this on columns meeting a certain criteria (for example if they have the word “numbers” in the column name, and only the first ten of these columns, like df[ :, r"numbers"][:,1:10], how can I limit it to only those columns?

names(df, r"numbers")[1:10] will give you the list of column names you want.

As a side note - you see so many element types because of performance optimizations that CSV.jl, which it seems you use, by default turns on. If you write pool=false and stringtype=String keyword arguments in your CSV.read call you turn off these optimizations and all columns will be Vector{Union{Missing, String}}.

4 Likes

Sorry to drag this on. I though I knew what I was doing but I clearly don’t…

I solved the above problem with the one line:

transform!(df names(df, r"unit")[1:16] .=> ByRow(x -> parse(Int,replace(coalesce(x, "0"), "," => ""))), renamecols=false) 

This put the relevant columns in integer format and allowed me to do some combine(groupby on these figures, as I’d hoped.

However, I’m now facing a harder to understand conundrum. I’ve made a new “long” data frame by stacking some these columns using stack(), which worked wonderfully, but it put them back in Any Type. I tried to do what I did above with this column, but check out the errors I get:

IssL.amount_issued
34606-element Vector{Any}:

parse.(Int,IssL.amount_issued)
ERROR: MethodError: no method matching parse(::Type{Int64}, ::Int64)

Int.(IssL.amount_issued)
MethodError: no method matching Int64(::Missing)

Int.(coalesce.(IssL.amount_issued,0))
ERROR: MethodError: no method matching Int64(::String7)

So, somehow there are strings in there again. And check this out…

a = findall(x->typeof(x)==Int64, IssL.amount_issued)
25112-element Vector{Int64}:

b = findall(x->typeof(x)==Int32, IssL.amount_issued)
7230-element Vector{Int64}:

c = findall(x->typeof(x)==String7, IssL.amount_issued)
2178-element Vector{Int64}:

d = findall(x->typeof(x)==Missing, IssL.amount_issued)
86-element Vector{Int64}:

So, according to Julia this new column of “Any” contains Int64, Int32, missing and strings, but look at the return for a,b,c and even d above. They’re all Vector{Int64}, and they look just like integers on my screen, even the missing! How can they be all these types too?

Am I going crazy? Whatever the cause, it’s stopping me doing further combine(groupby()) operations on this new data frame, which is returning:

combine(groupby(IssL,:project_id),:cctype,:year,:amount_issued => sum)
ERROR: MethodError: no method matching +(::Int64, ::String7)

How can I turn them back into Ints?

Well it appears I was selecting columns that were nor ints and my long table included elements that had not been converted. I changed the original solution to parse to Int all non-already-int columns only, with an addition to the one line solution:

transform!(df, names(df, r"numbers").=> ByRow(x -> typeof(x)!=Int ? parse(Int,replace(coalesce(x, "0"), "," => "")) : x), renamecols=false)