How to parse/convert integers in DataFrame to float numbers

Hi.

I am a beginner in Julia. I am trying to get data from a single row and perform statistics on it like taking the mean, median, etc. However, the error I am getting is telling me that I am trying to perform calculations on an incompatible data type. I tried to use a for-if nest to see and convert the non-float data to float:

begin
	algeria = df[df."Country/Region" .== "Algeria", 4:end]
	
	for i = 4:size(algeria, 2)
    	if eltype(algeria[!, i]) .!= Float64
        	algeria[!, i] = parse.(Float64, algeria[!, i])
    	end
	end
	
	Statistics.mean(eachcol(algeria))
end

But the error persists.

This is my full error:

MethodError: no method matching parse(::Type{Float64}, ::Array{Union{Missing, Int64},1})

Closest candidates are:

parse(::Type{T}, !Matched::AbstractString; kwargs...) where T<:Real at parse.jl:376

I think that the code you are running and the code in this snippet are not the same.

The line

    	if eltype(algeria[!, i]) .!= Float64

should actually error, I think. And the fact that it doesn’t is odd.

On the other hand, the call parse.(Float64, algeria[!, i]) should not error, but is actually throwing the error that you just showed.

Here is an MWE that does what you want

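julia> using DataFrames, Statistics  # passmissing comes from Missings.jl, which DataFrames re-exports

julia> begin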
       df = DataFrame()
       N = 100
       df."Country/Region" = fill("Algeria", N)
       df.x_float = rand(N)
       df.x_string = [string.(rand(N-1)); missing]
       
       for i in 2:size(df, 2)
           v = df[!, i]
           if eltype(v) != Float64
               df[!, i] = passmissing(parse).(Float64, v)
           end
       end
       
       mean(eachcol(df[:, 2:end]))
       end

Note that you want to use passmissing(parse) instead of plain parse, so that missing values pass through unchanged instead of raising an error.

No, dot operators are equivalent to scalar operators if all of the arguments are scalars. That is, it works for the same reason that sqrt.(4) is 2.0 (and sqrt.(4) .== 2.0 is true).
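As a minimal sketch of that point, using only Base Julia (the vector v below is a made-up stand-in for one column of algeria): broadcasting over scalar arguments, types included, gives the same result as the plain scalar call.

```julia
# Broadcasting with only scalar arguments returns a scalar, not an array.
a = sqrt.(4)             # same as sqrt(4)
b = Int64 .!= Float64    # types are treated as scalars when broadcasting
c = Int64 != Float64     # the undotted comparison gives the same answer

# Hypothetical column standing in for algeria[!, i]
v = Union{Missing, Int64}[1, 2, missing]
d = eltype(v) .!= Float64   # a single Bool, so it is usable in an `if`
```

So the dotted comparison in the original loop is unusual but harmless; the undotted != is the more conventional spelling.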


My Dataframe is Parquet in Arrow so idk if df = DataFrame() is something I want.
What is N? Why is it 100?
Why are you taking a random float of N?

Do you mind explaining your code?

My Dataframe is Parquet in Arrow so idk if df = DataFrame() is something I want.

Just to be clear, and I think this has been mentioned in other threads, once you read something into memory, it is all the same. A DataFrame is a DataFrame, and it doesn’t matter whether it came from Parquet, CSV, or was created in the code like I did above. Parquet in Arrow does not mean anything once you have a DataFrame.

My code creates a DataFrame with 100 rows, which is why N is 100.

rand(N) just creates a vector of length N of random numbers.


Why are you creating a dataframe with 100 rows? I already have a full dataframe…

I’m merely creating a small example to show you how the code works. That’s all. 100 was just an arbitrary number.

passmissing is not defined

The error is telling you that you can’t parse an Int64. parse converts text (String) to a number (float or int), and this column already holds integers. In this case you probably want:

algeria[!, i] = float.(algeria[!, i])
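To illustrate the difference, here is a small sketch using only Base and the Statistics standard library. The vector v is invented for the example and stands in for a Union{Missing, Int64} column such as the one named in the error message.

```julia
using Statistics

# Hypothetical integer column with a missing entry
v = Union{Missing, Int64}[10, 20, missing]

# float works on numbers and leaves missing untouched
f = float.(v)                 # element type becomes Union{Missing, Float64}
m = mean(skipmissing(f))      # skip the missing entry when averaging

# parse is for strings only
p = parse(Float64, "1.5")

# Calling parse on an integer raises a MethodError, as in the original post
parse_fails = try
    parse(Float64, 10)
    false
catch err
    err isa MethodError
end
```

If a column holds strings instead, float. fails and passmissing(parse) is the tool to reach for, which is why the column's element type decides which conversion applies.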

MethodError: no method matching AbstractFloat(::String)

can you paste the result of:

describe(df)

Describe isn’t defined. How can I import it?

It’s from DataFrames.jl. Are you not using DataFrames? If you only did import DataFrames, call it as DataFrames.describe.

I want to see the eltype column, in its entirety.

edit: if it’s too long, I guess unique(DataFrames.describe(df).eltype) will do.
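For readers unsure about the import versus using distinction mentioned above, here is a short sketch with the Statistics standard library; the same rule applies to DataFrames and describe.

```julia
import Statistics                    # binds only the module name
m1 = Statistics.mean([1, 2, 3])      # exported functions must be qualified

using Statistics                     # also brings exported names into scope
m2 = mean([1, 2, 3])                 # now the bare name works
```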

Union{Missing, Int64}
Union{Missing, String}
Union{Missing, Float64}

ok, so you must be including some columns you shouldn’t be including by doing

for i = 4:size(algeria, 2)

The issue is that you seem to be converting columns that are not meant to be numbers. You can try this:

if eltype(algeria[!, i]) .!= Float64
    try
        algeria[!, i] = float.(algeria[!, i])
    catch
        println(i)
    end
end

Then, once you know which column i is the offending one, inspect it (algeria[!, i]) to see what it contains and whether it is actually meant to be read as a number.

The only columns that contain String are: Symbol("Province/State") and Symbol("Country/Region"), which I don’t think can be understood as numbers to begin with.

The if block goes inside the for loop or not?

yes.

Also, I think it would be easier if you could share the data and provide a complete example script that includes the parsing etc. I guess your data is this? https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

Yes, that is my data. And here is what I did:

begin
	algeria = df[df."Country/Region" .== "Algeria", 4:end]
	findall(eltype.(df[!, i] for i = 1:size(df, 2)) .!= Float64)
	
	# for i in 2:size(df, 2)
	# 	v = df[!, i]
	# 		if eltype(v) != Float64
	# 			df[!, i] = passmissing(parse).(Float64, v)
	# 	end
	# end
	
	for i = 4:size(algeria, 2)
		if eltype(algeria[!, i]) .!= Float64
    		try
        		algeria[!, i] = float.(algeria[!, i])
    		catch
        		println(i)
    		end
		end
	end
	
	Statistics.mean(eachcol(algeria))
end

It’s throwing this error:

MethodError: no method matching +(::Float64, ::String)

Closest candidates are:

+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:538

+(::Float64, !Matched::Float64) at float.jl:401

+(!Matched::ChainRulesCore.One, ::Any) at /home/onur/.julia/packages/ChainRulesCore/7d1hl/src/differential_arithmetic.jl:94
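The +(::Float64, ::String) error means that at least one column still in algeria holds String values, so selecting columns by position (4:end) did not exclude every text column. One possible way around this, sketched here under assumptions, is to pick columns by element type instead of by position. The small table below is invented for illustration (its names and values are not the real data), and the variable names numeric_cols, row_values, row_mean and row_median are hypothetical.

```julia
using DataFrames, Statistics

# Made-up stand-in for the real table; names and values are illustrative only.
df = DataFrame(
    "Province/State" => Union{Missing, String}[missing, missing],
    "Country/Region" => Union{Missing, String}["Algeria", "Other"],
    "1/22/20"        => Union{Missing, Int64}[0, 0],
    "1/23/20"        => Union{Missing, Int64}[4, 2],
    "1/24/20"        => Union{Missing, Int64}[8, missing],
)

# Select columns by element type rather than by position.
numeric_cols = [n for n in names(df) if nonmissingtype(eltype(df[!, n])) <: Number]

# Convert the numeric columns to floating point; float(missing) stays missing.
for n in numeric_cols
    df[!, n] = float.(df[!, n])
end

# Statistics over the single Algeria row, skipping missing entries.
i = findfirst(isequal("Algeria"), df[!, "Country/Region"])
row = df[i, :]
row_values = [row[n] for n in numeric_cols]
row_mean = mean(skipmissing(row_values))
row_median = median(skipmissing(row_values))
```

Note that mean(eachcol(algeria)) averages the column vectors element by element. For statistics over one country's row, collecting that row's numeric values into a vector first, as above, is closer to what the question describes.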