How to change form using InMemoryDatasets package?

akshdfyehd · March 13, 2023, 8:34am

I’ve got a dataset from kaggle, which is relate to travel, and there is a column called “accommodation cost” which have some value like these:

so there are three types, how to change the one with dollar sign and one with USD to just numbers? For example, $1800 => 1800, 1800USD =>1800.
Here are the link for raw datasets:
https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv
and below is the way to extract these data:

julia> using InMemoryDatasets,DLMReader,Chain
julia> import Downloads
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
julia> data=filereader(data)

I would like to use InMemoryDatasets package, but any other package are okay.
Thanks for any help!

nilshg · March 13, 2023, 9:41am

I don’t think this is package specific, I don’t really know InMemoryDatasets but I guess a column there is some sort of vector, so you extract that vector from the data, replace the characters you don’t want and parse the result:

julia> x = ["100", "\$100", "100 USD"]
3-element Vector{String}:
 "100"
 "\$100"
 "100 USD"

julia> parse.(Int, replace.(x, "\$" => "", " USD" => ""))
3-element Vector{Int64}:
 100
 100
 100

rocco_sprmnt21 · March 13, 2023, 5:14pm

considering also the presence of some missing, I have not found a fully functional way to solve the problem.

here two different functions useful for the purpose. Which do an explicit check on missinbg


tocurr(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d+)( )?(USD)?"=>s"\1", ","=>"")) : e 
tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e 

df1=transform!(df, "Transportation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Transportation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2
df1=transform!(df, "Accommodation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Accommodation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2

rocco_sprmnt21 · March 13, 2023, 10:40pm

using Missing

it looks like just what was “missing” to get the functional answer

tocurr3(e)=passmissing(parse)(Int,passmissing(replace)(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2"))

df3=transform!(df, "Accommodation cost"=>ByRow(tocurr3)=>:ac_cost)

akshdfyehd · March 15, 2023, 7:42am

Hi, thank you so much for the reply!!! But I have met some problems, so first I got this:

I have tried the solution of yours, but it seems can’t solve the comma in $1,500, I have tried to replace comma but still errors, do you have any idea how to solve this?

This is the code I get data:

using CSV,DataFrames
df=CSV.read("C:/Users/lyao387/.julia/Travel details dataset.csv", DataFrame;dateformat="mm/dd/yyyy")
acost=select(df,:11)

and this is the link for data:

really appreciate any advices, thanks a lot!!

rocco_sprmnt21 · March 15, 2023, 8:36am

you could use this, which also handles missing values as well as comma ones

julia> acost=select(df,:11)
139×1 DataFrame
 Row │ Accommodation cost 
     │ String15?
─────┼────────────────────
   1 │ 1200
   2 │ 800
   3 │ 1000
   4 │ 2000
   5 │ 700
   6 │ 1500
   7 │ 500
   8 │ 900
   9 │ 1200
  10 │ 2500
  11 │ 1000
  12 │ 800
  13 │ 3000
  14 │ 1400
  15 │ 600
  16 │ 900
  17 │ $900
  18 │ $1,500
  ⋮  │         ⋮
 122 │ 1300
 123 │ 700
 124 │ 1200
 125 │ 900
 126 │ 400
 127 │ 800
 128 │ missing
 129 │ 5000
 130 │ 7000
 131 │ 3000
 132 │ 6000
 133 │ 4000
 134 │ 8000
 135 │ 2500
 136 │ 5000
 137 │ 2000
 138 │ 6000
 139 │ 7000
          103 rows omitted

julia> tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
tocurr2 (generic function with 1 method)

julia> tocurr2.(df."Accommodation cost")
139-element Vector{Union{Missing, Int64}}:
 1200
  800
 1000
 2000
  700
 1500
  500
  900
 1200
 2500
 1000
  800
 3000
 1400
  600
  900
  900
 1500
 1200
 1200
    ⋮
  900
  500
 1300
  700
 1200
  900
  400
  800
     missing
 5000
 7000
 3000
 6000
 4000
 8000
 2500
 5000
 2000
 6000
 7000

rocco_sprmnt21 · March 15, 2023, 8:48am

this way to complete the way without regular expressions

using Missings
passmissing(x->parse.(Int, replace.(x, "\$" => "", " USD" => "", ","=>""))).(df."Accommodation cost")

akshdfyehd · March 16, 2023, 6:37am

Thank you!!!

Topic		Replies	Views
How to change the entire column form for different situation? General Usage question , inmemorydatasets	2	431	March 14, 2023
Use joins to change country names General Usage question , package , inmemorydatasets	7	799	April 23, 2023
Cleaning Data New to Julia	3	319	May 17, 2023
Problem in reading data with DLMreader package General Usage question , package	0	127	May 19, 2023
Split country column into a column with city and a column with country General Usage question , package , dataframes , inmemorydatasets	5	308	March 28, 2023

How to change form using InMemoryDatasets package?

Related topics