How to change form using InMemoryDatasets package?

I’ve got a dataset from kaggle, which is relate to travel, and there is a column called “accommodation cost” which have some value like these:

so there are three types, how to change the one with dollar sign and one with USD to just numbers? For example, $1800 => 1800, 1800USD =>1800.
Here are the link for raw datasets:
https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv
and below is the way to extract these data:

julia> using InMemoryDatasets,DLMReader,Chain
julia> import Downloads
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
julia> data=filereader(data)

I would like to use InMemoryDatasets package, but any other package are okay.
Thanks for any help!

I don’t think this is package specific, I don’t really know InMemoryDatasets but I guess a column there is some sort of vector, so you extract that vector from the data, replace the characters you don’t want and parse the result:

julia> x = ["100", "\$100", "100 USD"]
3-element Vector{String}:
 "100"
 "\$100"
 "100 USD"

julia> parse.(Int, replace.(x, "\$" => "", " USD" => ""))
3-element Vector{Int64}:
 100
 100
 100
1 Like

considering also the presence of some missing, I have not found a fully functional way to solve the problem.

here two different functions useful for the purpose. Which do an explicit check on missinbg


tocurr(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d+)( )?(USD)?"=>s"\1", ","=>"")) : e 
tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e 

df1=transform!(df, "Transportation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Transportation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2
df1=transform!(df, "Accommodation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Accommodation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2
1 Like

using Missing

it looks like just what was “missing” to get the functional answer

tocurr3(e)=passmissing(parse)(Int,passmissing(replace)(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2"))

df3=transform!(df, "Accommodation cost"=>ByRow(tocurr3)=>:ac_cost)
1 Like

Hi, thank you so much for the reply!!! But I have met some problems, so first I got this:


I have tried the solution of yours, but it seems can’t solve the comma in $1,500, I have tried to replace comma but still errors, do you have any idea how to solve this?

This is the code I get data:

using CSV,DataFrames
df=CSV.read("C:/Users/lyao387/.julia/Travel details dataset.csv", DataFrame;dateformat="mm/dd/yyyy")
acost=select(df,:11)

and this is the link for data:

really appreciate any advices, thanks a lot!!

you could use this, which also handles missing values as well as comma ones

julia> acost=select(df,:11)
139×1 DataFrame
 Row │ Accommodation cost 
     │ String15?
─────┼────────────────────
   1 │ 1200
   2 │ 800
   3 │ 1000
   4 │ 2000
   5 │ 700
   6 │ 1500
   7 │ 500
   8 │ 900
   9 │ 1200
  10 │ 2500
  11 │ 1000
  12 │ 800
  13 │ 3000
  14 │ 1400
  15 │ 600
  16 │ 900
  17 │ $900
  18 │ $1,500
  ⋮  │         ⋮
 122 │ 1300
 123 │ 700
 124 │ 1200
 125 │ 900
 126 │ 400
 127 │ 800
 128 │ missing
 129 │ 5000
 130 │ 7000
 131 │ 3000
 132 │ 6000
 133 │ 4000
 134 │ 8000
 135 │ 2500
 136 │ 5000
 137 │ 2000
 138 │ 6000
 139 │ 7000
          103 rows omitted

julia> tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
tocurr2 (generic function with 1 method)

julia> tocurr2.(df."Accommodation cost")
139-element Vector{Union{Missing, Int64}}:
 1200
  800
 1000
 2000
  700
 1500
  500
  900
 1200
 2500
 1000
  800
 3000
 1400
  600
  900
  900
 1500
 1200
 1200
    ⋮
  900
  500
 1300
  700
 1200
  900
  400
  800
     missing
 5000
 7000
 3000
 6000
 4000
 8000
 2500
 5000
 2000
 6000
 7000
1 Like

this way to complete the way without regular expressions

using Missings
passmissing(x->parse.(Int, replace.(x, "\$" => "", " USD" => "", ","=>""))).(df."Accommodation cost")
1 Like

Thank you!!!