How to change form using InMemoryDatasets package?

I’ve got a dataset from kaggle, which is relate to travel, and there is a column called “accommodation cost” which have some value like these:

so there are three types, how to change the one with dollar sign and one with USD to just numbers? For example, $1800 => 1800, 1800USD =>1800.
Here are the link for raw datasets:
https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv
and below is the way to extract these data:

julia> using InMemoryDatasets,DLMReader,Chain
julia> import Downloads
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
julia> data=filereader(data)

I would like to use InMemoryDatasets package, but any other package are okay.
Thanks for any help!

I don’t think this is package specific, I don’t really know InMemoryDatasets but I guess a column there is some sort of vector, so you extract that vector from the data, replace the characters you don’t want and parse the result:

julia> x = ["100", "\$100", "100 USD"]
3-element Vector{String}:
 "100"
 "\$100"
 "100 USD"

julia> parse.(Int, replace.(x, "\$" => "", " USD" => ""))
3-element Vector{Int64}:
 100
 100
 100

considering also the presence of some missing, I have not found a fully functional way to solve the problem.

here two different functions useful for the purpose. Which do an explicit check on missinbg


tocurr(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d+)( )?(USD)?"=>s"\1", ","=>"")) : e 
tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e 

df1=transform!(df, "Transportation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Transportation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2
df1=transform!(df, "Accommodation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Accommodation cost"=>ByRow(tocurr2)=>:tr_cost)

df1===df2

using Missing

it looks like just what was “missing” to get the functional answer

tocurr3(e)=passmissing(parse)(Int,passmissing(replace)(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2"))

df3=transform!(df, "Accommodation cost"=>ByRow(tocurr3)=>:ac_cost)

Hi, thank you so much for the reply!!! But I have met some problems, so first I got this:


I have tried the solution of yours, but it seems can’t solve the comma in $1,500, I have tried to replace comma but still errors, do you have any idea how to solve this?

This is the code I get data:

using CSV,DataFrames
df=CSV.read("C:/Users/lyao387/.julia/Travel details dataset.csv", DataFrame;dateformat="mm/dd/yyyy")
acost=select(df,:11)

and this is the link for data:

really appreciate any advices, thanks a lot!!

you could use this, which also handles missing values as well as comma ones

julia> acost=select(df,:11)
139×1 DataFrame
 Row │ Accommodation cost 
     │ String15?
─────┼────────────────────
   1 │ 1200
   2 │ 800
   3 │ 1000
   4 │ 2000
   5 │ 700
   6 │ 1500
   7 │ 500
   8 │ 900
   9 │ 1200
  10 │ 2500
  11 │ 1000
  12 │ 800
  13 │ 3000
  14 │ 1400
  15 │ 600
  16 │ 900
  17 │ $900
  18 │ $1,500
  ⋮  │         ⋮
 122 │ 1300
 123 │ 700
 124 │ 1200
 125 │ 900
 126 │ 400
 127 │ 800
 128 │ missing
 129 │ 5000
 130 │ 7000
 131 │ 3000
 132 │ 6000
 133 │ 4000
 134 │ 8000
 135 │ 2500
 136 │ 5000
 137 │ 2000
 138 │ 6000
 139 │ 7000
          103 rows omitted

julia> tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
tocurr2 (generic function with 1 method)

julia> tocurr2.(df."Accommodation cost")
139-element Vector{Union{Missing, Int64}}:
 1200
  800
 1000
 2000
  700
 1500
  500
  900
 1200
 2500
 1000
  800
 3000
 1400
  600
  900
  900
 1500
 1200
 1200
    ⋮
  900
  500
 1300
  700
 1200
  900
  400
  800
     missing
 5000
 7000
 3000
 6000
 4000
 8000
 2500
 5000
 2000
 6000
 7000

this way to complete the way without regular expressions

using Missings
passmissing(x->parse.(Int, replace.(x, "\$" => "", " USD" => "", ","=>""))).(df."Accommodation cost")

Thank you!!!