I’ve got a dataset from kaggle, which is relate to travel, and there is a column called “accommodation cost” which have some value like these:
so there are three types, how to change the one with dollar sign and one with USD to just numbers? For example, $1800 => 1800, 1800USD =>1800.
Here are the link for raw datasets:
https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv
and below is the way to extract these data:
julia> using InMemoryDatasets,DLMReader,Chain
julia> import Downloads
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
julia> data=filereader(data)
I would like to use InMemoryDatasets package, but any other package are okay.
Thanks for any help!
nilshg
March 13, 2023, 9:41am
2
I don’t think this is package specific, I don’t really know InMemoryDatasets
but I guess a column there is some sort of vector, so you extract that vector from the data, replace the characters you don’t want and parse the result:
julia> x = ["100", "\$100", "100 USD"]
3-element Vector{String}:
"100"
"\$100"
"100 USD"
julia> parse.(Int, replace.(x, "\$" => "", " USD" => ""))
3-element Vector{Int64}:
100
100
100
1 Like
considering also the presence of some missing, I have not found a fully functional way to solve the problem.
here two different functions useful for the purpose. Which do an explicit check on missinbg
tocurr(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d+)( )?(USD)?"=>s"\1", ","=>"")) : e
tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
df1=transform!(df, "Transportation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Transportation cost"=>ByRow(tocurr2)=>:tr_cost)
df1===df2
df1=transform!(df, "Accommodation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Accommodation cost"=>ByRow(tocurr2)=>:tr_cost)
df1===df2
1 Like
using Missing
it looks like just what was “missing” to get the functional answer
tocurr3(e)=passmissing(parse)(Int,passmissing(replace)(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2"))
df3=transform!(df, "Accommodation cost"=>ByRow(tocurr3)=>:ac_cost)
1 Like
Hi, thank you so much for the reply!!! But I have met some problems, so first I got this:
I have tried the solution of yours, but it seems can’t solve the comma in $1,500, I have tried to replace comma but still errors, do you have any idea how to solve this?
This is the code I get data:
using CSV,DataFrames
df=CSV.read("C:/Users/lyao387/.julia/Travel details dataset.csv", DataFrame;dateformat="mm/dd/yyyy")
acost=select(df,:11)
and this is the link for data:
really appreciate any advices, thanks a lot!!
you could use this, which also handles missing values as well as comma ones
julia> acost=select(df,:11)
139×1 DataFrame
Row │ Accommodation cost
│ String15?
─────┼────────────────────
1 │ 1200
2 │ 800
3 │ 1000
4 │ 2000
5 │ 700
6 │ 1500
7 │ 500
8 │ 900
9 │ 1200
10 │ 2500
11 │ 1000
12 │ 800
13 │ 3000
14 │ 1400
15 │ 600
16 │ 900
17 │ $900
18 │ $1,500
⋮ │ ⋮
122 │ 1300
123 │ 700
124 │ 1200
125 │ 900
126 │ 400
127 │ 800
128 │ missing
129 │ 5000
130 │ 7000
131 │ 3000
132 │ 6000
133 │ 4000
134 │ 8000
135 │ 2500
136 │ 5000
137 │ 2000
138 │ 6000
139 │ 7000
103 rows omitted
julia> tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
tocurr2 (generic function with 1 method)
julia> tocurr2.(df."Accommodation cost")
139-element Vector{Union{Missing, Int64}}:
1200
800
1000
2000
700
1500
500
900
1200
2500
1000
800
3000
1400
600
900
900
1500
1200
1200
⋮
900
500
1300
700
1200
900
400
800
missing
5000
7000
3000
6000
4000
8000
2500
5000
2000
6000
7000
1 Like
this way to complete the way without regular expressions
using Missings
passmissing(x->parse.(Int, replace.(x, "\$" => "", " USD" => "", ","=>""))).(df."Accommodation cost")
1 Like