I’ve got a dataset from kaggle, which is relate to travel, and there is a column called “accommodation cost” which have some value like these:
so there are three types, how to change the one with dollar sign and one with USD to just numbers? For example, $1800 => 1800, 1800USD =>1800.
Here are the link for raw datasets:
https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv
and below is the way to extract these data:
julia> using InMemoryDatasets,DLMReader,Chain
julia> import Downloads
julia> data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
julia> data=filereader(data)
I would like to use InMemoryDatasets package, but any other package are okay.
Thanks for any help!
nilshg
March 13, 2023, 9:41am
2
I don’t think this is package specific, I don’t really know InMemoryDatasets but I guess a column there is some sort of vector, so you extract that vector from the data, replace the characters you don’t want and parse the result:
julia> x = ["100", "\$100", "100 USD"]
3-element Vector{String}:
"100"
"\$100"
"100 USD"
julia> parse.(Int, replace.(x, "\$" => "", " USD" => ""))
3-element Vector{Int64}:
100
100
100
considering also the presence of some missing, I have not found a fully functional way to solve the problem.
here two different functions useful for the purpose. Which do an explicit check on missinbg
tocurr(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d+)( )?(USD)?"=>s"\1", ","=>"")) : e
tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
df1=transform!(df, "Transportation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Transportation cost"=>ByRow(tocurr2)=>:tr_cost)
df1===df2
df1=transform!(df, "Accommodation cost"=>ByRow(tocurr)=>:tr_cost)
df2=transform!(df, "Accommodation cost"=>ByRow(tocurr2)=>:tr_cost)
df1===df2
using Missing
it looks like just what was “missing” to get the functional answer
tocurr3(e)=passmissing(parse)(Int,passmissing(replace)(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2"))
df3=transform!(df, "Accommodation cost"=>ByRow(tocurr3)=>:ac_cost)
Hi, thank you so much for the reply!!! But I have met some problems, so first I got this:
I have tried the solution of yours, but it seems can’t solve the comma in $1,500, I have tried to replace comma but still errors, do you have any idea how to solve this?
This is the code I get data:
using CSV,DataFrames
df=CSV.read("C:/Users/lyao387/.julia/Travel details dataset.csv", DataFrame;dateformat="mm/dd/yyyy")
acost=select(df,:11)
and this is the link for data:
really appreciate any advices, thanks a lot!!
you could use this, which also handles missing values as well as comma ones
julia> acost=select(df,:11)
139×1 DataFrame
Row │ Accommodation cost
│ String15?
─────┼────────────────────
1 │ 1200
2 │ 800
3 │ 1000
4 │ 2000
5 │ 700
6 │ 1500
7 │ 500
8 │ 900
9 │ 1200
10 │ 2500
11 │ 1000
12 │ 800
13 │ 3000
14 │ 1400
15 │ 600
16 │ 900
17 │ $900
18 │ $1,500
⋮ │ ⋮
122 │ 1300
123 │ 700
124 │ 1200
125 │ 900
126 │ 400
127 │ 800
128 │ missing
129 │ 5000
130 │ 7000
131 │ 3000
132 │ 6000
133 │ 4000
134 │ 8000
135 │ 2500
136 │ 5000
137 │ 2000
138 │ 6000
139 │ 7000
103 rows omitted
julia> tocurr2(e)=!ismissing(e) ? parse(Int,replace(e, r"\$?(\d),?(\d+)( )?(USD)?"=>s"\1\2")) : e
tocurr2 (generic function with 1 method)
julia> tocurr2.(df."Accommodation cost")
139-element Vector{Union{Missing, Int64}}:
1200
800
1000
2000
700
1500
500
900
1200
2500
1000
800
3000
1400
600
900
900
1500
1200
1200
⋮
900
500
1300
700
1200
900
400
800
missing
5000
7000
3000
6000
4000
8000
2500
5000
2000
6000
7000
this way to complete the way without regular expressions
using Missings
passmissing(x->parse.(Int, replace.(x, "\$" => "", " USD" => "", ","=>""))).(df."Accommodation cost")