Use joins to change country names

Hi, I have got a dataset with a city and country column. However, there are some names in country column are actually the same place but different names, for example:


row 7 and 123, I want both country name be Australia, the way I prefer is to use package named Countries List of country codes by alpha-2, alpha-3 code (ISO 3166) and joins function(leftjoins, semijoin…) from inmemorydataset package.
I have tried to use joins but it just didn’t work:

Here is the code to get the dataset which named ds and countries named count:

using InMemoryDatasets,DLMReader, Countries
import Downloads
data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
data=filereader(data, quotechar='"', dtformat=Dict(3:4 .=> dateformat"m/d/y"))
data=data[completecases(data),:]
modify!(data, 11 => x -> parse.(Int, replace.(x, "\$" =>"","USD" =>"",","=>"")))
modify!(data, 13 => x -> parse.(Int, replace.(x, "\$" =>"","USD" =>"",","=>"")))
split_comma_pair(str) = collect(match(r"([^,]+),?(.*)", str))
modify!(data, :Destination => byrow(Tuple∘split_comma_pair),
                         :Destination => splitter => [:city, :country])
ds=select(data, 14:15)
count=Dataset(all_countries())

really appreciate any advices.

I think there are two kinds of problems:

  1. the space after the comma
  2. uppercase characters

split_comma_pair(str) = let m=match(r"([^,]+)(, )?(.*)", str); (first(m.captures),last(m.captures)) end
modify!(data, :Destination => byrow(split_comma_pair),
                         :Destination => splitter => [:city, :country])
ds=select(data, 14:15)
modify!(ds, :country=>byrow(uppercase))
count=Dataset(all_countries())


leftjoin(ds, count, on=:country=>:alpha3)

a more general form of the split_comma… function

split_comma_pair(str)=Tuple(match(r"([^,]+),? *(.*)", str))

1 Like

Thanks for the reply!!!

Hi,
I got the dataset which is correct, but my problem is to replace to the name, for example replace USA to united states in column 17, rest are stay untouched:


Do you know an easy way to replace all these?
really appreciate.

this is one way

acount=Dataset(all_countries())

ccn=leftjoin(ds, acount, on=:country=>:alpha3)

coalesce.(ccn[:,:common_name],ccn[:,:country])

coalesce.(ccn[:,:common_name],data[:,:country]) # to get ProperCase
1 Like

Hi, I am so sorry to bother you again, my tutor want me to use modify to do this(from inmemorydatasets package) , do you know a function that can perform the same in inmemorydataset package as coalesce? I have tried replace but seems that function can’t do the missing value. Thanks for any advices, really appreciate

try something like this

julia> ds
5Γ—2 Dataset
 Row β”‚ x1        x2       
     β”‚ identity  identity
     β”‚ Int64?    Int64?
─────┼────────────────────
   1 β”‚       -1         1
   2 β”‚       10   missing
   3 β”‚        4   missing
   4 β”‚        5        -8
   5 β”‚        1   missing

julia> modify(ds,(:x1,:x2)=>byrow((x1,x2)->ismissing(x2) ? x1 : x2)=>:new_x)
5Γ—3 Dataset
 Row β”‚ x1        x2        new_x    
     β”‚ identity  identity  identity
     β”‚ Int64?    Int64?    Int64?
─────┼──────────────────────────────
   1 β”‚       -1         1         1
   2 β”‚       10   missing        10
   3 β”‚        4   missing         4
   4 β”‚        5        -8        -8
   5 β”‚        1   missing         1

1 Like

Thank you!!!