Split country column into a column with city and a column with country

Hi, i have got a travel dataset and I need to split the destination column into two columns, one column is city and another column is country name, for example, “Sydney, Australia” will be split into sydney in city column and australia in country column.
This is the original data:

The problem is, some place only have city name and some place have both city name and country.
Any advices will be really appreciated!!!

This is the code to read the data:

using InMemoryDatasets,DLMReader
import Downloads
data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
data=filereader(data, quotechar='"', dtformat=Dict(3:4 .=> dateformat"m/d/y"))
data=data[completecases(data),:]

Note that none of your questions are really about working with InMemoryDatasets (or DataFrames or similar), they are generally about how to work with simple vectors so it can often be edifying to reduce the MWE to just that.

In your case:

julia> x = ["London", "London, UK"]
2-element Vector{String}:
 "London"
 "London, UK"

julia> split.(x, ", ")
2-element Vector{Vector{SubString{String}}}:
 ["London"]
 ["London", "UK"]

You then need to think about how you deal with the fact that some country names are missing - one way would be to construct a tuple where the second element is just an empty string if the country is missing:

julia> (y -> length(y) == 1 ? (only(y), "") : (y...,)).(split.(x, ", "))
2-element Vector{Tuple{SubString{String}, AbstractString}}:
 ("London", "")
 ("London", "UK")
1 Like

A regular expressions way:

julia> split_comma_pair(str) = collect(match(r"([^,]+),?(.*)", str))

Example usage with a data frame:

julia> using DataFrames

julia> df = DataFrame(destination = ["London", "London, UK"]);

julia> transform(df, :destination => ByRow(split_comma_pair) => [:city, :country])
2×3 DataFrame
 Row │ destination  city       country
     │ String       SubStrin…  SubStrin…
─────┼───────────────────────────────────
   1 │ London       London
   2 │ London, UK   London      UK

Edit: Annoyingly, InMemoryDatasets.jl has its own slightly different way of doing column transformations:

julia> ds = Dataset(:destination => ["London", "London, UK"]);

julia> modify(ds, :destination => byrow(Tuple∘split_comma_pair),
                  :destination => splitter => [:city, :country])
2×3 Dataset
 Row │ destination        city        country
     │ identity           identity    identity
     │ Tuple…?            SubStrin…?  SubStrin…?
─────┼───────────────────────────────────────────
   1 │ ("London", "")     London
   2 │ ("London", " UK")  London       UK
1 Like

Thanks!

Hi, thank you so much for the reply, Can you please explain this part? I don’t quite understand this part, really appreciate!!

Sure: r"..." is Julia’s syntax for a regular expression literal. The regular expression

r"([^,]+),?\s*(.*)"

has two capture groups (enclosed in parens): the first capture group ([^,]+) matches one or more (+) non-comma characters. After this group, it optionally matches a comma (,?) and zero or more spaces (\s*) before getting to the next capture group (.*) which matches everything remaining.

When using this regex with match, you can access the matched capture groups by indexing (or collecting).

julia> match(r"([^,]+),?\s*(.*)", "Th!s,  and, that")
RegexMatch("Th!s,  and, that", 1="Th!s", 2="and, that")

julia> ans[1], ans[2]
("Th!s", "and, that")
1 Like