Split country column into a column with city and a column with country

akshdfyehd · March 25, 2023, 7:13am

Hi, i have got a travel dataset and I need to split the destination column into two columns, one column is city and another column is country name, for example, “Sydney, Australia” will be split into sydney in city column and australia in country column.
This is the original data:

The problem is, some place only have city name and some place have both city name and country.
Any advices will be really appreciated!!!

This is the code to read the data:

using InMemoryDatasets,DLMReader
import Downloads
data=Downloads.download("https://raw.githubusercontent.com/akshdfyehd/travel/main/Travel%20details%20dataset.csv")
data=filereader(data, quotechar='"', dtformat=Dict(3:4 .=> dateformat"m/d/y"))
data=data[completecases(data),:]

nilshg · March 25, 2023, 8:50am

Note that none of your questions are really about working with InMemoryDatasets (or DataFrames or similar), they are generally about how to work with simple vectors so it can often be edifying to reduce the MWE to just that.

In your case:

julia> x = ["London", "London, UK"]
2-element Vector{String}:
 "London"
 "London, UK"

julia> split.(x, ", ")
2-element Vector{Vector{SubString{String}}}:
 ["London"]
 ["London", "UK"]

You then need to think about how you deal with the fact that some country names are missing - one way would be to construct a tuple where the second element is just an empty string if the country is missing:

julia> (y -> length(y) == 1 ? (only(y), "") : (y...,)).(split.(x, ", "))
2-element Vector{Tuple{SubString{String}, AbstractString}}:
 ("London", "")
 ("London", "UK")

Jollywatt · March 25, 2023, 11:08am

A regular expressions way:

julia> split_comma_pair(str) = collect(match(r"([^,]+),?(.*)", str))

Example usage with a data frame:

julia> using DataFrames

julia> df = DataFrame(destination = ["London", "London, UK"]);

julia> transform(df, :destination => ByRow(split_comma_pair) => [:city, :country])
2×3 DataFrame
 Row │ destination  city       country
     │ String       SubStrin…  SubStrin…
─────┼───────────────────────────────────
   1 │ London       London
   2 │ London, UK   London      UK

Edit: Annoyingly, InMemoryDatasets.jl has its own slightly different way of doing column transformations:

julia> ds = Dataset(:destination => ["London", "London, UK"]);

julia> modify(ds, :destination => byrow(Tuple∘split_comma_pair),
                  :destination => splitter => [:city, :country])
2×3 Dataset
 Row │ destination        city        country
     │ identity           identity    identity
     │ Tuple…?            SubStrin…?  SubStrin…?
─────┼───────────────────────────────────────────
   1 │ ("London", "")     London
   2 │ ("London", " UK")  London       UK

akshdfyehd · March 28, 2023, 8:10am

Thanks!

akshdfyehd · March 28, 2023, 8:20am

Hi, thank you so much for the reply, Can you please explain this part? I don’t quite understand this part, really appreciate!!

Jollywatt · March 28, 2023, 10:57am

Sure: r"..." is Julia’s syntax for a regular expression literal. The regular expression

r"([^,]+),?\s*(.*)"

has two capture groups (enclosed in parens): the first capture group ([^,]+) matches one or more (+) non-comma characters. After this group, it optionally matches a comma (,?) and zero or more spaces (\s*) before getting to the next capture group (.*) which matches everything remaining.

When using this regex with match, you can access the matched capture groups by indexing (or collecting).

julia> match(r"([^,]+),?\s*(.*)", "Th!s,  and, that")
RegexMatch("Th!s,  and, that", 1="Th!s", 2="and, that")

julia> ans[1], ans[2]
("Th!s", "and, that")

Topic		Replies	Views
Use joins to change country names General Usage question , package , inmemorydatasets	7	801	April 23, 2023
Separating a column into a variable number of possible columns Data dataframes	22	1408	October 3, 2021
Split Column in Dataframe New to Julia dataframes	5	2192	February 4, 2022
JuliaDB - split one column into n columns General Usage juliadb	1	1446	February 12, 2019
How to change the entire column form for different situation? General Usage question , inmemorydatasets	2	431	March 14, 2023

Split country column into a column with city and a column with country

Related topics