The Kaggle Titanic machine learning competition is a great starter project. Before getting to the actual categorical modeling (will this passenger survive or not?), there’s some work to be done to get the data ready to find the patterns of survival. One attribute of interest is the passenger Name, which includes what I call their Title (Master, Mr, Ms, Miss, etc.). The Title matters because it could help estimate the passenger’s Age when that’s missing, and quite a few Ages are missing in the dataset. I enjoy seeing other entries in their language of choice, most often R or Python, and finding the unique ways entrants work with the data.
Julia and Python each have their own syntactic tricks for doing similar things. In one post to the Kaggle competition, another entrant listed this as a Pythonic way to retrieve the passenger’s Title:
dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
I want to separate the passenger’s Title from their Name, so how could I recreate this in Julia?
TITLE
df_combined.Title = [split.(split.(df_combined.Name, ", ")[i][2], ". ")[1] for i in eachindex(df_combined.Name)];
Whereas the Python code chains two str.split calls on the DataFrame’s Name column, my Julia implementation uses an Array comprehension with two nested split() function calls. Let’s look at both ways of getting this done by reviewing what the code is doing.
There’s a lot going on in this line of Python code:
- The first split tells Python to split the Name attribute where it finds the string ", " and…
- Keep the 2nd element of that split (index 1, since Python counts from 0) and then…
- Pass that 2nd element, which holds the title and the rest of the name after the surname, to another str.split…
- That splits on "." and keeps the 1st element (index 0) of the result, producing…
- A new entry in dataset['Title']
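The steps above can be traced on a single value. This is only a sketch: it uses Python’s built-in str.split on one plain string rather than the pandas column method, and the sample name is an assumed example of the “Surname, Title. Given names” pattern.

```python
# Trace the two splits on one sample name using plain str.split.
# The sample value is an illustrative assumption about the Name format.
name = "Braund, Mr. Owen Harris"

# Step 1: split on ", " into the surname and everything after it
after_comma = name.split(", ")

# Step 2: keep the 2nd element (index 1): title plus given names
title_and_given = after_comma[1]

# Step 3: split that on "." and keep the 1st element (index 0)
title = title_and_given.split(".")[0]

print(after_comma)      # ['Braund', 'Mr. Owen Harris']
print(title_and_given)  # Mr. Owen Harris
print(title)            # Mr
```

The pandas version applies the same two splits to every row at once, with expand=True turning each split into columns so that [1] and [0] pick a column rather than a list element.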
My Julia code may not be the most elegant or even the most concise, but after dozens of trial-and-error attempts with a bunch of functions I got it to work. How am I doing it?
In Python this would be called a “list comprehension”; in Julia it’s an Array comprehension. Here’s what happens:
- The inner split, on ", ", is broadcast over Name, and the 2nd element of the ith resulting array is kept (Julia counts from 1) and…
- The outer split, on ". ", runs against that return value, which is the title and the rest of the name, and…
- The 1st element of that array is kept, which is the title
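For comparison, the same comprehension shape can be sketched as a Python list comprehension. The names list below is a hypothetical stand-in for the Name column, and because split is a string method in Python the two calls are chained rather than nested.

```python
# Hypothetical sample names standing in for the Name column
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]

# First split on ", " and keep the 2nd piece (index 1),
# then split that on ". " and keep the 1st piece (index 0)
titles = [name.split(", ")[1].split(". ")[0] for name in names]

print(titles)  # ['Mr', 'Miss', 'Master']
```

Iterating over the names directly means each name is split once, instead of indexing into a split of the whole column on every pass.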
Both of these solutions, Python and Julia, get the job done: each pulls the title from the Name attribute and assigns it to a new attribute, Title. I find it easier to nest function calls in Julia than to chain them in Python.
If you’ve got an even easier or more concise way of pulling a nested string like this, I’d love to hear about it.