How to score top 3% in Kaggle’s “Titanic — Machine Learning from Disaster” competition
Hey, fellow machine learning enthusiasts. If you’ve ever had a try at a Kaggle competition, chances are you’re already familiar with the Titanic dataset.
This competition is about predicting whether a passenger survived the Titanic disaster.
With relatively little effort it is possible to reach the top 30% of participants. Unfortunately, many of the top scorers train their model on an external dataset and thus obtain a model that classifies the test dataset with 100% accuracy. This means that you have to make an extra effort to get into the top 3%.
Aim of this article:
- Explain step by step the end-to-end data pipeline that is needed to score top 3%.
- Discuss the thought process of a Machine Learning Engineer / Data Scientist in Data Cleaning and Feature Engineering.
- Make it more difficult for future participants to stand out :)
You can find the complete code at:
1. Getting the data:
In my case the train.csv as well as test.csv are located in the current working directory. Using pandas we can read and convert the .csv files to pandas dataframes:
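The loading step can be sketched as follows. The article reads the real train.csv and test.csv from the working directory; to keep the snippet self-contained, it parses a tiny made-up sample with the same columns instead (the two passenger rows are illustrative only):

```python
import io
import pandas as pd

# In practice, read the Kaggle files from the current working directory:
#   train_df = pd.read_csv("train.csv")
#   test_df = pd.read_csv("test.csv")
# Self-contained stand-in: a two-row sample with the same 12 columns.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
)
train_df = pd.read_csv(sample)
print(train_df.shape)  # rows x columns
```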
Next, we can investigate the data to get a better understanding of the available features:
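A first look usually combines `head()`, `describe()` and `info()`. The sketch below runs them on a small stand-in frame (the real train_df has 891 rows and 12 columns):

```python
import pandas as pd

# toy stand-in for the Titanic training frame
train_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})
print(train_df.head())      # first rows
print(train_df.describe())  # summary statistics of the numeric features
train_df.info()             # dtypes and non-null counts per column
```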
2. Exploratory Data Analysis
Typically, it is not necessary to use each available feature. Many of them do not provide additional information for the model and increase the training time unnecessarily. For this reason it is essential to explore which features should be considered.
I would argue that the most important skill of a Machine Learning Engineer/Data Scientist is to be unbiased, not to assume things, and to ask good questions.
In the very first step it is helpful to check if and how many entries are missing in the dataset.
2.1 Question 1: Does the dataset contain missing values?
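Counting the NaNs per column answers this directly. A minimal sketch on a toy frame (the real train_df has 891 rows):

```python
import numpy as np
import pandas as pd

# toy frame standing in for train_df
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})
# count the missing entries per feature and rank them
missing = train_df.isnull().sum().sort_values(ascending=False)
print(missing)
```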
From the above plots, it can be seen that both the training and the test dataset contain features with missing values. The sparsest features are “Age” and “Cabin”. A naive approach would be to remove these features completely from the dataset. However, since we do not know how much information they provide, they must be investigated further first. Maybe it makes sense to impute the missing values using sophisticated methods.
2.2 Question 2: How many passengers survived?
If you prefer plots you can define the function below to plot a bar chart:
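The article's helper is not shown here; a minimal sketch of such a plotting function (the name `plot_bar` and the toy frame are my own) could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_bar(df: pd.DataFrame, feature: str) -> pd.Series:
    """Plot the value counts of `feature` as a bar chart and return the counts."""
    counts = df[feature].value_counts()
    counts.plot(kind="bar", rot=0, title=feature)
    plt.ylabel("count")
    return counts

# toy stand-in: 3 deaths, 2 survivors
train_df = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})
counts = plot_bar(train_df, "Survived")
print(counts)
```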
As expected, the majority of passengers in the training data died; only 38% survived the disaster. The training data thus suffers from class imbalance, but it is not severe, which is why I will not consider techniques like resampling to tackle it.
2.3 Question 3: Is the likelihood of survival dependent on gender?
In 1912, about 110 years ago, women were generally considered to be the weaker sex that should be protected. Based on the data, we can investigate whether more women actually survived.
Here, we can clearly see that even though the majority of passengers were men, the majority of survivors were women. The key observation is that the survival rate of female passengers is 4 times higher than that of male passengers. This seems to confirm that “women and children first” was indeed a rule to which men adhered.
2.4 Question 4: Could it be that the class to which a passenger belonged correlates with the probability of survival?
From the plots and tables above it becomes clear that the Pclass is an important factor to consider.
- Most passengers had class 3 tickets, yet only 24% of class 3 passengers survived.
- Almost 63% of the class 1 passengers survived.
- Approx. 50% of the class 2 passengers survived.
However, it is not yet clear whether the class or the gender is the underlying and deciding factor, which raises another important question:
2.5 Question 5: Is the higher survival rate in class 1 due to the class itself or to a gender distribution in which female passengers dominate?
Here, we can see that the question raised above was justified. Irrespective of the class, the most important factor for survival was gender (at least between the two features Sex and Pclass). However, men had a significantly higher chance of survival if they bought class 1 tickets. This goes to show that we should keep both features, as both yield insightful information that should help our model.
- Survival rate of class 1 females: 96.8%
- Survival rate of class 2 females: 92.1%
- Survival rate of class 3 females: 50%
- Survival rate of class 1 males: 36.8% (still significantly lower than class 3 females)
2.6 Question 6: Did a passenger’s age influence the chance of survival?
The histogram above shows that age follows a fairly normal distribution. Investigating the kernel density estimate does not provide additional information either, apart from a rise in survivors at a very young age. However, one idea is to investigate age and sex together using a swarm plot, as it does not seem plausible that age has no influence on the chance of survival:
As expected, age holds valuable information. The swarm plot above shows that a big portion of the male survivors were passengers between 0 and 12 years of age. It is also interesting to see that the oldest passenger, an 80-year-old man, survived.
Looking at the swarm plot below, we can see how important the Pclass is when it comes to predicting the likelihood of survival. One additional piece of information from this plot is that first class did not include many children. Maybe rich people have fewer kids in general?
2.7 Question 7: Did the price paid for the ticket influence the chance of survival?
Before we start to answer this question, we should look at the feature’s basic properties to get a better intuition for the distribution we are dealing with:
Fare does not follow a normal distribution and has a huge spike in the price range of 0–100$.
The distribution is skewed to the right, with 75% of the fares below 31$ and a maximum fare of 512$. Depending on the model that I am going to use, it might make sense to normalize this feature. However, this aspect will be tackled later in the feature engineering section.
To better understand how this feature influences the survival rate, we could plot bar plots of Fare vs Survived. However, due to the large range of fares, such a plot would not be useful for inferring useful information.
A more suited visualization would be to combine fares into categories and then plot the categories vs Survived.
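Binning into the article's four categories can be sketched with `pd.cut`, which splits the fare range into equal-width intervals (the toy fares and survival labels below are made up; the category names follow the article):

```python
import pandas as pd

# toy data standing in for train_df
df = pd.DataFrame({
    "Fare": [5.0, 8.0, 140.0, 31.0, 260.0, 400.0, 240.0, 512.0],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})
labels = ["Cheap", "Standard", "Expensive", "Luxury"]
# four equal-width bins spanning the observed fare range
df["Fare_cat"] = pd.cut(df["Fare"], bins=4, labels=labels)
# survival rate per fare category
rates = df.groupby("Fare_cat")["Survived"].mean()
print(rates)
```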
As we can see the likelihood of survival is definitely influenced by the price paid.
- Cheap (0–25% of max price): survival rate ≈ 0.2
- Standard (25–50% of max price): survival rate ≈ 0.3
- Expensive (50–75% of max price): survival rate ≈ 0.45
- Luxury (75–100% of max price): survival rate ≈ 0.55
Additionally, we can investigate the relationship between fare, sex and survived to further understand the importance of the feature:
Here, some important observations can be made:
- Irrespective of gender, all passengers with a fare above 500$ survived.
- All male passengers who paid between 200–300$ died.
- All female passengers who paid between 200–300$ survived.
This could be a pattern a classifier might pick up. One thing that caught my attention is that the minimum fare paid was 0.0 $, which seems highly unlikely.
We can investigate who these people were:
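A boolean filter does the job. The sketch below also applies the replacement discussed right after, on made-up toy rows:

```python
import numpy as np
import pandas as pd

# toy stand-in for train_df
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Fare": [0.0, 7.25, 0.0],
})
# inspect the passengers who supposedly paid nothing
free_riders = df[df["Fare"] == 0.0]
print(free_riders)
# a free ride is implausible, so mark those fares as missing for later imputation
df.loc[df["Fare"] == 0.0, "Fare"] = np.nan
```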
At first glance, I do not notice any particular characteristic that these passengers who paid nothing for their fare have in common. Since a free ride on the Titanic intuitively makes no sense, I will replace the ticket price of these passengers with NaNs. In the feature engineering part that follows later, we will discuss the handling of this feature separately.
2.8 Question 8: Could the place of embarkation influence the chance of survival?
My reasoning and intuition would not make me believe that the place of embarkation matters at all; however, we must force ourselves not to make any assumptions about the data.
One idea to analyze the data is to use count plots for the 3 different locations of embarkation Southampton, Cherbourg and Queenstown:
We can see that the majority of passengers embarked from Southampton. However, only 33% of them survived the sinking of the Titanic.
The highest survival rate of 55% is found in the group of passengers that embarked from Cherbourg. While it is important to look at the data without making prior assumptions, common sense should be used at all times: why should the place of embarkation influence the survival rate at all? Why is the likelihood of survival higher for Cherbourg?
As we know from above, a better class increases the survival rate drastically. One indicator may be the percentage of first-class passengers that embarked at Cherbourg.
2.9 Question 9: Was the high number of survivors that embarked at Cherbourg due to a high number of first-class passengers?
The hypothesis seems to be correct.
- The majority of passengers that embarked at Cherbourg were first-class passengers.
- The majority of passengers that embarked at Southampton were third-class passengers.
However, this does not explain why the survival rate of Queenstown passengers is slightly higher than that of Southampton passengers, even though the ratio of first- to third-class passengers is higher at Southampton.
One hypothesis is that maybe the ratio between male and female passengers differs.
2.10 Question 10: Is gender distribution responsible for the slightly higher passenger survival rate in Queenstown compared to Southampton?
As expected, roughly twice as many male as female passengers embarked from Southampton, whereas roughly equal numbers of male and female passengers embarked from Queenstown. This just shows the importance of the Sex feature.
2.11 Question 11: Does the number of children/siblings/spouses or parents on board influence the chance of survival?
As we can see from the plots above, the majority of passengers traveled alone. It seems that the more siblings a passenger had, the lower the chance of survival was.
Similar to the SibSp column, this feature contains the number of parents or children each passenger was traveling with.
Here we draw the same conclusions for Parch as for SibSp. We can see again that small families had a better chance to survive than bigger ones and than passengers who traveled alone.
Later, in the feature engineering part, we will think about how to combine Parch and SibSp into a new feature that utilizes the information of both.
3. Feature Engineering
3.1 Feature Name:
One feature that we did not consider until now is the name. In theory, a person’s name should have no influence on the probability of survival, but on closer investigation we see that a title is sometimes hidden in the name, which in turn could be quite useful. Using the different names as categorical variables does not make sense, however. One idea is to extract the title from the name.
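Since Titanic names follow the pattern "Last, Title. First", the title is the word in front of the first dot. A regex-based sketch (the toy names are taken from the dataset's well-known format):

```python
import pandas as pd

# toy names in the Titanic format "Last, Title. First"
train_df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})
# extract the word that is followed by a dot, i.e. the title
train_df["Title"] = train_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(train_df["Title"].tolist())
```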
Using the code above we can create a new Title feature:
If we now look at the count plot of the new title feature we see that certain titles dominate.
In this case it makes sense to group less frequent titles together. I will map rare male titles to Mr and rare female titles to Mrs:
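One way to do this is a replacement dictionary. The exact mapping below is my own guess (note that Master, Dr and Rev are kept separate, since they are discussed individually further down):

```python
import pandas as pd

# toy title column standing in for train_df["Title"]
titles = pd.Series(["Mr", "Miss", "Capt", "Col", "Mrs", "Mlle", "Countess", "Don"])
# assumed mapping: rare male titles -> Mr, rare female titles -> Mrs
title_map = {
    "Capt": "Mr", "Col": "Mr", "Don": "Mr", "Major": "Mr", "Jonkheer": "Mr", "Sir": "Mr",
    "Mlle": "Mrs", "Mme": "Mrs", "Ms": "Mrs", "Lady": "Mrs", "Countess": "Mrs", "Dona": "Mrs",
}
titles = titles.replace(title_map)
print(titles.value_counts())
```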
Finally, let’s investigate the title feature and the survival rate together:
- As expected, female titles result in a higher survival rate.
- Master and Dr have a surprisingly high survival rate, even though both are male titles.
- Being “just” a Mr comes with a compromised survival rate of approx. 15%.
- Interestingly, all 6 reverends died. Maybe they decided to accept their destiny and wanted to die with dignity.
3.2 Feature Cabin and Ticket:
As we can see, both features are not easy to deal with. Cabin contains a lot of NaNs, and the ticket does not seem to provide any useful information.
We can try different ideas:
- Extract the two leading characters of the ticket to create a new feature
- Extract the number of characters in the ticket to create a new feature
- Extract the number of cabins used
- Extract the cabin letter
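The ideas above can be sketched with pandas string methods. `Ticket_len` and `Ticket_2letter` are the feature names used later in the feature selection; `Cabin_num` and `Cabin_letter` are my own names for the remaining two ideas:

```python
import pandas as pd

# toy stand-in for train_df with made-up tickets/cabins in the dataset's format
train_df = pd.DataFrame({
    "Ticket": ["A/5 21171", "PC 17599", "113803"],
    "Cabin": ["C85", "C123 C125", None],
})
# number of characters in the ticket string
train_df["Ticket_len"] = train_df["Ticket"].str.len()
# two leading characters of the ticket
train_df["Ticket_2letter"] = train_df["Ticket"].str[:2]
# number of cabins booked (missing cabin -> 0)
train_df["Cabin_num"] = train_df["Cabin"].str.split().str.len().fillna(0).astype(int)
# deck letter of the (first) cabin
train_df["Cabin_letter"] = train_df["Cabin"].str[0]
print(train_df)
```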
You can investigate the resulting dataframe train_df to see the results after feature engineering.
3.3 Feature Family Size:
As mentioned in 2.11, Parch and SibSp can be combined into a new feature that captures the information of both. We can calculate the size of the family by arithmetically adding both features:
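A minimal sketch, assuming the common convention of counting the passenger themselves (hence the +1):

```python
import pandas as pd

# toy stand-in for train_df
train_df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})
# family size = siblings/spouses + parents/children + the passenger themselves
train_df["Fam_size"] = train_df["SibSp"] + train_df["Parch"] + 1
print(train_df["Fam_size"].tolist())
```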
3.4 Feature Family Type:
In a similar way we can create a new categorical feature that encodes the family size into 4 distinct groups:
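The group boundaries below are my own guess at a sensible four-way split; `Fam_type` is the feature name used later in the feature selection:

```python
import pandas as pd

train_df = pd.DataFrame({"Fam_size": [1, 2, 5, 8]})

def fam_type(size: int) -> str:
    """Assumed grouping: alone, small (2-4), medium (5-6), large (7+)."""
    if size == 1:
        return "Alone"
    if size <= 4:
        return "Small"
    if size <= 6:
        return "Medium"
    return "Large"

train_df["Fam_type"] = train_df["Fam_size"].apply(fam_type)
print(train_df["Fam_type"].tolist())
```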
4. Training a Classifier
Although we have a relatively large number of features available in relation to the size of the data set, it does not necessarily make sense to use all of them. Especially if the training time is drastically increased by features with little information content, these should not be utilized.
- We start by selecting the features we will use and isolating the target:
- -> ‘Pclass’, ‘Fare’, ‘Title’, ‘Embarked’, ‘Fam_type’, ‘Ticket_len’, ‘Ticket_2letter’
- -> Cabin will not be used, and the relevant information in the age feature (namely being a young man) is already encoded in the Title feature.
- -> Sex will not be used, so as not to confuse the classifier: adult males and young boys share the same sex but are really different categories.
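The selection step can be sketched as follows, using the feature list above on a toy stand-in for the engineered train_df:

```python
import pandas as pd

# toy stand-in for the engineered training frame
train_df = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Fare": [7.25, 71.28],
    "Title": ["Mr", "Mrs"],
    "Embarked": ["S", "C"],
    "Fam_type": ["Alone", "Small"],
    "Ticket_len": [9, 8],
    "Ticket_2letter": ["A/", "PC"],
    "Sex": ["male", "female"],  # deliberately dropped, see above
})
features = ["Pclass", "Fare", "Title", "Embarked", "Fam_type", "Ticket_len", "Ticket_2letter"]
X_train = train_df[features]       # model inputs
y_train = train_df["Survived"]     # target
print(X_train.shape, y_train.shape)
```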
The last step before fitting a model is to prepare a pipeline that will make sure that all the preprocessing performed on the training data is also done on the test data:
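One way to build such a pipeline in scikit-learn is a `ColumnTransformer` for the categoricals followed by a random forest (the article later mentions an RF classifier; the toy data, the reduced feature set, and the hyperparameters here are illustrative assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy training data; in the article X_train/y_train come from the selection step
X_train = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Fare": [7.25, 71.28, 13.0, 8.05],
    "Title": ["Mr", "Mrs", "Miss", "Mr"],
})
y_train = pd.Series([0, 1, 1, 0])

# one-hot encode the categorical column, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Title"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
# the same preprocessing is applied automatically whenever the pipeline
# is fitted on training data or used to predict on the test data
model.fit(X_train, y_train)
preds = model.predict(X_train)
print(preds)
```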
Finally, you can create and upload your submission file:
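Kaggle expects a two-column CSV with PassengerId and Survived. A sketch with made-up predictions:

```python
import pandas as pd

# hypothetical test ids and predictions; in practice use
# test_df["PassengerId"] and model.predict(X_test)
test_df = pd.DataFrame({"PassengerId": [892, 893, 894]})
y_pred = [0, 1, 0]
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": y_pred,
})
# Kaggle expects no index column in the file
submission.to_csv("submission.csv", index=False)
print(submission.head())
```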
5. Further improvements
The pipeline presented here is only intended to be a starting point. Many possibilities for improvement have not even been addressed. For example, more data imputation possibilities could be explored. Instead of the RF classifier, one could try the XGBoost classifier or, for example, combine several classifiers with the help of a voting classifier. The cabin feature also offers potential for improvement: instead of simply extracting the first two letters, the number of cabins booked could be a possible feature, since some passengers booked more than one cabin. You see, the possibilities are far from exhausted.
I hope that this article helps beginners understand the structured approach of a Machine Learning Engineer/Data Scientist. Data analysis is often time-consuming and requires asking many questions. The added value of this analysis is that the number of weak features can be reduced and the remaining ones can be used to create meaningful features. Have fun searching for better features and cracking the top 3%!