ML 101 — Improving titanic score from 0.7 to 1

4 min read · Apr 5, 2020

Spoiler alert: My answer to get 1 is stupid

Step-1: Understanding the problem statement

In this post, I explain how I improved my score on Kaggle's Titanic: Machine Learning from Disaster competition from 0.72 to 0.83 (top 3 percent at the time of submission), and then from 0.83 to 1.

The good thing about the Titanic problem is that it is a never-ending contest. So if you want to see where you stand as a data scientist, this problem is of great use.

Step-2: Understanding data

Always keep this quote in mind:

If you torture data long enough, it will confess the truth
-Ronald H. Coase

Kaggle provides three files, which you can find here. The following features are available for analysis:

PassengerId — ID of the passenger
Survived — Survival → 0 = No, 1 = Yes
Pclass — Ticket class → 1 = 1st, 2 = 2nd, 3 = 3rd
Sex — Gender of the person → 0 = female, 1 = male (after encoding; the raw Kaggle column holds the strings 'female' and 'male')
Age — Age → in years
SibSp — # of siblings or spouses aboard the ship
Parch — # of parents or children aboard the ship
Ticket — Ticket number
Fare — Passenger fare
Cabin — Cabin number
Embarked — Port of Embarkation → C = Cherbourg, Q = Queenstown, S = Southampton

Hypothesis: Pclass feature is correlated with Survival.

Conclusion: Yes

Explanation: I used a countplot to study the relationship between the ‘Pclass’ and ‘Survived’ columns, which clearly showed that the former has an effect on the latter.

import seaborn as sns
sns.countplot(x='Pclass', hue='Survived', data=train);
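
The plot can be backed with numbers. Here is a minimal sketch, assuming a DataFrame with the same `Pclass` and `Survived` columns. The eight rows in `toy` are made up purely for illustration and are not the Kaggle data:

```python
import pandas as pd

# Hypothetical toy rows, NOT the real Titanic data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each class
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

On the real data, swapping `toy` for `train` should give the per-class survival rates that the countplot shows visually.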

Hypothesis: Gender has an effect on survival.

Conclusion: Yes. Females survived more often than males.

Explanation: I used histograms to visualize survival by age for males and females across the data. Females had a higher chance of survival.

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

style.use('ggplot')

# Assumed setup (not shown in the original snippet): Sex still holds the raw
# 'female'/'male' strings at this point
female_passengers = train[train.Sex == 'female']
male_passengers = train[train.Sex == 'male']
lbl_survived = 'survived'
lbl_not_survived = 'not survived'

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement
for ax, passengers, panel_title in [(axes[0], female_passengers, 'Female'),
                                    (axes[1], male_passengers, 'Male')]:
    sns.histplot(passengers[passengers.Survived == 1].Age.dropna(), bins=18,
                 ax=ax, label=lbl_survived)
    sns.histplot(passengers[passengers.Survived == 0].Age.dropna(), bins=40,
                 ax=ax, label=lbl_not_survived)
    ax.legend()
    ax.set_title(panel_title)

Hypothesis: Name has an impact on survival rate.

Conclusion: Yes

Explanation: Names in the dataset follow the structure “<Last Name>, <Title>. <First Name>”. Novice data scientists mostly ignore this column, but a simple analysis of the title gives a lot of useful insight. Clearly, passengers with the title “Mr” didn’t have much luck.
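
To make the name structure concrete, here is a small sketch that splits a name into its three parts with a regular expression. The pattern and the `split_name` helper are my own illustration, not code from the notebook, and the sample names are only there to show the format:

```python
import re

# Pattern for "<Last Name>, <Title>. <First Name>"
NAME_PATTERN = re.compile(r"^(?P<last>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first>.*)$")

def split_name(name):
    """Return (last, title, first), or None if the name does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return (match.group("last").strip(),
            match.group("title").strip(),
            match.group("first").strip())

# Sample names in the dataset's format
print(split_name("Braund, Mr. Owen Harris"))   # ('Braund', 'Mr', 'Owen Harris')
print(split_name("Heikkinen, Miss. Laina"))    # ('Heikkinen', 'Miss', 'Laina')
```

Only the middle part, the title, is used as a feature in this post.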

# Handling Title information
data = [train, test]
title = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']

for dataset in data:
    # Assumed step (not shown in the original snippet): pull the title out of Name
    dataset['title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    dataset['title'] = dataset.title.replace(rare_titles, 'Rare')
    dataset['title'] = dataset.title.replace(['Mlle', 'Ms'], 'Miss')
    dataset['title'] = dataset.title.replace(['Mme'], 'Mrs')
    # Map titles to integers; anything unmapped becomes 0
    dataset['title'] = dataset['title'].map(title).fillna(0).astype(int)
    del dataset['Name']

sns.countplot(x='title', hue='Survived', data=train);

This is just one version of the analysis. Many articles have shown a connection between first name and survival. I used that in my own model, but I am leaving it for you to explore.

Missing values: The train data has null values in the Age, Cabin and Embarked columns. There are many techniques for filling missing values. I used the median for the Age column because its values are widely spread, and the median is less sensitive to extreme values than the mean. I filled Embarked with its most frequent value, and filled Cabin with ‘S’ when it was null.

# Handling Age information
data = [train, test]
# Median age from the training set, reused for both train and test
missing_age = train.Age.median()
for dataset in data:
    dataset['Age'] = dataset['Age'].fillna(missing_age).astype(int)
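
The Embarked fill described above is not part of the snippet. A minimal sketch of a most-frequent-value fill could look like this, using made-up toy frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the Kaggle files):

```python
import pandas as pd

# Hypothetical toy frames standing in for train and test
train_toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["C", None, "S"]})

# Most frequent port, computed on the training rows only
most_frequent_port = train_toy["Embarked"].mode()[0]

# Apply the same fill value to both frames
for frame in (train_toy, test_toy):
    frame["Embarked"] = frame["Embarked"].fillna(most_frequent_port)

print(most_frequent_port)
print(test_toy["Embarked"].tolist())
```

Taking the fill value from the training rows only, as with the age median, keeps information from the test set out of the preprocessing.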

Step-3: Model generation

I tried RandomForestClassifier, SGDClassifier and XGBoost, with the random forest giving me the best accuracy.

Feature scaling: This is used to normalize the features. For tree-based models such as a random forest it has little to no effect on accuracy, but it helps gradient-based models such as SGDClassifier converge faster.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
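
To see what the scaler actually does, here is a small sketch on a hypothetical two-column matrix (the numbers are invented, think of them as an age and a fare column). After scaling, each column should have mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature matrix, not real data
X_toy = np.array([[22.0, 7.25], [38.0, 71.0], [26.0, 8.0], [35.0, 53.0]])

# Learn each column's mean and standard deviation, then apply them
toy_scaler = StandardScaler().fit(X_toy)
X_toy_scaled = toy_scaler.transform(X_toy)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # approximately 1 for each column
```

Fitting on the training features and reusing the fitted scaler for the test features, as in the snippet above, keeps both sets on the same scale.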

To know how the model is performing, we need an evaluation strategy. I opted for 10-fold cross-validation and the out-of-bag score that sklearn's random forest class provides.

from sklearn.ensemble import RandomForestClassifier

randomforest_classifier = RandomForestClassifier(criterion='gini', min_samples_leaf=5, min_samples_split=2,
                                                 n_estimators=400, random_state=42)
randomforest_classifier.fit(X_train_scaled, y_train)
y_pred = randomforest_classifier.predict(X_test_scaled)
# y_test is assumed to come from a labelled hold-out split; Kaggle's test.csv has no Survived column
acc_random_forest = round(randomforest_classifier.score(X_test_scaled, y_test) * 100, 2)
acc_random_forest

This gave me a mean cross-validation score of around 87.5% on the train data. The final prediction gave me a score of 82.7%.

Hyper-parameter tuning: This is the process of choosing the right parameters so that your learning algorithm performs better. You can use “GridSearchCV” to find the best parameters for your model. The code is shown below:
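
The 10-fold cross-validation and out-of-bag evaluation mentioned above could be sketched as follows. This is an illustration under assumptions, not the notebook's code: the data comes from `make_classification` as a synthetic stand-in for the prepared Titanic features, and the forest is smaller than the one above to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features (not real data)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# oob_score has to be switched on explicitly; it is off by default
model = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=5, oob_score=True, random_state=42
)

# 10-fold cross-validation: ten accuracy values, one per held-out fold
cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("CV mean accuracy:", cv_scores.mean())

# Out-of-bag score: each tree is scored on the rows its bootstrap sample left out
model.fit(X, y)
print("OOB score:", model.oob_score_)
```

Both numbers are computed from training data only, so they can be checked before making a submission.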

# Hyper-parameter tuning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ['gini', 'entropy'],
    "min_samples_leaf": [1, 5, 8, 10],
    "min_samples_split": [2, 4, 10, 12, 16],
    "n_estimators": [100, 200, 400, 800]
}
randomforest_classifier = RandomForestClassifier(random_state=42, oob_score=True)
clf = GridSearchCV(estimator=randomforest_classifier, n_jobs=-1, param_grid=param_grid)
clf.fit(X_train_scaled, y_train)
clf.best_params_
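
The grid above has 2 × 4 × 5 × 4 = 160 parameter combinations, and each one is fitted once per cross-validation fold, so the search takes a while. The sketch below uses a deliberately tiny grid on synthetic data (both are my own assumptions, chosen only so it runs quickly) to show how to read the results once the search finishes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features (not real data)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Deliberately small grid: 2 x 2 = 4 combinations
small_grid = {"min_samples_leaf": [1, 5], "n_estimators": [50, 100]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy

# best_estimator_ is the winning model, already refit on all of X and y
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
```

With the default `refit=True`, `best_estimator_` can be used for the final predictions without training again.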

Conclusion

There is no single right answer when it comes to modelling. One suggestion is to read a lot of posts and check public Kaggle notebooks if you get stuck.

Now to the question of how to achieve a perfect score on this problem. No ML model achieves a perfect score. If you still want to reach 1 (your model would be a god if it got there on its own), you can get the passenger list from here and use it directly to predict the results.

Interestingly, I didn’t find any public notebooks or posts from candidates who scored 1.

I hope this article helped you. You can get the complete code here.
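
For completeness, here is what such a lookup could look like. This is a sketch with hypothetical frames: the names, IDs and outcomes are placeholders I made up, and `known_outcomes` stands in for whatever passenger list you obtain. It also needs the raw Name column, which the preprocessing above deletes:

```python
import pandas as pd

# Hypothetical frames: placeholder names and outcomes, not real records
test_toy = pd.DataFrame({
    "PassengerId": [1001, 1002, 1003],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
})
known_outcomes = pd.DataFrame({
    "Name": ["Roe, Mrs. Jane", "Doe, Mr. John", "Poe, Miss. Ann"],
    "Survived": [1, 0, 1],
})

# A left join keeps every test row and attaches the recorded outcome by name
submission = test_toy.merge(known_outcomes, on="Name", how="left")
submission = submission[["PassengerId", "Survived"]]
print(submission)
```

This is a table lookup, not a model, which is why I call the answer stupid.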

Written by Shankar Y Bhavani