ML 101: Improving the Titanic score from 0.7 to 1
Spoiler alert: My answer to get 1 is stupid
Step-1: Understanding the problem statement
In this post, I explain how I improved my score on Kaggle's Titanic: Machine Learning from Disaster competition from 0.72 to 0.83 (top 3 percent at the time of submission), and then from 0.83 to 1.
The good thing about the Titanic problem is that it is a never-ending contest. So, if you want to see where you stand as a data scientist, this problem is of great use.
Step-2: Understanding data
Always keep this quote in mind:
If you torture data long enough, it will confess the truth
-Ronald H. Coase
Kaggle provides you with 3 files, which you can find here. The features available for analysis are:
PassengerId: Id of the passenger
Survived: Survival → 0 = No, 1 = Yes
Pclass: Ticket class → 1 = 1st, 2 = 2nd, 3 = 3rd
Sex: Gender of the person → 0 = female, 1 = male (after encoding; the raw column holds the strings 'female' and 'male')
Age: Age → in years
SibSp: # of siblings or spouses aboard the ship
Parch: # of parents or children aboard the ship
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation → C = Cherbourg, Q = Queenstown, S = Southampton
Hypothesis: Pclass feature is correlated with Survival.
Conclusion: Yes
Explanation: I used a countplot to study the relationship between the ‘Pclass’ and ‘Survived’ columns, which clearly showed that the former has an effect on the latter.
import seaborn as sns
sns.countplot(x='Pclass', hue='Survived', data=train);
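The plot gives the visual picture; the same comparison can also be read as numbers. Here is a minimal sketch that computes the survival rate per class. The helper name `survival_rate_by` and the tiny made-up frame are my assumptions for illustration only; in practice you would pass the real `train` frame.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the 0/1 Survived column per category, i.e. the survival rate."""
    return df.groupby(column)["Survived"].mean()


# Made-up rows, only to show the call; they are not Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_rate_by(toy, "Pclass")
print(rates)
```

If the rates differ a lot between classes, the feature is worth keeping for the model.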
Hypothesis: Gender has an effect on survival.
Conclusion: Yes. Females survived at a higher rate than males.
Explanation: I used histograms to visualize how survivors and non-survivors are distributed by age for males and females. Females have a higher chance of survival.
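import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

style.use('ggplot')

# Assumed definitions (not shown in this post): Sex is taken to hold the raw
# 'female'/'male' strings, and the two labels are only legend text.
female_passengers = train[train.Sex == 'female']
male_passengers = train[train.Sex == 'male']
lbl_survived, lbl_not_survived = 'survived', 'not survived'

# Age histograms of survivors vs. non-survivors, one panel per gender.
# Note: distplot is deprecated in recent seaborn releases; histplot replaces it.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax = sns.distplot(female_passengers[female_passengers.Survived == 1].Age.dropna(), bins=18,
                  ax=axes[0], kde=False, label=lbl_survived)
ax = sns.distplot(female_passengers[female_passengers.Survived == 0].Age.dropna(), bins=40,
                  ax=axes[0], kde=False, label=lbl_not_survived)
ax.legend()
ax.set_title('Female')
ax = sns.distplot(male_passengers[male_passengers.Survived == 1].Age.dropna(), bins=18,
                  ax=axes[1], kde=False, label=lbl_survived)
ax = sns.distplot(male_passengers[male_passengers.Survived == 0].Age.dropna(), bins=40,
                  ax=axes[1], kde=False, label=lbl_not_survived)
ax.legend()
_ = ax.set_title('Male')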
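The histograms combine gender and age, so a small table of survival rates per gender and age band can be a handy companion. This is only a sketch: the band edges (16 and 40), the `AgeBand` column name and the made-up rows are my assumptions, not values from the analysis above.

```python
import pandas as pd

# Made-up rows in the shape of the Titanic columns; not real passengers.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Age":      [4, 28, 45, 6, 30, 52, 22, None],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 1],
})

# Rows without an age cannot be placed in a band, so drop them for this view.
known_age = toy.dropna(subset=["Age"]).copy()

# Hypothetical bands: (0, 16] child, (16, 40] adult, (40, 80] senior.
known_age["AgeBand"] = pd.cut(
    known_age["Age"], bins=[0, 16, 40, 80], labels=["child", "adult", "senior"]
).astype(str)

rates = known_age.groupby(["Sex", "AgeBand"])["Survived"].mean()
print(rates)
```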
Hypothesis: Name has impact on survival rate
Conclusion: Yes
Explanation: The structure of a name in the dataset is “<Last Name>, <Title>. <First Name>”. While this is mostly ignored by novice data scientists, a simple analysis gives a lot of useful insights. Clearly, passengers with the title “Mr” didn’t have much luck.
# Handling Title information
data = [train, test]
title = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
    # Assumed step (not shown in this post): take the word that ends with a
    # period in Name, e.g. "Mr", as the title
    dataset['title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    # Group uncommon titles and merge spelling variants
    dataset['title'] = dataset['title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['title'] = dataset['title'].replace(['Mlle', 'Ms'], 'Miss')
    dataset['title'] = dataset['title'].replace(['Mme'], 'Mrs')
    # Encode titles as numbers; anything unmapped becomes 0
    dataset['title'] = dataset['title'].map(title).fillna(0)
    del dataset['Name']

sns.countplot(x='title', hue='Survived', data=train);
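To see what title extraction looks like on the name format described above, here is a small self-contained sketch. The regular expression (one space, letters, then a period) is a common choice and an assumption on my part, and the names are just samples written in the dataset's format.

```python
import pandas as pd

# Sample names in the "<Last Name>, <Title>. <First Name>" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the run of letters that sits between a space and the first period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())

# Uncommon titles can then be folded into a single bucket.
common = {"Mr", "Mrs", "Miss", "Master"}
grouped = titles.where(titles.isin(common), "Rare")
print(grouped.tolist())
```

Folding rare titles into one bucket keeps the number of categories small, which matters on a dataset of this size.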
This is just one version of the analysis; many articles have shown a connection between first name and survival. I used that in my own model, but I am leaving it for you to explore.
Missing values: The training data has null values in the Age, Cabin and Embarked columns. There are many techniques for filling missing values. I used the median for the Age column since its values are widely spread and the median is less sensitive to outliers than the mean. I filled Embarked with its most frequent value and filled Cabin with ‘S’ when it is null.
# Handling Age information
data = [train, test]
# Use the training-set median for both frames so no test information leaks in
missing_age = train.Age.median()

for dataset in data:
    dataset['Age'] = dataset['Age'].fillna(missing_age).astype(int)
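The Embarked and Cabin fills mentioned above are not shown in code, so here is a minimal sketch of how they could look. The frame is made up for illustration; the ‘S’ placeholder for Cabin follows the description in the text.

```python
import pandas as pd

# Made-up rows with gaps, in the shape of the Titanic columns.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin":    ["C85", None, None, "E46", None],
})

# mode() returns the most frequent value(s); take the first one.
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Fill missing cabins with the 'S' placeholder described in the text.
toy["Cabin"] = toy["Cabin"].fillna("S")

print(toy)
```

When you do this on the real data, compute the most frequent port on the training frame and reuse it for the test frame, the same way the Age median is handled.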
Step-3: Model generation
I tried RandomForestClassifier, SGDClassifier and XGBoost, with RandomForestClassifier giving me the best accuracy.
Feature scaling: This is used to normalize the features. For tree-based models such as random forests it has little to no effect on accuracy, but it helps gradient-based models such as SGDClassifier converge faster.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
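If you want to check what the scaler actually did, a quick sanity test is to look at the column means and standard deviations after the transform. The numbers below are made up; the point is that the test rows are scaled with the statistics learned from the training rows, never with their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature rows (think Age and Fare); not taken from the dataset.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05]])

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# After scaling, every training column has mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))

# The test row is shifted and scaled with the training statistics.
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_scaled))
```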
To know how our model is performing, we need an evaluation strategy. I opted for 10-fold cross-validation and the out-of-bag (OOB) score that sklearn's random forest exposes when oob_score=True.
from sklearn.ensemble import RandomForestClassifier

randomforest_classifier = RandomForestClassifier(criterion='gini', min_samples_leaf=5,
                                                 min_samples_split=2, n_estimators=400,
                                                 random_state=42)
randomforest_classifier.fit(X_train_scaled, y_train)
y_pred = randomforest_classifier.predict(X_test_scaled)
# y_test must come from a labelled hold-out split; Kaggle's test.csv has no Survived column
acc_random_forest = round(randomforest_classifier.score(X_test_scaled, y_test) * 100, 2)
acc_random_forest
This gave me a mean cross-validation score of around 87.5% on the training data. The final prediction gave me a score of 82.7%.
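The cross-validation and out-of-bag scores themselves can be obtained along these lines. This is a sketch on synthetic data from `make_classification`, not the Titanic features, and the smaller `n_estimators` is only there to keep it quick; swap in your own `X_train_scaled` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

model = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=5, oob_score=True, random_state=42
)

# 10-fold cross-validation: ten accuracy values, one per held-out fold.
cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("CV mean:", cv_scores.mean(), "CV std:", cv_scores.std())

# The OOB score is available after fitting with oob_score=True. Each tree is
# scored on the rows that were left out of its bootstrap sample.
model.fit(X, y)
oob = model.oob_score_
print("OOB score:", oob)
```

A cross-validation mean that is much higher than the leaderboard score, as in my case, is a hint that the model fits the training data better than it generalizes.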
Hyper-parameter tuning: This is the process of choosing the right parameters so that your learning algorithm performs better. You can use “GridSearchCV” to find the best parameters for your model. The code can be seen below:
# Hyper-parameter tuning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ['gini', 'entropy'],
    "min_samples_leaf": [1, 5, 8, 10],
    "min_samples_split": [2, 4, 10, 12, 16],
    "n_estimators": [100, 200, 400, 800]
}
randomforest_classifier = RandomForestClassifier(random_state=42, oob_score=True)
clf = GridSearchCV(estimator=randomforest_classifier, n_jobs=-1, param_grid=param_grid)
clf.fit(X_train_scaled, y_train)
clf.best_params_
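Besides `best_params_`, the fitted search object also carries the cross-validated score of the winning combination and a model refitted with it. Here is a small sketch on synthetic data with a deliberately tiny grid so that it runs in seconds; the grid values and `cv=3` are assumptions for the demo, not my tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Tiny grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV.
small_grid = {
    "min_samples_leaf": [1, 5],
    "n_estimators": [20, 40],
}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
)
search.fit(X, y)

print(search.best_params_)   # the winning combination
print(search.best_score_)    # its mean cross-validated accuracy

# best_estimator_ is already refitted on all of X, y and ready to predict.
predictions = search.best_estimator_.predict(X[:5])
print(predictions)
```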
Conclusion
There is no one right answer when it comes to modelling. One suggestion is to read a lot of posts and check public Kaggle notebooks if you are stuck on something.
Coming to the question of how to achieve a perfect score on this problem: no ML model achieves a perfect score. If you still want to reach 1 (your model would be a god if it got there by learning), you can get the passenger list from here and use it directly to produce the predictions.
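For completeness, the "stupid" approach boils down to a lookup rather than a model. A sketch of the idea, with made-up placeholder names and a hypothetical `passenger_list` frame standing in for whatever list you obtain; real lists rarely match the Kaggle name strings exactly, so expect to clean the names first.

```python
import pandas as pd

# Placeholder rows in the shape of Kaggle's test.csv; not real passengers.
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane"],
})

# Hypothetical external list with the known outcome per passenger.
passenger_list = pd.DataFrame({
    "Name": ["Roe, Mrs. Jane", "Doe, Mr. John"],
    "Survived": [1, 0],
})

# Look up each test passenger by name; a left join keeps the test order.
submission = test.merge(passenger_list, on="Name", how="left")[["PassengerId", "Survived"]]

# Any name that did not match would show up as a missing value here.
unmatched = int(submission["Survived"].isna().sum())
print(submission, unmatched)
```

This teaches you nothing about machine learning, which is exactly why I call it stupid.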
Interestingly, I didn’t find any public notebooks or posts from candidates who scored 1.
I hope this article helped you. You can get the complete code here.