Survival on the Titanic
A Hasty Attempt at the Titanic Kaggle Competition
Intro
Kaggle has hosted the Titanic competition for a few years now. It gives aspiring data scientists the opportunity to test some basic classification approaches on a real dataset that’s familiar to almost everyone.
Motivation
My main goal with this attempt is to see how well I can do in a short amount of time. I’m not going to spend more than a day (or a few hours) on this, in part because it seems kind of silly to try to eke out every last bit of predictive power from the data when this is really just meant to be an interesting practice exercise.
from __future__ import division, print_function
# Basic imports. I'm using Python 2
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import (
GridSearchCV, train_test_split, cross_val_score
)
from sklearn.ensemble import (
AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier,
RandomForestClassifier
)
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pd.set_option('display.max_columns', None)
%matplotlib inline
train = pd.read_csv('data/train.csv', index_col=0)
train.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
Feature Engineering
This is one of the most important and most often overlooked parts of the modeling process. Unfortunately, in the interest of getting respectable results as quickly as possible, I’m not going to spend much time on it either…
Rather than fiddling with my modeling approach, I could probably improve my score significantly just by building better, more predictive features.
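For example, one cheap feature I’d try with more time: the number of passengers sharing a ticket number, a rough proxy for traveling-group size that SibSp and Parch don’t fully capture. A quick sketch, purely illustrative; this TicketGroupSize idea is not wired into the pipeline below.
# Hypothetical extra feature: passengers sharing a ticket likely traveled together
ticket_group_size = train['Ticket'].map(train['Ticket'].value_counts())
print(ticket_group_size.value_counts().sort_index())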
columns_to_investigate = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
fig, axes = plt.subplots(
len(columns_to_investigate), 1, figsize=(8, 20)
)
survived_mask = (train['Survived'] == 1)
for ax, col in zip(axes, columns_to_investigate):
ax.hist(
train.loc[survived_mask, col].fillna(-1), bins=30,
normed=True, color='g', label='Survived', alpha=0.4
)
ax.hist(
train.loc[~survived_mask, col].fillna(-1), bins=30,
normed=True, color='r', label='Died', alpha=0.4
)
ax.set_title(col)
ax.legend()
fig.tight_layout()
Some Exploratory Histograms
Building a cleaning pipeline
title_re = re.compile(r' ([A-Za-z]+)\.')
title_replacements = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
cabin_letter_re = re.compile(r'([A-Z]+)')
cabin_num_re = re.compile(r'(\d+)')
def add_title_name_len(df):
"""
Adds Title col astype int from Name
Drops the name
"""
    df['Title'] = df['Name'].str.extract(title_re, expand=False)
df['Title'] = df['Title'].replace([
'Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major',
'Rev', 'Sir', 'Jonkheer', 'Dona'
], 'Rare')
    df['Title'] = (
        df['Title']
        # replace() keeps unmatched titles; map() would turn them all into NaN
        .replace(title_replacements)
        .map({'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5})
        .fillna(0)
        .astype(int)
    )
df['NameLen'] = df['Name'].str.len()
df.drop('Name', axis=1, inplace=True)
def clean_sex(df):
    """Fill any missing values with the mode, then map to ints"""
    df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])
    df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
def add_family_size(df):
df['FamilySize'] = df['SibSp'] + df['Parch']
def classify_ages(df):
"""
Adds MissingAge column
Makes Age an int column with categories from 1 to 5 ordered by age
"""
age_cuts = [15, 20, 30, 40, 50, 60]
df['MissingAge'] = 0.
df.loc[(df['Age'].isnull()), 'MissingAge'] = 1.
df['Age'] = pd.cut(
df['Age'], bins=age_cuts, labels=range(1, len(age_cuts))
).astype(float)
df['Age'] = df['Age'].fillna(0).astype(int)
def clean_fare(df):
"""
Adds a MissingFare column
Fills missing fares with 0s
Adds a quantile for fare
"""
cuts = [4, 7, 10, 20, 50, 100, 200]
# Don't expect this to be missing
df['Fare'] = df['Fare'].fillna(0)
    df['QuantFare'] = pd.cut(
        df['Fare'], cuts, labels=range(1, len(cuts))
    ).astype(float)
    # Fares outside the cuts (including the 0-filled missing ones) stay NaN,
    # which get_dummies later encodes as an all-zero indicator row
def clean_embarked(df):
    """
    Fill the very few missing values with the mode
    Map to integers
    """
    df['Embarked'] = (
        df['Embarked']
        .fillna(df['Embarked'].mode()[0])
        .map({'S': 0, 'C': 1, 'Q': 2})
        .astype(int)
    )
def clean_cabin(df):
"""
Adds MissingCabin column
Adds a CabinLetter column with letters mapped to ints
Adds a CabinNumber column of type ints
Drops original Cabin column
"""
df['MissingCabin'] = 0
df.loc[df['Cabin'].isnull(), 'MissingCabin'] = 1
df.loc[df['Cabin'].isnull(), 'Cabin'] = ''
    df['CabinLetter'] = (
        df['Cabin'].str.extract(cabin_letter_re, expand=False).fillna('')
    )
    df['CabinNumber'] = (
        df['Cabin'].str.extract(cabin_num_re, expand=False)
        .astype(float)
        .fillna(0)
        .astype(int)
    )
df.drop('Cabin', axis=1, inplace=True)
def categorize(df, columns):
"""
Given a df and list of columns, converts those columns to dummies
concats them with the df and drops the original columns.
"""
for col in columns:
dummy_df = pd.get_dummies(df[col], prefix=col, drop_first=True)
df.drop(col, axis=1, inplace=True)
df = pd.concat([df, dummy_df], axis=1)
return df
def data_prep(df):
add_title_name_len(df)
clean_sex(df)
classify_ages(df)
add_family_size(df)
df.drop('Ticket', axis=1, inplace=True)
clean_fare(df)
    # On second thought, try dropping this. There's no reason to think it's predictive
# clean_embarked(df)
df.drop('Embarked', axis=1, inplace=True)
clean_cabin(df)
return categorize(df, ['Title', 'Age', 'QuantFare', 'CabinLetter'])
train = pd.read_csv('data/train.csv', index_col=0)
train = data_prep(train)
train.head()
| PassengerId | Survived | Pclass | Sex | SibSp | Parch | Fare | NameLen | MissingAge | FamilySize | MissingCabin | CabinNumber | Title_2 | Title_3 | Age_1 | Age_2 | Age_3 | Age_4 | Age_5 | QuantFare_2.0 | QuantFare_3.0 | QuantFare_4.0 | QuantFare_5.0 | QuantFare_6.0 | CabinLetter_A | CabinLetter_B | CabinLetter_C | CabinLetter_D | CabinLetter_E | CabinLetter_F | CabinLetter_G | CabinLetter_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 1 | 1 | 0 | 7.2500 | 23 | 0.0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 | 0 | 71.2833 | 51 | 0.0 | 1 | 0 | 85 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 3 | 0 | 0 | 0 | 7.9250 | 22 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 1 | 0 | 53.1000 | 44 | 0.0 | 1 | 0 | 123 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 3 | 1 | 0 | 0 | 8.0500 | 24 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Predictor Correlations
plt.figure(figsize=(10, 10));
sns.heatmap(train.iloc[:, :11].corr(), square=True, annot=True);
Investigating Feature Usefulness
mini_df = train.iloc[:, :10].copy()
columns_to_investigate = mini_df.columns[1:10]
fig, axes = plt.subplots(9, 1, figsize=(10, 30))
survived_mask = (mini_df['Survived'] == 1)
for col, ax in zip(columns_to_investigate, axes):
ax.hist(
mini_df.loc[survived_mask, col], bins=30,
normed=True, color='g', label='Survived', alpha=0.4
)
ax.hist(
mini_df.loc[~survived_mask, col], bins=30,
normed=True, color='r', label='Died', alpha=0.4
)
ax.set_title(col)
ax.legend()
fig.tight_layout()
One thing of note: missing data almost always seems to be a bad sign. It’s probably correlated with wealth in some way, since lower-class, “less important” passengers were presumably less likely to have complete information recorded.
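A quick way to check that hunch is to split the survival rate on the missing-data flags added during cleaning:
# Survival rate by missing-data indicator (train still has Survived at this point)
print(train.groupby('MissingAge')['Survived'].mean())
print(train.groupby('MissingCabin')['Survived'].mean())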
y_train = train.pop('Survived').values
X_train = train.values
Parameter selection via grid search CV
I’m going to try out the following models:
- RandomForestClassifier
- AdaBoostClassifier
- BaggingClassifier
- GradientBoostingClassifier
- SupportVectorClassifier
Of course, by fitting all of these hyperparameters to the training data, I run the risk of meta-overfitting: tuning the hyperparameters themselves to quirks of the training set. That’s not something I’m particularly worried about with this project, though.
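If I were worried, nested cross-validation would give a more honest estimate: wrap the grid search itself in an outer CV loop, so the reported score never comes from data that was used to pick the hyperparameters. A minimal sketch (kept small, since it multiplies the number of fits):
# Nested CV sketch: hyperparameters are re-tuned inside each outer fold
nested_search = GridSearchCV(
    RandomForestClassifier(n_estimators=500),
    {'max_features': [3, 5], 'min_samples_split': [4, 6]},
    scoring='accuracy', cv=5
)
nested_scores = cross_val_score(
    nested_search, X_train, y_train, cv=5, n_jobs=-1
)
print(nested_scores.mean())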
Generally, where a model has an n_estimators parameter, I use a low learning rate with a larger number of estimators or iterations to limit the risk of overfitting. (The early-stopping sketch at the end of the XGBoost section shows one way to pick the number of iterations automatically.)
# A running dictionary of all my models, so I can keep track of everything
all_ensembles = dict()
Random Forest
rfc = RandomForestClassifier()
rfc_param_grid = {
'n_estimators': [500],
'max_features': np.arange(3, 7, 1, dtype=int),
'min_samples_split': [4, 5, 6]
}
rfc_g = GridSearchCV(
rfc, rfc_param_grid, scoring='accuracy', cv=5, n_jobs=-1, verbose=1
)
rfc_g.fit(X_train, y_train)
print(rfc_g.best_score_)
print(rfc_g.best_params_)
# 0.810325476992
# {'max_features': 3, 'min_samples_split': 6, 'n_estimators': 500}
rf_model = RandomForestClassifier(
n_estimators=3000, max_features=3, min_samples_split=6
)
rf_scores = (
cross_val_score(
rf_model, X_train, y_train, scoring='accuracy', verbose=1, n_jobs=-1
)
)
rf_score = rf_scores.mean()
print(rf_score)
# 0.828
rf_model.fit(X_train, y_train)
all_ensembles.update({'rf': (rf_model, rf_score)})
AdaBoost Classifier
abc = AdaBoostClassifier()
abc_param_grid = {
'n_estimators': [1000],
'learning_rate': [0.01, 0.1, 0.3]
}
abc_g = GridSearchCV(
abc, abc_param_grid, scoring='accuracy', cv=5, n_jobs=-1, verbose=1
)
abc_g.fit(X_train, y_train)
print(abc_g.best_score_)
# 0.801346
print(abc_g.best_params_)
# {'learning_rate': 0.1, 'n_estimators': 1000}
abc_model = AdaBoostClassifier(
n_estimators=2000, learning_rate=0.1
)
# Going past ~1500 estimators doesn't help much here
abc_scores = (
cross_val_score(
abc_model, X_train, y_train, scoring='accuracy', verbose=1, n_jobs=-1
)
)
abc_score = abc_scores.mean()
print(abc_score)
# 0.795
abc_model.fit(X_train, y_train)
all_ensembles.update({'abc': (abc_model, abc_score)})
Bagging
bag = BaggingClassifier()
bag_param_grid = {
'base_estimator': [None],
'n_estimators': [10, 25, 30, 40],
'max_samples': [0.2, 0.3, 0.5, 0.7, 1.],
'max_features': [0.95, 1.]
}
bag_param = GridSearchCV(
bag, bag_param_grid, scoring='accuracy', n_jobs=-1, cv=5,
verbose=1
)
bag_param.fit(X_train, y_train)
print(bag_param.best_score_)
# 0.8226
print(bag_param.best_params_)
# {'base_estimator': None,
# 'max_features': 0.95,
# 'max_samples': 0.5,
# 'n_estimators': 25}
bag_model = BaggingClassifier(
n_estimators=25, max_features=1., max_samples=0.3
)
bag_scores = (
cross_val_score(
bag_model, X_train, y_train, scoring='accuracy', verbose=1, cv=5
)
)
bag_score = bag_scores.mean()
print(bag_score)
# 0.802
bag_model.fit(X_train, y_train)
all_ensembles.update({'bagging': (bag_model, bag_score)})
GradientBoosting
gb_classifier = GradientBoostingClassifier()
gb_param_grid = {
'learning_rate': [0.1],
'n_estimators': [3000],
'max_depth': [3, 5, 7],
'subsample': [0.5, 0.75, 1]
}
gb_param = GridSearchCV(
gb_classifier, gb_param_grid, scoring='accuracy', n_jobs=-1, cv=5,
verbose=1
)
gb_param.fit(X_train, y_train)
print(gb_param.best_score_)
# 0.82267
print(gb_param.best_params_)
# {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 3000, 'subsample': 0.5}
gb_model = GradientBoostingClassifier(
learning_rate=0.1, max_depth=7, n_estimators=3000, subsample=0.95
)
gb_scores = (
cross_val_score(
gb_model, X_train, y_train, scoring='accuracy', verbose=1, cv=5, n_jobs=-1
)
)
gb_score = gb_scores.mean()
print(gb_score)
# 0.80
gb_model.fit(X_train, y_train)
all_ensembles.update({'gb': (gb_model, gb_score)})
Support Vector Machine Classifier
svc = SVC()
svc_pipe = Pipeline([
('scale', StandardScaler()),
('svc', svc)
])
svc_grid_params = [{
'svc__kernel': ['poly'],
'svc__degree': np.arange(1, 8, dtype=int),
'svc__C': np.logspace(-2, 3, 8)
},
{
'svc__kernel': ['rbf'],
'svc__gamma': np.logspace(-4, 2, 8),
'svc__C': np.logspace(-2, 2, 10)
}
]
svc_grid = GridSearchCV(
svc_pipe, param_grid=svc_grid_params, n_jobs=-1,
verbose=2
)
svc_grid.fit(X_train, y_train)
print(svc_grid.best_score_)
# 0.818
print(svc_grid.best_params_)
# {'svc__C': 1.6681005372000592,
# 'svc__gamma': 0.26826957952797248,
# 'svc__kernel': 'rbf'}
svc_model = SVC(
C=30., kernel='rbf', gamma=0.005
)
svc_scores = (
cross_val_score(
svc_model, X_train, y_train, scoring='accuracy', verbose=1, cv=5
)
)
svc_score = svc_scores.mean()
print(svc_score)
# 0.773
svc_model.fit(X_train, y_train)
all_ensembles.update({'svc': (svc_model, svc_score)})
XGBoost
import xgboost as xgb
# Hold-out split for quick checks (also used by the early-stopping sketch below)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train, y_train)
def add_empty_missing_cols(df, all_columns):
    """
    Adds any dummy columns that appear in train but not here, then
    reorders to match the training columns (the models see bare numpy
    arrays, so column order has to line up)
    """
    for col in all_columns:
        if col not in df.columns and col != 'Survived':
            df[col] = 0.
    return df[[c for c in all_columns if c != 'Survived']]
# Preparing the test data
test = pd.read_csv('data/test.csv', index_col=0)
test = data_prep(test)
test = add_empty_missing_cols(test, train.columns)
X_test = test.values
xgb_model = xgb.XGBClassifier(
max_depth=3, n_estimators=3000, learning_rate=0.005
)
xgb_score = cross_val_score(xgb_model, X_train, y_train).mean()
print(xgb_score)
# 0.8226 (cross-validated accuracy on the training data)
all_ensembles.update({'xgb': (xgb_model, xgb_score)})
xgb_model.fit(X_train, y_train)
y_test_pred = xgb_model.predict(X_test)
final_results = pd.DataFrame(
{'PassengerId': test.index.values, 'Survived': y_test_pred}
)
final_results.head()
|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
final_results.to_csv('data/final_result.csv', index=False)
# Got a 76 on the Kaggle submission
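As an aside on the low-learning-rate strategy from earlier: instead of fixing n_estimators at 3000, early stopping can choose the number of boosting rounds from a hold-out set (the X_train1/X_test1 split above works for this). A sketch, assuming the early_stopping_rounds argument of the xgboost sklearn wrapper's fit method:
# Sketch: stop adding trees once hold-out performance stalls for 50 rounds
es_model = xgb.XGBClassifier(
    max_depth=3, n_estimators=3000, learning_rate=0.005
)
es_model.fit(
    X_train1, y_train1,
    eval_set=[(X_test1, y_test1)],
    early_stopping_rounds=50, verbose=False
)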
Ensembling
def ultra_ensembled_model(fit_estimators, X):
"""
This isn't really serious, but it will be interesting to see how
it does.
"""
score_normalization = np.sum([v[1] for v in fit_estimators.values()])
predictions = np.zeros((X.shape[0], len(fit_estimators)))
for i, (name, (estimator, score)) in enumerate(fit_estimators.iteritems()):
weight = score / score_normalization
        try:
            predictions[:, i] = estimator.predict(X) * weight
        except Exception as e:
            print("ERROR:", name, e)
return (predictions.sum(axis=1) > 0.5).astype(int)
y_pred = ultra_ensembled_model(all_ensembles, X_train)
acc = (y_pred == y_train).sum() / y_pred.size
print("Accuracy:", acc)
# Accuracy: 0.90
# This is on the training data, which is what it was fit to, so there's
# probably a lot of overfitting
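For a less flattering number, each base model can predict via out-of-fold cross-validation before the weighted vote. A rough sketch along the lines of the function above (cross_val_predict refits clones, so this only approximates the fitted ensemble):
from sklearn.model_selection import cross_val_predict

# Out-of-fold weighted vote: each model predicts rows only from folds
# it wasn't trained on
models = list(all_ensembles.values())
weights = np.array([score for _, score in models])
weights = weights / weights.sum()
oof = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=5) for est, _ in models
])
y_oof = (oof.dot(weights) > 0.5).astype(int)
print("OOF accuracy:", (y_oof == y_train).mean())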
y_test_pred = ultra_ensembled_model(all_ensembles, X_test)
final_results = pd.DataFrame(
{'PassengerId': test.index.values, 'Survived': y_test_pred}
)
final_results.head()
|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
Conclusion
In the end, my best score came from the ensemble, which earned 78%. While I do think it’s possible to improve on that, it would most likely take a lot of time spent sitting and thinking carefully about feature engineering, which doesn’t fit in a deliberately short attempt.
How are people getting >90% accuracy?
I’m extremely skeptical of submissions reporting accuracies above roughly 85-90%. It just doesn’t seem remotely possible to predict, with that much accuracy, whether someone would live or die that night from nothing more than their class/income, gender, age, and name. There are too many other factors at play. Any score that high was most likely achieved by other means (e.g. fitting the test data through repeated submissions).