Titanic Challenge

Published 2025-05-09 • Updated 2025-10-11

I’m working on “Titanic - Machine Learning from Disaster”, Kaggle’s introductory challenge: predicting passenger survival on the Titanic. It seems fun and practical (if a bit morbid), but it’s driving me a little crazy.

First: benchmarks! Leaderboard stats:

count 16172.000000
mean 0.767122
std 0.072317
min 0.000000
25% 0.765550
50% 0.775110
75% 0.777510
max 1.000000
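
These are describe()-style stats; a sketch of reproducing them, assuming the public leaderboard was exported to a leaderboard.csv with a Score column:

import pandas as pd

# summary statistics over every public leaderboard score
print(pd.read_csv("leaderboard.csv")["Score"].describe())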

The first few hundred scores are 1.0s, which I assume come from submissions that aren’t actually using an ML model, or that are training on test data. Top 5% is a score of 0.79655, so my goal is 0.80. It’s interesting that the 50th and 95th percentile scores are so close; it seems like small incremental gains are the name of the game.

My best score so far (using a bunch of feature engineering and XGBoost) is 0.77033, putting me squarely in the 35th percentile. I think I’m missing something obvious because, looking at similar public solutions, that approach should be getting me at least 0.79. So I’ll go back to basics and try to figure out what’s going wrong.

Note that, to get that best score, I was using an LLM (o3) pretty heavily. I’ve seen it hallucinate some basic things so that’s one reason for going back to the basics here. By the way, I need to think more about coding with AI; there seem to be a lot of pitfalls, and using AI well (especially with projects where learning is a higher priority than getting a certain result) requires intentionality.

data

The data is split into two sets, train.csv (shape (891, 12)) and test.csv (shape (418, 11)). The columns are the same, except the test set is missing the label, a 0/1 column called Survived.

The features are:

PassengerId : int64, unique ID; not a predictor
Pclass : int64, in (1, 2, 3) for upper, middle, lower class
Name : object, passenger name
Sex : object, "male" or "female"
Age : float64, age in years
SibSp : int64, # of siblings/spouses aboard the Titanic
Parch : int64, # of parents/children aboard the Titanic
Ticket : object, ticket number
Fare : float64, passenger fare
Cabin : object, cabin number
Embarked : object, ("C", "Q", "S") for ports of embarkation
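
For reference, a quick way to confirm the shapes and dtypes above (assuming train.csv and test.csv sit in the working directory):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape, test.shape)  # (891, 12) (418, 11)
print(train.dtypes)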

the very basic approach

No pipelines and no fancy processing; just dropping columns, imputing medians or modes, and some basic mapping.

Processing function:

import pandas as pd

def process(dataset: pd.DataFrame) -> pd.DataFrame:
    # impute numeric columns with the median, Embarked with the most common port
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())
    dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
    dataset["Embarked"] = dataset["Embarked"].fillna(dataset["Embarked"].mode()[0])
    # map the categorical columns to integers
    dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1})
    dataset["Embarked"] = dataset["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    # drop the text columns for now
    dataset = dataset.drop("Name", axis=1)
    dataset = dataset.drop("Cabin", axis=1)
    dataset = dataset.drop("Ticket", axis=1)
    return dataset

With a RandomForestClassifier(n_estimators=500) on the test set, I get a score of 0.76555.

Notable problems with this approach (among many):

Sex and Embarked are mapped to integers, which treats categorical values as if they were ordered/continuous.
Name, Cabin, and Ticket are dropped entirely, throwing away potentially useful information.
There’s no cross-validation or hyperparameter tuning.

Fitting and outputting:

from sklearn.ensemble import RandomForestClassifier

# fit on the processed training data, then predict on the processed test set
data = process(pd.read_csv("train.csv"))
model = RandomForestClassifier(n_estimators=500)
model.fit(data.drop("Survived", axis=1), data["Survived"])
test_X = pd.read_csv("test.csv")
y_pred = model.predict(process(test_X))
out_df = pd.DataFrame({"PassengerId": test_X["PassengerId"], "Survived": y_pred})
out_df.to_csv("output_basic.csv", index=False)

fixing the continuous/categorical problem

i.e., one-hot encoding.

def process(dataset: pd.DataFrame) -> pd.DataFrame:
    dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())
    dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
    dataset["Embarked"] = dataset["Embarked"].fillna(dataset["Embarked"].mode())
    # splitting and dropping sex column
    dataset["Male"] = (dataset["Sex"] == "male").astype("int8")
    dataset["Female"] = (dataset["Sex"] == "female").astype("int8")
    dataset = dataset.drop("Sex", axis=1)
    # splitting and dropping embarked column
    dataset["EmbarkedS"] = (dataset["Embarked"] == "S").astype("int8")
    dataset["EmbarkedQ"] = (dataset["Embarked"] == "Q").astype("int8")
    dataset["EmbarkedC"] = (dataset["Embarked"] == "C").astype("int8")
    dataset = dataset.drop("Embarked", axis=1)
    dataset = dataset.drop("Name", axis=1)
    dataset = dataset.drop("Cabin", axis=1)
    dataset = dataset.drop("Ticket", axis=1)
    return dataset
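
As an aside, pandas can build the same dummy columns in one call; a sketch (the generated column names differ slightly from mine):

# one-hot encode Sex and Embarked, dropping the original columns
dataset = pd.get_dummies(dataset, columns=["Sex", "Embarked"], dtype="int8")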

Using the same RandomForestClassifier(n_estimators=500), I get a score of 0.78468. Hmm… I really don’t know what’s up with what I was doing before.

The following problems (“areas of opportunity”) remain:

The dropped Name, Cabin, and Ticket data is still unused.
There’s still no cross-validation or hyperparameter tuning.

Using the dropped data sounds more interesting right now, so on to that.

being smarter with name, cabin, and ticket

First up: Name

The main approach I’ve seen is getting a Title from the name column, like Mr., Mrs., Dr., etc. I have no idea what else to do with a bunch of names so I’ll just do that.

Approach:

import re

TITLE_MAP = {
    "Mr"     : "Mr",
    "Mrs"    : "Mrs",
    "Mme"    : "Mrs",
    "Miss"   : "Miss",
    "Ms"     : "Miss",
    "Mlle"   : "Miss",
    "Master" : "Master",
    "Dr"     : "Dr",
    # everything else → "Other"
}

# the title is the first word after the comma, e.g. "Braund, Mr. Owen Harris" -> "Mr"
title_re = re.compile(r",\s*([^\. ]+)")

def extract_title(name: str) -> str:
    return title_re.search(name).group(1)
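
A sketch of how the title gets attached (Title is just my name for the new column; anything not in TITLE_MAP falls back to "Other", per the comment above):

dataset["Title"] = dataset["Name"].apply(extract_title).map(TITLE_MAP).fillna("Other")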

Next: Cabin

Deck information is retrievable from Cabin data. This excellent notebook examines the deck layout on the Titanic and identifies 4 main groups: ABC (which also absorbs the lone T cabin), DE, FG, and M for passengers with no cabin recorded.

Implementation:

def extract_deck(cabin) -> str:
    # missing cabins get their own group, "M"
    if not isinstance(cabin, str):
        return "M"
    # the deck is the letter prefix of the (first) cabin number, e.g. "C23 C25 C27" -> "C"
    deck = re.findall(r"[A-Za-z]", cabin)[0]
    if deck in "ABCT":
        return "ABC"
    if deck in "DE":
        return "DE"
    if deck in "FG":
        return "FG"
    return "M"

Finally: Ticket

Passengers who traveled together (families, friends, maids, etc.) shared the same ticket. Assigning a ticket frequency (i.e., how many people are on that passenger’s ticket) provides a measure of group size, a good predictor of survival.

The idea and implementation are both from the aforementioned notebook.

dataset['Ticket_Freq'] = dataset.groupby('Ticket')['Ticket'].transform('count')

New score with RandomForestClassifier(n_estimators=500): 0.75119. Hmm…

Okay, the problem might be that the one-hot encoding added too many columns, and I need to tweak hyperparameters a little. Trying again with RandomForestClassifier(n_estimators=1000, max_depth=5): 0.78708. Better, but I was expecting a bigger jump. That’s 90th percentile, though.

I haven’t been doing any CV so far; maybe it’s time to do a hyperparameter search sweep.

tuning hyperparameters

I performed a grid search across some of the primary random forest parameters:

param_grid = {
    "max_depth":      [None, 3, 4, 5, 6, 7, 8],
    "n_estimators":   [300, 500, 800, 1200, 1500],
    "max_leaf_nodes": [None, 5, 8, 10, 12, 14],
}
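
A sketch of what the sweep looks like with scikit-learn’s GridSearchCV (the 5-fold CV and accuracy scoring here are illustrative defaults, not necessarily the exact settings):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# exhaustive search over param_grid, scored by cross-validated accuracy
search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
search.fit(data.drop("Survived", axis=1), data["Survived"])
print(search.best_params_, search.best_score_)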

I’m not sure how max_depth and max_leaf_nodes interact/overlap, so I tested both. The best parameter set was max_depth=7, max_leaf_nodes=None, and n_estimators=800, with a training CV score of 0.83726. Unfortunately, the test score didn’t reflect that improvement, and dropped to 0.77272.

It’s surprising to me that tuning hyperparameters can cause not just a big change (~1.5 points), but a negative one. I guess it’s partly because the training set (~900 samples) and test set (~400) are relatively small.

It seems like further feature engineering will be required; that’ll include both changing current features (notably age and fare) and adding new ones.

By the way, I considered using the full dataset (train + test) for median information, but I think that would lead to leakage? I would probably get a better score, but I don’t want to overfit to the test set. Maybe that’s just what the top ~2%-3% are doing? That may be something to keep in mind if I start going crazy trying to get that 0.80.

more feature engineering

I’m realizing there are some problems with the features.

I’ve been trying to keep the train and test sets totally separate; that is, I’m changing and adding features in my process function in a way that’s totally blind to the data in the other set. That’s problematic because some of my features rely on data that also appears in the other set; for example, ticket frequency groups ticket holders together to get an idea of group size, but if groups are split between the train and test sets, they’ll appear smaller than they really are.

I think I need to be a little more intentional about using both datasets together.

Ticket_Freq is the main one. I’ll also use the medians of the full dataset for Age and Fare.
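
I haven’t shown how full_dataset gets built; a minimal sketch, assuming it’s just the two raw CSVs concatenated (note the index isn’t reset, so each row keeps the index from its original frame):

full_dataset = pd.concat([
    pd.read_csv("train.csv").drop("Survived", axis=1),
    pd.read_csv("test.csv"),
])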

Ticket_Freq adjustment:

passenger_min = min(data["PassengerId"]) - 1
passenger_max = max(data["PassengerId"])
dataset["Ticket_Freq"] = (
    full_dataset.groupby("Ticket")["Ticket"]
    .transform("count")[passenger_min:passenger_max]
)

It’s a little messy but it works. data is whatever set is passed to the function — wait. That should be dataset, not data. But when I fix that, my score drops by ~1.2 points. I had gotten a 0.79665 and now I’m back down.

How did I get my best score yet with a bug?? The test set was using completely wrong Ticket_Freq information.

I’m gonna assume it’s a fluke, but that’s confusing. At any rate, I’ll use the fixed version.

I’m also adding Family_Size and FarePerPerson features, which are what they sound like. Fare is for the whole ticket, which is shared within a group, so people on a shared ticket have a greater Fare than expected. Implementation:

# adding FarePerPerson
dataset["FarePerPerson"] = dataset["Fare"] / dataset["Ticket_Freq"]
# adding Family_Size
dataset["Family_Size"] = dataset["Parch"] + dataset["SibSp"] + 1

As a result, I’m dropping Fare.

I also played around with binning Ticket_Freq to help with generalization, but it didn’t seem to help much. For now I’m consistently getting ~0.785, and I’ll content myself with that. I wonder whether further improvement depends more on tuning to the test set than on having a general model with strong features.
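
For reference, the binning could look something like this (the bin edges here are illustrative, not the exact ones I tried):

dataset["Ticket_Freq_Bin"] = pd.cut(
    dataset["Ticket_Freq"],
    bins=[0, 1, 4, 20],  # example cut points: alone, small group, large group
    labels=False,
)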

Summary of what I learned while working on this: