Monday, January 16, 2017

My first Kaggle submission - Titanic Survival Rate

If you don't know it yet, Kaggle is a platform that organizes data-oriented competitions. There are small and large datasets, and there are competitions for fun, for swag or for money. If you are interested in data science, want to learn or to improve, or are simply looking for programming challenges, this is a good place to start.

The competitions can originate from a specific requester, but can also be more general: for example there is currently a competition to improve lung cancer detection. There are also introductory competitions meant for beginners.

When I registered on Kaggle five years ago, it was to access some large datasets to play with in R; I had not tried any competition until now.

I made my first submission in the Titanic: Machine Learning from Disaster competition. This is one of the competitions for learning (no ranking, no swag, no money). In it, you are presented with a training dataset that contains several features as well as whether each person survived. The goal is to build a model that predicts, for each person in the test set, whether they survived or died, based on the same features.

My notebook is publicly available here. Ultimately I opted for Python instead of R, as I wanted to play with pandas, numpy and scikit-learn. Pandas is a library that provides data frames (among other things) to Python, numpy offers an Octave-like interface, structures and functions, and scikit-learn provides various tools revolving around machine learning, be they learning algorithms (kNN, SVM ...), preprocessing or encoders.

I am quite pleased, as my refined model has an accuracy of around 80%. Some of the missing data are imputed using a random process, so the accuracy can change from one run to the next.
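
To illustrate why random imputation makes the score move between runs, here is a minimal sketch. It is not the code from the notebook: the helper name impute_by_sampling, the sampling strategy (drawing from the observed values) and the toy ages are all assumptions made for the example. Fixing the seed makes a run repeatable; leaving it unset gives a different fill each time.

```python
import numpy as np
import pandas as pd

def impute_by_sampling(series, seed=None):
    """Fill missing values by drawing at random from the observed values.

    Hypothetical helper: one possible random imputation, assumed for illustration.
    """
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    # One random draw (with replacement) per missing entry.
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Toy values, invented for the example (not taken from the competition data).
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0])
imputed = impute_by_sampling(ages, seed=0)
print(imputed)
```

Calling the helper twice with the same seed returns the same values, which is handy when comparing models.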

Feel free to comment on it!

What did I learn?

A lot! But more specifically ...

Exploration and visualization are important

This was actually my biggest mistake: I started the competition without really looking at the data, neither exploring nor visualizing it. As a result, the accuracy of my first models was below 75%. Looking at various examples and blogs, I saw people slicing the data and making it tell as much as possible, revealing in the process which features were important and which ones had less significance.
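
A first exploration step of this kind can be as simple as computing the survival rate per value of a feature. The sketch below assumes the competition's usual column names (Survived, Sex, Pclass) and uses a handful of invented rows rather than the real file; the helper survival_rate_by is hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration; the real training file is much larger.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 2, 2, 3],
})

def survival_rate_by(df, feature):
    """Share of survivors within each value of `feature` (hypothetical helper)."""
    return df.groupby(feature)["Survived"].mean()

by_sex = survival_rate_by(train, "Sex")
by_class = survival_rate_by(train, "Pclass")
print(by_sex)
print(by_class)
```

A feature whose groups show very different rates is a good candidate for the model; one whose groups look alike probably carries less signal.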

Get to know the context

For example, in 1912, the principle of "women and children first" still applied during accidents at sea, which is reflected in the data by a higher survival rate for women and children than for adult males.
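
One way to feed that context to a model is a derived feature that separates children, women and men. This is only a sketch: the age cut-off, the helper passenger_group and the sample rows are assumptions for the example, not something taken from the notebook or the dataset.

```python
import numpy as np
import pandas as pd

CHILD_MAX_AGE = 15  # hypothetical cut-off, chosen only for illustration

def passenger_group(sex, age):
    """Label a passenger 'child', 'woman' or 'man'; unknown ages fall back to sex."""
    if pd.notna(age) and age <= CHILD_MAX_AGE:
        return "child"
    return "woman" if sex == "female" else "man"

# Invented rows using the competition's usual column names (Sex, Age).
passengers = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [40.0, 8.0, np.nan, 30.0],
})
passengers["Group"] = [
    passenger_group(sex, age)
    for sex, age in zip(passengers["Sex"], passengers["Age"])
]
print(passengers)
```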

Another example is looking at the ship's plans: the second- and third-class quarters were far from the deck with the lifeboats, which may also account for the lower survival rate of these populations. This information could also have been derived from the cabin number, which starts with the deck letter. Unfortunately, the cabin number is seldom provided in the dataset.
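
Extracting the deck from the cabin number could look like the sketch below. It assumes the cabin is a string whose first character is the deck letter, and it maps missing values to an explicit "Unknown" category, which is one possible choice among others. The helper deck_from_cabin and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def deck_from_cabin(cabin):
    """Return the leading deck letter of a cabin number, or 'Unknown' if absent."""
    if pd.isna(cabin):
        return "Unknown"
    text = str(cabin).strip()
    return text[0] if text else "Unknown"

# Illustrative values in the cabin-number format; NaN stands for a missing cabin.
cabins = pd.Series(["C85", np.nan, "E46", "B96 B98"])
decks = cabins.map(deck_from_cabin)
print(decks)
```

With so many cabins missing, the "Unknown" group will dominate, so the feature may help less than the ship's plans suggest.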

Understand your tools, and shy away from "easy" simplicity

Initially, I used a LabelEncoder from sklearn.preprocessing, which translates a categorical variable into one integer per category. For example, "Black","White","Green","Black","Green" would become 0,2,1,0,1, since the categories are numbered in sorted order (Black=0, Green=1, White=2). This has the advantage of being quick, but unfortunately it also makes possible things such as computing the average of "Black" and "White" as "Green", which makes no sense. I switched to a simple dummy encoding, i.e. converting each category of each factor to its own 0/1 variable, which improved the accuracy.
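
The difference between the two encodings is easy to see on the colour example. The sketch below uses LabelEncoder from scikit-learn and pandas.get_dummies as one way to do the dummy encoding; the notebook may have done it differently. Note that LabelEncoder numbers the categories in sorted order, so Black, Green and White get 0, 1 and 2.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colours = pd.Series(["Black", "White", "Green", "Black", "Green"], name="colour")

# Integer codes: compact, but they imply an order and distances that do not exist.
codes = LabelEncoder().fit_transform(colours)
print(list(codes))  # [0, 2, 1, 0, 1]

# Dummy encoding: one 0/1 column per category, with no implied order.
dummies = pd.get_dummies(colours, dtype=int)
print(dummies)
```

Each row of the dummy table has exactly one 1, so no arithmetic on the columns can turn two colours into a third.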