Monday, March 27, 2017

Machine Learning, Donald Trump and Reddit

An excellent article by Trevor Martin on using Latent Semantic Analysis on the Reddit r/The_Donald comments.

(Source: Data Elixir)






Tuesday, January 31, 2017

My review of "Le Dé d'Einstein et le Chat de Schrödinger: Quand Deux Génies s'Affrontent"

Le dé d'Einstein et le chat de Schrödinger : Quand deux génies s'affrontent by Paul Halpern
My rating: 4 of 5 stars

Einstein is known as the father of Relativity, as well as of the famous formula that links mass to energy. Schrödinger is known for his cat "thought experiment" and for his wave equation. But did you know that Einstein was also one of the fathers of quantum mechanics? And that both men searched for an equation to unify gravity and electromagnetism?

This book is not a biography of either man, though aspects of their lives are presented. It describes how the two collaborated, how their friendship fell apart and how they resumed their collaboration, as well as how their beliefs fit the evolving physics.


Monday, January 16, 2017

My first Kaggle submission - Titanic Survival Rate

If you don't know it yet, Kaggle is a platform that organizes data-oriented competitions. There are small and large datasets, and there are competitions for fun, for swag or for money. If you are interested in data science, want to learn or to improve, or are simply looking for programming challenges, this is a good place to start.

The competitions can originate from a specific requester, but can also be more general: for example there is currently a competition to improve lung cancer detection. There are also introductory competitions meant for beginners.

When I registered on Kaggle, five years ago, it was to access some large datasets to play with in R, but I had never tried a competition.

I made my first submission in the Titanic: Machine Learning from Disaster competition. This is one of the competitions for learning (no ranking, no swag, no money). You are given a training dataset that contains several features as well as whether each person survived. The goal is to build a model that predicts, for each person in the test set, whether they survived or died, based on the same features.

My notebook is publicly available here. Ultimately I opted for Python instead of R, as I wanted to play with pandas, numpy and scikit-learn. Pandas is a library that brings data frames to Python (among other things), numpy provides an Octave-like interface with its array structures and functions, and scikit-learn offers various tools revolving around machine learning, be they learning algorithms (kNN, SVM ...), preprocessing or encoders.

I am quite pleased, as my refined model has an accuracy of around 80%. Some of the missing data are imputed using a random process, so the accuracy can vary from one run to the next.

Feel free to comment on it!

What did I learn?

A lot! But more specifically ...

Exploration and visualization are important

This was actually my biggest mistake: I started the competition without really looking at the data, neither exploring it nor visualizing it. As a result, the accuracy of my first models was below 75%. Looking at various examples and blogs, I saw people splitting the data and making it tell as much as possible, revealing in the process which features were important and which ones had less significance.

Get to know the context

For example, in 1912, during sea accidents, the principle of "women and children first" was still in place, which is reflected in the data by a higher survival rate for women and children than for adult males.
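A minimal sketch of how such a check could look with pandas. The rows below are made up for illustration and are not the actual Kaggle data; the column names Sex, Age and Survived are assumed to match the training file, and the helper name survival_rate_by_group and the cutoff of 16 years for "child" are hypothetical choices.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male", "male", "female"],
    "Age": [29, 45, 33, 50, 40, 35, 60, 22, 8, 5],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})

def survival_rate_by_group(df, child_max_age=16):
    """Return the share of survivors among children, women and men.

    child_max_age is an assumed cutoff, not something the dataset defines.
    """
    # Start from the sex of each passenger ...
    group = df["Sex"].map({"female": "woman", "male": "man"})
    # ... then relabel anyone below the cutoff as a child.
    # Rows with a missing age keep their sex-based label.
    group = group.mask(df["Age"] < child_max_age, "child")
    return df.groupby(group)["Survived"].mean()

rates = survival_rate_by_group(toy)
print(rates)
```

On the real data, the same grouping can be fed to a bar plot to see the differences at a glance.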

Another example comes from looking at the ship's plans: the second- and third-class quarters were far from the deck with the lifeboats, which may also account for the lower survival rate of these populations. This could also have been derived from the cabin number, which starts with the deck letter. Unfortunately, that information is seldom provided in the dataset.
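As a rough sketch, the deck letter could be pulled out of the cabin number like this. The cabin values are toy examples, the helper name deck_from_cabin is hypothetical, and filling the gaps with an "Unknown" category is just one possible way to handle the missing values.

```python
import pandas as pd

# Toy cabin values for illustration; None marks a missing cabin number.
cabins = pd.Series(["C85", None, "E46", "B28", None], name="Cabin")

def deck_from_cabin(cabin):
    """Return the deck letter (first character), or "Unknown" if missing."""
    return cabin.str[0].fillna("Unknown")

deck = deck_from_cabin(cabins)
print(deck.tolist())
```

With so many missing cabins, most passengers would end up in the "Unknown" category, which limits how much this feature can tell.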

Understand your tools, and shy away from "easy" simplicity

Initially, I used a LabelEncoder from sklearn.preprocessing, which translates a categorical variable into one integer per category. For example, "Black","White","Green","Black","Green" would become 0,2,1,0,1 (the integers follow the sorted order of the labels). This has the advantage of being quick, but unfortunately it also makes possible things such as computing the average of "Black" and "White" as "Green", which makes no sense. I switched to a simple dummy encoding, i.e. converting each factor level to its own 0/1 variable, which improved the accuracy.
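The difference between the two encodings can be sketched on the color example. This is only an illustration, not the code from my notebook; pandas.get_dummies is used here as one way to build the 0/1 columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Black", "White", "Green", "Black", "Green"], name="color")

# Label encoding: one integer per category.
# The integers follow the sorted order of the labels, so they imply
# an ordering (and distances) that the colors do not really have.
encoder = LabelEncoder()
labels = encoder.fit_transform(colors)
print(list(encoder.classes_))
print(labels.tolist())

# Dummy encoding: one 0/1 column per category, with no implied order.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```

Each row of the dummy table has exactly one 1, so a model can no longer treat one color as lying "between" two others.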