Sunday, May 14, 2017

Web Application Hacker's Handbook

Here is a review by Jean: https://www.goodreads.com/review/show/1896919107

Monday, March 27, 2017

Machine Learning, Donald Trump and Reddit

An excellent article by Trevor Martin on applying Latent Semantic Analysis to comments from Reddit's r/The_Donald.

(Source: Data Elixir)

Tuesday, January 31, 2017

My review of "Le Dé d'Einstein et le Chat de Schrödinger: Quand Deux Génies s'Affrontent"

Le dé d'Einstein et le chat de Schrödinger : Quand deux génies s'affrontent by Paul Halpern
My rating: 4 of 5 stars

Einstein is known as the father of relativity, as well as of the famous formula that links mass to energy. Schrödinger is known for his cat "thought experiment" and for his wave equation. But did you know that Einstein was also one of the fathers of quantum mechanics? And that both men searched for an equation unifying gravity and electromagnetism?

This book is not a biography of either man, though aspects of their lives are presented. It shows how the two collaborated, how their friendship broke down and was later repaired, and how their beliefs fit into the evolving physics of their time.

View all my reviews

Monday, January 16, 2017

My first Kaggle submission - Titanic Survival Rate

If you don't know it yet, Kaggle is a platform that organizes data-oriented competitions. There are small and large datasets, and there are competitions for fun, for swag or for money. If you are interested in data science, want to learn or improve, or are simply looking for programming challenges, this is a good place to start.

The competitions can originate from a specific sponsor, but can also be more general: for example, there is currently a competition to improve lung cancer detection. There are also introductory competitions meant for beginners.

When I registered on Kaggle five years ago, it was to access some large datasets to play with in R, but I had never tried a competition.

I made my first submission in the Titanic: Machine Learning from Disaster competition. This is one of the competitions for learning (no ranking, no swag, no money). You are presented with a training dataset that contains several features, as well as whether each person survived. The goal is to build a model that predicts, for each person in the testing set, whether they survived or died, based on the same features.

My notebook is publicly available here. Ultimately I opted for Python instead of R, as I wanted to play with pandas, NumPy and scikit-learn. Pandas is a library providing data frames to Python (among other things), NumPy provides Octave-like structures and functions, and scikit-learn offers various algorithms revolving around machine learning, be it the learning algorithms themselves (kNN, SVM, ...), preprocessing or encoders.
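
To give a flavor of how the three fit together, here is a minimal sketch (not the notebook itself; it assumes the competition's train.csv with its Survived, Pclass, Sex and Age columns):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the competition's training data.
train = pd.read_csv("train.csv")

# A few features; Sex is dummy-encoded, missing Ages filled with the median.
X = pd.get_dummies(train[["Pclass", "Sex", "Age"]], columns=["Sex"])
X["Age"] = X["Age"].fillna(X["Age"].median())
y = train["Survived"]

# Estimate the accuracy with 5-fold cross-validation.
model = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())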

I am quite pleased, as my refined model has an accuracy of around 80%. Some of the missing data are imputed using a random process, so the accuracy can change from run to run.
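
The random imputation itself can be as simple as drawing the missing ages from the distribution of the observed ones; a sketch of the idea (not the exact code from the notebook):

import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# Replace each missing Age by a draw from the empirical distribution
# of the observed Ages; the result changes from one run to the next.
observed = train["Age"].dropna().values
missing = train["Age"].isna()
train.loc[missing, "Age"] = np.random.choice(observed, size=missing.sum())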

Feel free to comment on it!

What did I learn?

A lot! But more specifically ...

Exploration and visualization are important

This was actually my biggest mistake: I started the competition without really looking at the data, neither exploring nor visualizing it. As a result, my first models' accuracy was below 75%. Looking at various examples and blogs, I saw people slicing the data and making it tell as much as possible, revealing in the process which features were important and which had little significance.
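
A couple of pandas one-liners already tell a large part of the story (same train.csv as above):

import pandas as pd

train = pd.read_csv("train.csv")

# Survival rate by sex, by class, and by both at once: the gaps are striking.
print(train.groupby("Sex")["Survived"].mean())
print(train.groupby("Pclass")["Survived"].mean())
print(pd.crosstab(train["Sex"], train["Pclass"],
                  values=train["Survived"], aggfunc="mean"))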

Get to know the context

For example, in 1912, the principle of "women and children first" still prevailed during accidents at sea, which is reflected in the data by a higher survival rate for women and children than for adult males.

Another example is looking at the ship's plans: the second- and third-class areas were far from the deck with the lifeboats, which may also account for the lower survival rate of these populations. This could also have been derived from the cabin number, which starts with the deck letter. Unfortunately, that information is seldom provided in the dataset.
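
Extracting the deck from the cabin number is a one-liner; a sketch of what that feature could look like (only useful for the minority of rows where Cabin is filled in):

import pandas as pd

train = pd.read_csv("train.csv")

# The deck is the leading letter of the cabin number, e.g. "C85" -> "C".
train["Deck"] = train["Cabin"].str[0]

# Survival rate per deck, and how often the cabin is actually known.
print(train.groupby("Deck")["Survived"].mean())
print(1 - train["Cabin"].isna().mean())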

Understand your tools, and shun "easy" simplicity

Initially, I used a LabelEncoder from sklearn.preprocessing, which translates a categorical variable into one integer per category: "Black","White","Green","Black","Green" becomes 0,2,1,0,1 (the classes are sorted alphabetically, so Black=0, Green=1, White=2). This has the advantage of being quick, but unfortunately, it also makes nonsensical operations possible, such as computing the average of "Black" and "White" as "Green". I switched to a simple dummy encoding, i.e. converting each factor into 0/1 variables, which improved the accuracy.
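
The difference in a nutshell, on a toy example (a sketch; the color values are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Black", "White", "Green", "Black", "Green"])

# LabelEncoder: one integer per category (classes sorted alphabetically),
# which invents an ordering and an arithmetic that mean nothing.
print(LabelEncoder().fit_transform(colors))   # [0 2 1 0 1]

# Dummy encoding: one 0/1 column per category, no spurious arithmetic.
print(pd.get_dummies(colors))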

Monday, December 26, 2016

Linux, GPU, games and Google Chrome

Recently, I have been having issues with a few games. Once in a while, I like to play an FPS to de-stress, but my frame rate was abysmal: what used to be a solid 60 FPS dropped to 20-30, leaving the games barely playable, even though they ran great under Linux Mint 17.

After a bit of searching, I found something interesting: if Google Chrome is running, the frame rate is bad; if it is not, my games are back to normal. As Google Chrome uses the GPU for various tasks, I guess this was either a conflict (the two applications fighting for GPU resources) or Google Chrome setting some parameters that are detrimental to the games. Looking at Google Chrome's GPU status page (chrome://gpu) shows that a few features are either disabled or not available.

Tuesday, December 20, 2016

Upgrade from Mint 18 to Mint 18.1

A few days ago, there was news that Mint 18.1 was out and ready for install. This morning, my update manager prompted me to install the MintUpdate package, a sign that the new version is ready for prime time.

The release notes do not show anything that would affect me, and so far, so good.

The overall feeling is that nothing really changed: visually it is still the same, and the system does not seem to be faster or slower.

A good point: the upgrade did not remove my PPAs and other additional repositories. That removal was one of the many little things that made me cringe during the upgrade from 17 to 18, even though it is not very hard to put everything back (all the information is saved in ~/Upgrade-Backup/APT/sources.list.d).

Comment if you have had any issues.

Monday, September 26, 2016

Pairi Daiza

Here are some pictures from my walk in Pairi Daiza.