Monday, March 27, 2017

Machine Learning, Donald Trump and Reddit

An excellent article by Trevor Martin on applying Latent Semantic Analysis to comments from Reddit's r/The_Donald.

(Source: Data Elixir)

Tuesday, January 31, 2017

My review of "Le Dé d'Einstein et le Chat de Schrödinger: Quand Deux Génies s'Affrontent"

Le dé d'Einstein et le chat de Schrödinger : Quand deux génies s'affrontent by Paul Halpern
My rating: 4 of 5 stars

Einstein is known as the father of relativity, as well as of the famous formula that links mass to energy. Schrödinger is known for his cat thought experiment and for his wave equation. But did you know that Einstein was also one of the fathers of quantum mechanics? And that both men sought an equation to unify gravity and electromagnetism?

This book is not a biography of either man, though aspects of their lives are presented. It shows how the two collaborated, how their friendship faltered and how they resumed their collaboration, as well as how their beliefs fit into the evolving physics of their time.

View all my reviews

Monday, January 16, 2017

My first Kaggle submission - Titanic Survival Rate

If you don't know it yet, Kaggle is a platform that organizes data-oriented competitions. There are small and large datasets, and there are competitions for fun, for swag or for money. If you are interested in data science, whether to learn or to improve, or simply looking for programming challenges, this is a good place to start.

The competitions can originate from a specific requester, but can also be more general: for example there is currently a competition to improve lung cancer detection. There are also introductory competitions meant for beginners.

When I registered on Kaggle, five years ago, it was to access some large datasets to play with in R, but I had never tried a competition.

I made my first submission in the Titanic: Machine Learning from Disaster competition. This is one of the competitions for learning (no ranking, no swag, no money). In it, you are presented with a training dataset that contains several features as well as whether each person survived. The goal is to build a model that predicts, for each person in the test set, whether they survived or died, based on the same features.

My notebook is publicly available here. Ultimately I opted for Python instead of R as I wanted to play with pandas, numpy and scikit-learn. Pandas is a library providing data frames to Python (among other things), numpy provides an Octave-like interface, structures and functions, and scikit-learn provides various algorithms revolving around machine learning, be they machine learning models (kNN, SVM, ...), preprocessing tools or encoders.
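To illustrate how the three libraries fit together, here is a minimal sketch; the toy data below is made up for the example and is not the actual Titanic dataset:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# pandas provides the data frame (toy, made-up values)
df = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "survived": [0, 1, 1, 1],
})

# numpy arrays are the common currency underneath
X = df[["age", "fare"]].to_numpy()
y = df["survived"].to_numpy()

# scikit-learn provides the machine-learning algorithm (here, kNN)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[30, 10.0]]))  # [1]
```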

I am quite pleased, as my refined model has an accuracy of around 80%. Some of the missing data are imputed using a random process, so the accuracy can vary slightly from run to run.
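For illustration, one simple way to impute a column at random is to draw replacements from the observed values; this is only a sketch of the idea, and my notebook's exact approach may differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()  # unseeded on purpose: results vary between runs

# Toy column with missing ages (illustrative values only)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

# Draw each missing value at random from the observed ones
observed = ages.dropna().to_numpy()
mask = ages.isna()
ages_filled = ages.copy()
ages_filled[mask] = rng.choice(observed, size=mask.sum())

print(ages_filled.isna().sum())  # 0: no missing values remain
```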

Feel free to comment on it!

What did I learn?

A lot! But more specifically ...

Exploration and visualization are important

This was actually my biggest mistake: I started the competition without really looking at the data, neither exploring it nor visualizing it. As a result, my first models' accuracy was below 75%. Looking at various examples and blogs, I saw people slicing the data and making it tell as much as possible, revealing in the process which features were important and which had less significance.
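As an example of the kind of quick exploration that helps, grouping the survival rate by a feature immediately shows how predictive it is. The numbers below are made up for the sketch, not taken from the actual dataset:

```python
import pandas as pd

# Tiny, made-up slice of Titanic-like data
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Survival rate per value of a feature: a large spread means the feature matters
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())
```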

Get to know the context

For example, in 1912, the principle of "women and children first" was still applied during accidents at sea, which is reflected in the data by a higher survival rate for women and children than for adult males.

Another example is looking at the ship's plans: the second and third classes were far from the deck with the lifeboats, which may also account for the lower survival rate of these populations. This information could also have been derived from the cabin number, which starts with the deck letter. Unfortunately, that information was seldom provided in the dataset.
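Extracting the deck letter from the cabin number, when present, is a one-liner in pandas; the cabin values below are illustrative:

```python
import pandas as pd

# Illustrative cabin values; most entries in the real dataset are missing
cabins = pd.Series(["C85", None, "E46", None, "B96 B98"])

# The first character of the cabin number is the deck letter
deck = cabins.str[0]
print(deck.dropna().tolist())  # ['C', 'E', 'B']
```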

Understand your tools, and shun "easy" simplicity

Initially, I used a LabelEncoder from sklearn.preprocessing, which translates each categorical value into an integer, one per category: for example "Black", "White", "Green", "Black", "Green" would become 0, 1, 2, 0, 2. This has the advantage of being quick, but it also makes nonsensical operations possible, such as computing the average of "Black" and "Green" as "White". I switched to a simple dummy encoding, i.e. converting each category into its own 0/1 variable, which improved the accuracy.
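To make the difference concrete, here is a small comparison of the two encodings. Note that LabelEncoder actually assigns integers in alphabetical order, so the real mapping here is Black=0, Green=1, White=2:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Black", "White", "Green", "Black", "Green"])

# LabelEncoder: one arbitrary integer per category, which implies a fake ordering
encoded = LabelEncoder().fit_transform(colors)
print(encoded)  # [0 2 1 0 1]

# Dummy (one-hot) encoding: one 0/1 column per category, no spurious arithmetic
dummies = pd.get_dummies(colors)
print(dummies.columns.tolist())  # ['Black', 'Green', 'White']
```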

Monday, December 26, 2016

Linux, GPU, games and Google Chrome

Recently, I have been having issues with a few games: once in a while, I like to play a bit of FPS to de-stress, but my frame rate was just abysmally low. What used to be a good 60 FPS went down to 20-30, leaving me with barely playable games, even though they ran great under Linux Mint 17.

After a bit of searching, I found something interesting: if Google Chrome is running, the frame rate is bad; if it is not, my games are back to normal. As Google Chrome uses the GPU for various tasks, I guess this was either a conflict (the two applications fighting for GPU resources) or Google Chrome setting some parameters that are detrimental to the games. Looking at Google Chrome's GPU status page (chrome://gpu) shows that a few features are either disabled or not available.

Tuesday, December 20, 2016

Upgrade from Mint 18 to Mint 18.1

A few days ago, news came out that Mint 18.1 was released and ready for install. This morning, my update manager prompted me to install the MintUpdate package, a sign that the new version was ready for prime time.

The release notes do not show anything that would affect me, and so far, so good.

The overall feeling is that nothing really changed: visually this is still the same, the system does not seem to be faster or slower.

A good point: the upgrade did not remove my PPAs and other additional repositories. That was one of the many little things that made me cringe during the update from 17 to 18, even if it is not very hard to put them back (all the information is in ~/Upgrade-Backup/APT/sources.list.d).

Comment if you have had any issues.

Monday, September 26, 2016

Pairi Daiza

Here are some pictures from my walk in Pairi Daiza.

Thursday, September 22, 2016

From Linux Mint 17.3 to 18

Linux Mint 18 has been out for the last few months, and I finally found the time to upgrade my computer.

The upgrade

As suggested, I started by using the Mint 18 ISO as a live USB, simply to make sure that every piece of equipment was supported. My biggest concern was my video card (an NVIDIA GeForce 9600 GT, purchased eight years ago). The Live USB session went without a hitch and the nouveau driver managed the card perfectly. However, I did not plan to use it: I had had too many issues with it under Fedora and Linux Mint 17.2. Still, it was good to know that, if need be, I could use that driver while shopping for a new video card.
The Live USB test being okay, I proceeded with the instructions provided by the Linux Mint team, starting with "take a backup." That step is often overlooked, but I really recommend it, especially since a TB external hard drive costs less than 100€. As a matter of fact, I have the habit of taking a weekly backup, usually on Friday evening, and whenever I import pictures from my camera.

The check phase went fine, then the download, which I let run overnight. In the morning, I only had to run the mintupgrade upgrade command, which performs the actual upgrade.

There, things were a bit less clear-cut: several errors and tracebacks related to mono appeared, but these seem okay and can be ignored. However, when the process finished, several packages were reported as not upgraded due to errors. I reran the upgrade process and got the same result. Here are the packages that were not updated:
  • cron
  • cups-browsed
  • cups-daemon
  • samba
  • rsyslog
  • ubuntu-minimal
  • irqbalance
  • acpid
  • avahi-daemon
  • avahi-utils
  • bluez
  • bluetooth
  • cups-core-drivers
  • cups
  • printer-driver-hpcups
  • hplip
  • printer-driver-postscript-hp
  • bluez-cups
  • gnome-bluetooth
  • gnome-user-share
  • libnss-mdns:amd64
  • nvidia-340
  • nvidia-340-uvm
  • printer-driver-gutenprint
  • printer-driver-splix
  • pulseaudio-module-bluetooth
Looking at the messages, the cause was pretty much the same for each: either the service could not be or was not restarted and the dpkg --configure step failed (for example bluez), or the package depended on such a package (for example bluetooth). I manually ran the corresponding service xx stop and dpkg --configure commands, and everything went fine.
Lastly, when I restarted, I had an issue when logging in: the following error message appeared before I was disconnected.
GLib-CRITICAL: g_key_file_free: assertion 'key_file != NULL' failed
After a few searches, it seems this is a known issue, and a "sudo apt-get purge cinnamon nemo && sudo apt-get install cinnamon" at the console later, I was back in business.

The first half-day

So far, so good. After the first restart, I reapplied the intel-microcode proprietary driver and re-added all my repositories and PPAs (Google Chrome, Julia, Sagemath and Darktable), which were lost during the upgrade. This is not a major issue and was quickly corrected, but it is a minor annoyance, especially if you have a lot of PPAs and repositories. There is also an upside to not porting over the PPAs: some of the applications may be linked against older versions of the libraries and might not work anymore after the upgrade, possibly resulting in broken dependencies and other errors.

I had to reboot once, to apply both the microcode driver and a kernel update that popped up and was not included in the upgrade process. Not too bad.

Visually, this version is very clean, and the Mint-X theme is very readable. It is pleasant and easy on the eyes, and while this is something that never struck me as an issue with 17.3, going to 18 makes a ton of difference.

During my first use, I was surprised: the active application is highlighted in green in the task bar, which I initially mistook for a request for the user's attention. After a minute or so, I got used to it.

From a performance standpoint, it feels about the same as my previous Linux Mint 17.3 install. But then, my machine is about eight years old and probably not the snappiest thing on earth.

The aim of my upgrade was to be able to install some more recent applications, especially Julia and Jupyter. For the former, I opted to re-install the PPA instead. For Jupyter, unfortunately, there is still nothing in the official repositories. pip install it is, then.


Everything considered, I am very pleased: while there were a few hiccups, the upgrade went without any major issues and the few kinks after reboot were easily fixed. This new version is much cleaner and visually easier on the eyes.

The upgrade process is still a bit too chaotic, and the hiccups along the way could be an issue for people new to Linux. Note that upgrading a live system is not the method recommended by the Mint team, which favors a fresh install.