Wednesday, April 16, 2014

Global warming, python and statistics part 2

In the previous post, we stopped after establishing the long-term trends for the high and low temperatures. This gave a very general overview of how the temperatures evolve on average over a long period of time, in this case a bit more than a century.

Let's refine this a bit and determine how the temperature evolves on the same day of the year, namely January 1st, April 1st, July 1st and October 1st, across the years in the dataset.
The temperature curve for each of these days is quite specific and well separated from the others, both for the high and the low temperatures. This is expected: the weather on Earth follows a cycle with a period of about one year.
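As a sketch of the per-day selection, assuming the dates come back from readFile() (the helper from the previous post) as a numpy datetime64[D] array alongside the temperatures - an assumption about its exact return type - the extraction could look like this (same_day_series is a hypothetical helper name):

```python
import numpy as np

def same_day_series(dates, temps, month, day):
    """Pick out the measurements falling on a given (month, day)
    across all years; dates is assumed to be a datetime64[D] array."""
    month_start = dates.astype('datetime64[M]')
    months = month_start.astype(int) % 12 + 1      # month number, 1..12
    days = (dates - month_start).astype(int) + 1   # day of month, 1..31
    mask = (months == month) & (days == day)
    return dates[mask], temps[mask]
```

From there, same_day_series(dates, temps, 1, 1) collects every January 1st in the record.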

Now, let's detrend the original data: this means removing the long-term growing trend so that the data set contains only the short-term variations. We will also plot the curve of the per-day averages across all years (all the January 1sts averaged together, all the January 2nds, and so on).


For the averages over a year, the detrended datasets were used. Since detrending also removes the average, the average of the initial dataset has been added back. A sinusoidal fit has been added for the high and low temperature curves; as it turns out, it fits quite well.
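The detrending and the sinusoidal fit can be sketched with numpy.polyfit and scipy.optimize.curve_fit. The series below is synthetic stand-in data (trend, yearly cycle and Gaussian noise with made-up parameters), not the actual station record:

```python
import numpy as np
from scipy.optimize import curve_fit

def yearly_cycle(t, amplitude, phase, offset):
    # One-year sinusoid; t is the time in days.
    return amplitude * np.sin(2.0 * np.pi * t / 365.25 + phase) + offset

# Synthetic stand-in for ~20 years of daily highs.
rng = np.random.default_rng(0)
t = np.arange(20 * 365, dtype=float)
temps = (4.6e-5 * t + yearly_cycle(t, 10.0, 1.0, 12.0)
         + rng.normal(0.0, 4.0, t.size))

# Detrend: remove the fitted line, then add the dataset average back.
slope, intercept = np.polyfit(t, temps, 1)
detrended = temps - (slope * t + intercept) + temps.mean()

# Fit the yearly sinusoid to the detrended series.
params, _ = curve_fit(yearly_cycle, t, detrended,
                      p0=(5.0, 0.0, detrended.mean()))
```

With a reasonable initial guess (p0), the fit recovers the amplitude and offset of the underlying cycle.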

At this point, we have (a) a long-term linear trend and (b) a yearly cycle. This information is useful to give an idea of how temperatures evolve, but these are only averages. Let's have a look at how the temperature is distributed around that average.

In order to do that, the relevant days of the year (e.g. January 1st) will be corrected to remove both the long-term trend and the yearly cycle, leaving only the difference from the average temperature. Let's do this with the same days as before: January 1st, April 1st, July 1st and October 1st.

Be careful though: the long-term trend AND the sinusoidal fit both contain the average value - if both are subtracted from the data, the average gets removed twice. In this case, I decided to ignore the offset value from the sinusoidal fit; other techniques are possible.
Let's check how the temperature is distributed around the trend/cycle.
From this, it seems that the distribution is fairly normal. The parameters for the stacked distribution are a mean μ = 8.301619e-02 °C and a standard deviation σ = 4.411646e+00 °C.
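The residual computation and a normality check can be sketched as follows, again on synthetic stand-in data with a noise level close to the observed σ; scipy.stats.normaltest is one way to put a number on the "looks fairly normal" impression:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in: long-term trend (average included), zero-offset
# yearly cycle, and Gaussian noise with a std close to the observed one.
rng = np.random.default_rng(1)
t = np.arange(50 * 365, dtype=float)
trend = 4.6e-5 * t + 10.0
cycle = 9.0 * np.sin(2.0 * np.pi * t / 365.25)
temps = trend + cycle + rng.normal(0.0, 4.4, t.size)

# The trend carries the average and the cycle has a zero offset, so the
# average is subtracted exactly once.
residuals = temps - trend - cycle

mu, sigma = residuals.mean(), residuals.std()
# D'Agostino-Pearson test: a large p-value does not reject normality.
stat, pvalue = stats.normaltest(residuals)
```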

Now, let's be careful: even if the distribution of the differences around the trend-plus-cycle looks normal, do not confuse this with a normal random variable. The weather, while it looks strangely random, is not independent of everything else, including its own history: the temperature at day D influences the temperature at day D+1. That is one of the reasons why weather forecasting is possible at all; if temperature were a purely random variable, there would be no forecasting the weather.

To be perfectly complete, there is a slight difference between the distributions for the high temperatures and the low temperatures. 


Temperature set   Mean [°C]        Standard Deviation [°C]
----------------------------------------------------------
High               1.684078e-01    4.777461e+00
Low               -2.375521e-03    4.015207e+00
However, from now on, I will treat it as a "somewhat random variable."

[To be continued]

Monday, April 14, 2014

Global warming, python and statistics

Global warming ... some people believe in it, others don't. A growing number of scientists show data to prove its reality, detractors show other data sets. However, debating who is right is not the point here.

The interest - for me at least - of global warming is that large sets of weather data are available. In this article, I will use the Global Historical Climatology Network - Daily data set (here for the readme.txt) hosted by the National Oceanic and Atmospheric Administration. This data set contains the daily measures of various atmospheric parameters such as maximum and minimum temperatures, precipitation and so forth, for various stations identified by their ID. In the data set I downloaded, each file contains the results for a single station.

The data is organized line by line, each fixed-length line representing one month. From the readme file, the structure is as follows:

------------------------------
Variable Columns Type
------------------------------
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
------------------------------
Impossible days (such as February 30) and missing measures have their value set to -9999. For this analysis, I am interested in two variables: TMAX and TMIN. They are expressed in tenths of degrees Celsius.

As said, the file is organized line by line, with each line representing a month's worth of measures. Each line has 31 entries of the form value + 3 flags. The function readFile() reads each line and returns a tuple of 4 numpy arrays: the dates of the high temperature measures, the high temperatures, the dates of the low temperature measures and the low temperatures.
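A possible implementation of readFile(), following the fixed-width layout from the readme (this is a sketch reconstructed from that description, not necessarily the exact original code; note that Python slices below are 0-based while the readme's columns are 1-based):

```python
import numpy as np
from datetime import date

def readFile(path):
    """Parse a GHCN-Daily station file into
    (high_dates, highs, low_dates, lows) numpy arrays."""
    highs, high_dates, lows, low_dates = [], [], [], []
    with open(path) as f:
        for line in f:
            year = int(line[11:15])       # readme columns 12-15
            month = int(line[15:17])      # readme columns 16-17
            element = line[17:21]         # readme columns 18-21
            if element not in ('TMAX', 'TMIN'):
                continue
            for day in range(31):
                start = 21 + 8 * day      # VALUEn + 3 flags = 8 chars
                value = int(line[start:start + 5])
                if value == -9999:        # missing or impossible day
                    continue
                try:
                    d = date(year, month, day + 1)
                except ValueError:        # safety net for bad dates
                    continue
                celsius = value / 10.0    # values are tenths of °C
                if element == 'TMAX':
                    high_dates.append(d); highs.append(celsius)
                else:
                    low_dates.append(d); lows.append(celsius)
    return (np.array(high_dates), np.array(highs),
            np.array(low_dates), np.array(lows))

# Quick check on a hand-built sample line (TMAX = 12.3 °C on day 1;
# the station ID is just an 11-character placeholder).
import tempfile
vals = [123] + [-9999] * 30
line = "USW00094728" + "2000" + "01" + "TMAX" + "".join("%5d   " % v for v in vals)
with tempfile.NamedTemporaryFile("w", suffix=".dly", delete=False) as f:
    f.write(line + "\n")
high_dates, highs, low_dates, lows = readFile(f.name)
```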

Let's start with the weather station in Central Park, New York NY.

The reason for returning the dates as well is that there may be missing measurements, i.e. days for which Tmax, Tmin or both are missing from the file.

Here comes the first graph with two subplots: the Tmax is the top one.


It is quite difficult to say anything about the data besides "they look the same, shifted by about 10°C". In order to remove some of the "hairy" behavior, we will apply a sliding average to the data, with windows of 7, 30, 182, 365 and 3650 days: a week, roughly a month, roughly six months, a year and roughly ten years.

A sliding average is simply the average calculated over the last N points. It is used to smooth out variations due to randomness or to seasonal effects. Let's use an example.

The following is a straight line (slope: 1, intercept: 10) with a superimposed Gaussian noise (mean: 0, deviation: 5). The X-range goes from 0 to 10, using 1001 points. Here is a graph (not "the" graph - guess why!)
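Generating that example takes a few numpy calls; the seed below is arbitrary, and the noise is random, so every run draws a different graph - which is why it is "a" graph rather than "the" graph:

```python
import numpy as np

rng = np.random.default_rng(42)               # arbitrary seed
x = np.linspace(0.0, 10.0, 1001)              # 1001 points from 0 to 10
clean = 1.0 * x + 10.0                        # slope 1, intercept 10
noisy = clean + rng.normal(0.0, 5.0, x.size)  # Gaussian noise: mean 0, std 5
```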


From it, it "looks like" there is an increase. However, this is not really visible. Let's apply a sliding window to it, with size 10, 20 and 50.


Even at sample size 10 (in blue), the trend is visible. At sample size 50 (in red), the trend is impossible to miss.

This sliding window is performed in Python with the numpy.convolve() function, which computes the convolution of two vectors (one-dimensional arrays).
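A sliding average based on numpy.convolve can be as short as this (mode='valid' keeps only the positions where the window fully overlaps the data, so the output is shorter than the input by N-1 points):

```python
import numpy as np

def sliding_average(data, n):
    """Average over a window of n points, via convolution
    with a flat kernel of weight 1/n."""
    kernel = np.ones(n) / n
    return np.convolve(data, kernel, mode='valid')
```

For example, sliding_average(np.arange(5.0), 2) averages consecutive pairs, giving [0.5, 1.5, 2.5, 3.5].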

Back to our temperature samples, let's use this tool to smooth the temperatures. So, what do we expect?

The sliding average over a week probably won't change a lot - the weather cycle on Earth is about a year, or 52 weeks. The same goes for a month. With a sliding window of six months, I expect the seasonal cycle to start being smoothed out. I really expect the trends to become visible with the one-year and ten-year averaging.



With the one-year sliding window, the trend appears but is still quite noisy. With ten years, there is no doubt left: a trend appears for both the high and low temperatures. A visual estimate suggests that, for both measures, the average has risen by about 2.25°C over the period. Let's use a better method than "it looks like."

In the statistician's toolbox, a good tool to estimate the trend of a dataset is linear regression (linear fitting), usually using the least-squares method (minimising the vertical distances). I will not explain all the values returned by scipy.stats.linregress(); of these, I will use the slope, the intercept and the R-value.
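A quick sanity check of scipy.stats.linregress on a noiseless synthetic line (the slope and intercept are arbitrary), which the fit should recover exactly with an R-value of 1:

```python
import numpy as np
from scipy.stats import linregress

# A perfect line: no noise, so the regression is exact.
t = np.arange(1000, dtype=float)
temps = 4.6e-5 * t - 16.2
fit = linregress(t, temps)
# fit.slope, fit.intercept and fit.rvalue are the values used below.
```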


Data source                       Slope [°C/day]  Intercept [°C]  R-value [-]   Slope [°C/day]  Intercept [°C]  R-value [-]
                                  (high)          (high)          (high)        (low)           (low)           (low)
---------------------------------------------------------------------------------------------------------------------------
Raw dataset                       4.5933e-05      -1.6242e+01     6.4263e-02    3.2326e-05      -1.4788e+01     5.0001e-02
7-day sliding window average      4.5997e-05      -1.6287e+01     6.8779e-02    3.2374e-05      -1.4821e+01     5.2816e-02
30-day sliding window average     4.6027e-05      -1.6303e+01     7.1649e-02    3.2403e-05      -1.4837e+01     5.5005e-02
182-day sliding window average    4.6947e-05      -1.6924e+01     1.1339e-01    3.3059e-05      -1.5273e+01     8.7093e-02
365-day sliding window average    4.7696e-05      -1.7464e+01     6.8588e-01    3.3669e-05      -1.5716e+01     5.6593e-01
3650-day sliding window average   4.8371e-05      -1.7851e+01     9.0518e-01    3.2297e-05      -1.4709e+01     7.9373e-01

The first notable point is that the slopes are different for the low and for the high temperatures, by roughly 1.4e-5 to 1.6e-5 °C/day depending on the window. Second - and this was expected - the R-value increases with the sliding window length: as the window grows, the data set gets closer and closer to a line, so the linear regression model matches better and better.

If the values of the slope don't look like much, remember that they are per day: over the course of 100 years, this represents about 1.68°C for the high temperature and 1.18°C for the low. The average temperature has thus risen by about 1.43°C over a century.
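The conversion from a per-day slope to a per-century rise is just a multiplication by the number of days in a century; using the raw-dataset slopes from the table above:

```python
# °C/day to °C/century, using 365.25 days per year.
days_per_century = 365.25 * 100
high_rise = 4.5933e-05 * days_per_century      # high temperatures
low_rise = 3.2326e-05 * days_per_century       # low temperatures
average_rise = (high_rise + low_rise) / 2.0    # average of the two
```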

[To be continued]


Friday, April 11, 2014

Possible tetraquark particles spotted at the LHC

Scientists at the LHC may have found another type of matter in the form of tetraquarks, or particles made of 4 quarks. If confirmed, this would be an important step in understanding how matter behaves at its most fundamental level. 

The paper on arXiv:  arXiv:1404.1903v1 [hep-ex]

Wednesday, April 9, 2014

Chinese Physicists provide a lower bound for the speed of "spooky action at a distance"

This concerns the speed at which information is transferred between two entangled particles. The paper is really interesting and the lower bound - if correct - is four orders of magnitude higher than the speed of light.

The paper on arxiv: arXiv:1303.0614v2 [quant-ph].

Wednesday, April 2, 2014

Neil Armstrong on Being a Nerd

From a great man ...



Saturday, March 29, 2014

Mac OS X file versus Mac Ports file

So recently, I started having an issue - the "file" command picked up by default on my Mac OS machine wouldn't identify any file and returned an error message instead.

Trying the native Mac OS version worked, but the one installed by MacPorts (/opt/local/bin/file) would just report a regex error.




$ /usr/bin/file magic
magic: magic text file for file(1) cmd
$ /opt/local/bin/file magic
magic: ERROR: line 19439: regex error 17, (illegal byte sequence)

A closer look at the magic file that lives in /opt/local/share/misc/ revealed that line 19439 is

0       regex/s         \\`(\r\n|;|[[]|\xFF\xFE)

Disabling the line with a "#" and recompiling the magic file with "file -C -m <magic file>" solved the issue.