Monday, April 14, 2014

Global warming, python and statistics

Global warming ... some people believe in it, others don't. A growing number of scientists show data to prove its reality, detractors show other data sets. However, debating of who is right is not the point here.

The interest - for me at least - of global warming is that large sets of weather data are available. In this article, I will use the Global Historical Climatology Network - Daily data set (here for the readme.txt) hosted by the National Oceanographic and Atmospheric Administration. This data set contains the daily measures for various atmospheric parameters such as maximum and minimum temperatures, precipitation and so forth, for various stations identified by their ID. In the data set I downloaded, each file contains the results for a single station.

The data is line by line, each line representing a month, with fixed length. From the readme file, the structure is as follow:

Variable Columns Type
ID 1-11 Character
YEAR 12-15 Integer
MONTH 16-17 Integer
ELEMENT 18-21 Character
VALUE1 22-26 Integer
MFLAG1 27-27 Character
QFLAG1 28-28 Character
SFLAG1 29-29 Character
VALUE2 30-34 Integer
MFLAG2 35-35 Character
QFLAG2 36-36 Character
SFLAG2 37-37 Character
. . .
. . .
. . .
VALUE31 262-266 Integer
MFLAG31 267-267 Character
QFLAG31 268-268 Character
SFLAG31 269-269 Character
Non possible days (such as February 30) and missing measures have a value set to -9999. For this, I am interested in two variables: TMAX and TMIN. they are expressed in tenth of Celsius. 

As said, the file is organized line by line, with each line representing a month's worth of measures. Each line as 31 entries of the type value+3 flags. The function readFile() reads each line and return a tuple of 4 numpy arrays: the dates for the measures of the High temperatures, the high temperature measures, the dates for the measures of the Low temperatures and the low temperatures.

Let's start with the weather station in Central Park, New York NY.

The reason for returning the dates as well is that there may be missing measurements, i.e. days for which the Tmax, the Tmin or both may be missing from the file. 

Here comes the first graph with two subplots: the Tmax is the top one.

It is quite difficult to say anything about the data besides that "they look the same shifted by about 10C". In order to remove some of the "hairy" behavior, we will apply a sliding average on the data, with 7, 30, 182, 365 and 3650 days, a week, roughly a month, roughly six months, a year and roughly ten years. 

A sliding average is simply the average calculated over the last N points. This is used to smooth out the possible variations due to either the randomness or the seasonal variations. Let's use an example.

The following is a straight line (slope: 1, intercept: 10) with a superimposed Gaussian noise (mean: 0, deviation: 5). The X-range goes from 0 to 10, using 1001 points. Here is a graph (not "the" graph - guess why!)

From it, it "looks like" there is an increase. However, this is not really visible. Let's apply a sliding window to it, with size 10, 20 and 50.

Even at sample size 10 (in blue), the trend is visible. At sample size 50 (in red), the trend is impossible to miss.

This sliding window is performed in python with the numpy.convolve() function. This function does the convolution of two vectors (one-dimension arrays).

Back to our temperature samples, let's use this tool to smooth the temperatures. So, what do we expect?

The sliding average over a week won't probably change a lot - the weather cycle on earth is about a year, or 52 weeks. The same for a month. With a sliding window of six weeks, I expect the seasonal cycle will start showing. I really expect the trends to be visible with the year and ten years averaging. 

With the year sliding window, the trend appears but is still quite noisy. With 10 years, there is no doubt left: a trend appears for both the high and low temperatures. A visual estimation gives that for both measures, over the period the average has risen by about 2.25°C. Let's use a better method than "it looks like."

In the statistician's toolbox, a good tool to estimate the trend of a dataset is  linear regression or linear fitting, usually using the least-square method (minimising the vertical distance).  I will not explain all the values returned by scipy.stats.linregress(). Of these, I will use the slope, the intercept and the R-value.

Data source (High) Slope [°C/day] (High) Intercept [°C] (High) R-value [-] (High) Slope [°C/day] (low) Intercept [°C] (low) R-value [-] (low)
Raw dataset4.5933e-05-1.6242e+016.4263e-023.2326e-05-1.4788e+015.0001e-02
7-day sliding window average 4.5997e-05-1.6287e+016.8779e-023.2374e-05-1.4821e+015.2816e-02
30-day sliding window average 4.6027e-05-1.6303e+017.1649e-023.2403e-05-1.4837e+015.5005e-02
182-day sliding window average 4.6947e-05-1.6924e+011.1339e-013.3059e-05-1.5273e+018.7093e-02
365-day sliding window average 4.7696e-05-1.7464e+016.8588e-013.3669e-05-1.5716e+015.6593e-01
3650-day sliding window average 4.8371e-05-1.7851e+019.0518e-013.2297e-05-1.4709e+017.9373e-01

The first notable point is that the slopes are different for the low and for the high by about 1.6e-5 °C/day. Second - and it was expected - the R-value increases with the sliding window length: as the window increases, the data set is closer and closer to a line. As a result, the linear regression model matches more and more.

If the values for the slope don't look that much, remember that they are per day: over the course of 100 years, this represents about 1.68°C for the high temperature and 1.18°C for the low. The average temperature has risen by about 1.43°C over a century.

[To be continued]