Wednesday, April 16, 2014

Global warming, Python and statistics, part 2

In the previous post, we stopped after establishing the long-term trends for the high and low temperatures. This gave a very general overview of how the temperatures evolve on average over a long period of time, in this case a bit more than a century.

Let's refine a bit and determine how the temperature evolves for the same day of the year, namely January the first, April the first, July the first and October the first, for the various years in the dataset.
The temperature curves for these four days are quite distinct and well separated from each other, for both the high and the low temperatures. This reflects the fact that the weather on Earth follows a cycle with a period of about one year.
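
As a rough illustration of the selection described above, here is a minimal sketch that pulls out the readings for one calendar day across all years. The function name same_day_series and the synthetic three-year series are assumptions made for this example; they are not taken from the actual dataset or script.

```python
from datetime import date, timedelta
import math

def same_day_series(dates, temps, month, day):
    """Return (years, temps) for every record falling on the given month/day."""
    years, values = [], []
    for d, t in zip(dates, temps):
        if d.month == month and d.day == day:
            years.append(d.year)
            values.append(t)
    return years, values

# Synthetic stand-in for the real data: 3 years of daily "high" temperatures
# following a simple yearly cosine (coldest around January 1st).
start = date(2001, 1, 1)
dates = [start + timedelta(days=i) for i in range(3 * 365)]
temps = [10.0 - 12.0 * math.cos(2 * math.pi * d.timetuple().tm_yday / 365.25)
         for d in dates]

years, jan1 = same_day_series(dates, temps, 1, 1)
_, jul1 = same_day_series(dates, temps, 7, 1)
print(years, jan1, jul1)
```

With real data, each of the four series (January 1st, April 1st, July 1st, October 1st) would then be plotted against the year.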

Now, let's detrend the original data: this means removing the long-term growing trend so that the dataset contains only the short-term variations. We will also plot the curve representing, for each day of the year, the average over all years (all the January 1sts, all the January 2nds, and so on).
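
The detrending step can be sketched as follows, assuming the long-term trend is a straight line fitted by least squares. The helper detrend_linear and the synthetic series are hypothetical and only meant to show the mechanics.

```python
import numpy as np

def detrend_linear(t, y):
    """Fit y ~ slope*t + intercept and return (residuals, slope, intercept)."""
    slope, intercept = np.polyfit(t, y, 1)
    residuals = y - (slope * t + intercept)
    return residuals, slope, intercept

# Synthetic example: a small warming trend on top of a yearly cycle.
t = np.arange(10 * 365, dtype=float)  # time in days
y = 12.0 + 0.0001 * t + 8.0 * np.sin(2 * np.pi * t / 365.25)

residuals, slope, intercept = detrend_linear(t, y)

# Subtracting the fitted line also removes the average level,
# so the original mean is added back to keep the same temperature range.
detrended = residuals + y.mean()
```

The residuals of a least-squares line have zero mean, which is why the average has to be restored separately if the detrended curve is to stay on the original temperature scale.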

The averages for each day of the year were computed from the detrended datasets. Since detrending also removes the average, the average of the initial dataset has been added back. A sinusoidal fit has been added for the high and low temperature curves, and it matches the data quite well.
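
One possible way to obtain such a sinusoidal fit is sketched below. It assumes a fixed period of 365.25 days and uses a linear least-squares fit on a sine/cosine basis, which may differ from the fitting routine actually used for the plots. The function fit_sinusoid and the synthetic day-of-year averages are illustrative only.

```python
import numpy as np

PERIOD = 365.25  # assumed length of the yearly cycle, in days

def fit_sinusoid(day, temp, period=PERIOD):
    """Least-squares fit of temp ~ offset + amplitude*sin(2*pi*day/period + phase)."""
    w = 2 * np.pi * day / period
    # a*sin(w) + b*cos(w) is the same curve as amplitude*sin(w + phase)
    design = np.column_stack([np.ones_like(w), np.sin(w), np.cos(w)])
    (offset, a, b), *_ = np.linalg.lstsq(design, temp, rcond=None)
    amplitude = np.hypot(a, b)
    phase = np.arctan2(b, a)
    return offset, amplitude, phase

# Synthetic day-of-year averages: known sinusoid plus a little noise.
rng = np.random.default_rng(0)
day = np.arange(1, 366, dtype=float)
temp = 15.0 + 10.0 * np.sin(2 * np.pi * day / PERIOD - 1.9)
temp += rng.normal(0.0, 0.5, size=day.size)

offset, amplitude, phase = fit_sinusoid(day, temp)
print(offset, amplitude, phase)
```

Fixing the period keeps the problem linear in its unknowns, so no initial guess is needed. A nonlinear fit would be required if the period itself had to be estimated.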

At this point, we have (a) a long-term linear trend and (b) a yearly cycle. This information is useful to give an idea of how temperatures evolve, but these are only averages. Let's have a look at how the temperature is distributed around that average.

To do that, the data for the relevant days of the year (e.g. January 1st) will be corrected to remove the long-term trend and the yearly cycle. What remains is the difference from the average temperature. Let's do this with the same days as before: January 1st, April 1st, July 1st and October 1st.

Be careful though: the long-term trend AND the sinusoidal fit both contain the average value. If both are subtracted from the data as they are, the average value is removed twice. In this case, I decided to ignore the offset value from the sinusoidal fit. Other techniques are possible.
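
The correction described above can be sketched as follows. All parameter values and the name remove_trend_and_cycle are assumptions for illustration; the point is only that the sinusoid's offset is left out, because the trend line already carries the average level.

```python
import numpy as np

PERIOD = 365.25  # assumed length of the yearly cycle, in days

def remove_trend_and_cycle(t, y, slope, intercept, amplitude, phase, period=PERIOD):
    """Subtract the linear trend and the yearly cycle (without its offset)."""
    trend = slope * t + intercept  # already contains the average level
    cycle = amplitude * np.sin(2 * np.pi * t / period + phase)  # offset dropped
    return y - trend - cycle

# Synthetic series built from known (made-up) parameters plus noise.
rng = np.random.default_rng(0)
t = np.arange(5 * 365, dtype=float)
noise = rng.normal(0.0, 4.0, size=t.size)
slope, intercept = 0.0002, 12.0
amplitude, phase, offset = 9.0, -1.9, 12.0
y = slope * t + intercept + amplitude * np.sin(2 * np.pi * t / PERIOD + phase) + noise

residual = remove_trend_and_cycle(t, y, slope, intercept, amplitude, phase)

# Subtracting the sinusoid's offset as well would shift everything by -offset.
double_counted = residual - offset
print(residual.mean(), double_counted.mean())
```

In this sketch the residual is exactly the noise that was added, while the double-counted version is shifted down by the full offset.
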
Let's check how the temperature is distributed around the trend/cycle.
From this, it seems that the distribution is fairly close to normal. The parameters for the stacked distribution are a mean μ = 8.301619e-02 C (about 0.08 C) and a standard deviation σ = 4.411646e+00 C (about 4.41 C).
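
A quick numerical check of how close a set of residuals is to a normal distribution could look like the sketch below. The sample is synthetic (drawn from a normal distribution with parameters similar to the ones quoted above), and the function summarize is a hypothetical helper, not part of the original analysis.

```python
import numpy as np

def summarize(residuals):
    """Return mean, sample standard deviation and the fraction within 1 and 2 sigma."""
    residuals = np.asarray(residuals, dtype=float)
    mu = residuals.mean()
    sigma = residuals.std(ddof=1)
    within_1 = np.mean(np.abs(residuals - mu) <= sigma)
    within_2 = np.mean(np.abs(residuals - mu) <= 2 * sigma)
    return mu, sigma, within_1, within_2

# Synthetic stand-in for the stacked residuals.
rng = np.random.default_rng(0)
sample = rng.normal(0.08, 4.4, size=5000)

mu, sigma, within_1, within_2 = summarize(sample)
print(mu, sigma, within_1, within_2)
```

For a normal distribution, roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two, so fractions far from these would suggest a different shape.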

Now, let's be careful: even if the distribution of the differences around the trend+cycle looks normal, do not confuse this with an independent normal random variable. The weather, however random it may look, is not independent of everything else, and in particular not of its own history: the temperature on day D influences the temperature on day D+1. That is one of the reasons why weather forecasting is possible. If temperature were a purely random variable, no forecast could be made.
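
The dependence between day D and day D+1 can be measured with the lag-1 autocorrelation. The sketch below compares independent noise with a series that remembers part of the previous day. Both series are synthetic, and the persistence factor of 0.7 is an arbitrary assumption, not a value estimated from the temperature data.

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation between each value and the value one step later."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(0)

# Independent draws: today's value says nothing about tomorrow's.
white = rng.normal(0.0, 4.4, size=5000)

# Persistent series: each day keeps 70% of the previous day's anomaly.
persistent = np.empty_like(white)
persistent[0] = white[0]
for i in range(1, len(white)):
    persistent[i] = 0.7 * persistent[i - 1] + white[i]

r_white = lag1_autocorr(white)
r_persistent = lag1_autocorr(persistent)
print(r_white, r_persistent)
```

Both series can have a bell-shaped histogram, yet only the second one carries information from one day to the next, which is the distinction made above.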

For completeness, there is a slight difference between the distributions for the high temperatures and the low temperatures.

Temperature set Mean [C] Standard Deviation [C]
However, from now on, I will treat the temperature as a "somewhat random" variable.

[To be continued]