I recently came across this post from 2009, showing that the total returns companies achieved and the remuneration packages of their CEOs had no obvious relation to each other. This kind of article, showing that a correlation does not exist, is relatively unusual in my experience.

Far more common are articles like this one, by Eugenio Proto and Aldo Rustichini, purporting to show new evidence about the link between life satisfaction and GDP. Even if you accept whatever methodology they have used to derive their life satisfaction index (I don’t think we can get no satisfaction currently, see my previous blog), you then have to accept their defining a feature of the data that is entirely created by their regression analysis tool (the so-called “bliss point”) before going on to discuss what its implications might be.

The article’s references are stuffed with well-known economists’ papers, and I am sure that one of its conclusions in particular, that increases in GDP beyond a certain point may not increase life satisfaction in developed countries, will lead to the research paper underlying the article being widely cited, as this is a politically contentious area. However, this kind of thing is really nothing more than an economic Rorschach test: the meaning of the ink spots often depends on what you want to see.

But such studies are not often treated in this way. Why? Well, what if one of the interpretations of the ink spots were backed up by some mathematics which could be run very quickly on any ink spot pattern by anyone with a computer? There is nothing biased about the mathematics, after all. This is what regression tools give us.

Regression is taught to sixth formers (I have taught it myself) as a way of finding best fit lines to data in a less subjective way than drawing lines by eye. The best fit straight line in a scatter graph is arrived at by looking at the differences between the x and y coordinates of each point and the average x and y values respectively. For y on x (i.e. assuming y is a function of x; you usually get a different gradient if you assume x is a function of y), the gradient of the line is the sum of the products of each x value less its average and the corresponding y value less its average, all divided by the sum of the squares of the x values less their average. Or as a formula (the clumsiness of the preceding sentence is why we use formulae):

$$\text{gradient} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
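As a rough check on the gradient calculation described above, here is a minimal Python sketch. The function name `gradient` and the five data points are my own inventions for illustration and are not taken from any study. It applies the y on x calculation, then swaps the roles of x and y to show that the two assumptions generally give different lines.

```python
def gradient(xs, ys):
    """Least-squares gradient of the regression line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the two averages.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of the x values from their average.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Made-up illustrative data (an assumption, not real observations).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

b_y_on_x = gradient(xs, ys)  # y treated as a function of x
b_x_on_y = gradient(ys, xs)  # x treated as a function of y

# Drawn on the same axes (y against x), the x on y line has gradient 1 / b_x_on_y.
print("y on x gradient:", b_y_on_x)
print("x on y line, redrawn against x:", 1 / b_x_on_y)
```

For these made-up points the y on x gradient works out to 0.8, while the x on y line, redrawn on the same axes, has a gradient of 1.25. Same dots, two different "best" lines, depending only on which variable you decided was the dependent one.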

Now let’s focus again on the graphs in the Proto and Rustichini article (the second graph has excluded Brussels and Paris, on the basis that they are both very rich and very miserable) and their regression-generated lines of best fit.

If we look long enough at these graphs we can almost persuade ourselves that the formula-driven trend line shown (not a linear one this time) actually represents some feature of the data. But could you draw it yourself? And, if you did, would it look anything like the formula-generated one? If your answer to either of these questions is no, there is a possibility that the feature identified by Proto and Rustichini would be entirely absent from your trend line. The formula will *always* give you some sort of result. The trick is identifying when it is rubbish.

As an illustration of this, I constructed a graph where I was confident there was absolutely no correlation between the two things, and then set Excel’s regression tools to work on it.

As you can see, none of the options, starting with the linear regression we discussed earlier and getting more complicated, results in the kind of #DIV/0! and #N/A messages we get to see regularly elsewhere in Excel. With the polynomial option set to a quintic, Excel is quite prepared to construct a best fit polynomial of order 5 (it has a fifth power in it; this is the purple wavy curve) to my array of dots. These lines and curves are merely the inevitable result of the mechanistic application of formulae that in this case have no meaning.
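The same experiment can be sketched outside Excel. What follows is not the spreadsheet I used, just a minimal Python illustration under stated assumptions: the dots are 50 pairs of independent random numbers from a fixed seed, and `polyfit` and `r_squared` are hypothetical helper names of my own that solve the ordinary least-squares normal equations directly. The point is only that the arithmetic returns a full set of coefficients for a straight line and for a quintic without complaint.

```python
import random


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, constant term first."""
    m = degree + 1
    # Augmented normal equations [A^T A | A^T y], where A has columns 1, x, x^2, ...
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))]
           for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - tail) / aug[i][i]
    return coeffs


def r_squared(xs, ys, coeffs):
    """Share of the variation in ys 'explained' by the fitted polynomial."""
    y_bar = sum(ys) / len(ys)
    fitted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Assumption: independent uniform random draws stand in for "no correlation".
rng = random.Random(0)
xs = [rng.random() for _ in range(50)]
ys = [rng.random() for _ in range(50)]

linear = polyfit(xs, ys, 1)   # 2 coefficients
quintic = polyfit(xs, ys, 5)  # 6 coefficients
r2_linear = r_squared(xs, ys, linear)
r2_quintic = r_squared(xs, ys, quintic)

print("linear coefficients:", linear, "R^2:", r2_linear)
print("quintic coefficients:", quintic, "R^2:", r2_quintic)
```

With unrelated dots one would expect both R² values to be low, though the exact figures depend on the seed. Note also that the quintic can never fit worse than the straight line on the same data, because a straight line is just a quintic with four of its coefficients set to zero. A more elaborate curve hugging the dots a little more closely is therefore no evidence that the curve means anything.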

There may be nothing biased about the mathematics but, as Bernard says in *Yes Minister* when questioned by Jim Hacker about the impartiality of an enquiry: “Railway trains are impartial too, but if you lay down the lines for them that’s the way they go.”

Many economic research papers contain graphs which are similarly afflicted.