robust linear regression

Simple least-squares linear regression gives extreme weight to outliers, and that is distorting my calculations of beta and alpha. Alpha is, for me, the best way of judging simulation returns. But even with a weekly rebalance, a range of ten years, and a hundred stocks in my portfolio, one single stock that returns 1600% in a week (and this happens on occasion) will throw everything off.

To arrive at better measurements, I was considering two alternatives to simple linear regression: Theil-Sen regression and trimming outliers before doing least-squares regression. (Another option might be least-absolute-deviation regression rather than least squares.) From what I’ve read, there are two different ways of finding the intercept using Theil-Sen regression, so that gives me pause. If I were to trim, I could discard values more than 1.5 times the interquartile range below the first quartile or above the third quartile, or I could rely on the conventional z-score (difference from the mean divided by the standard deviation). I’m leaning toward Theil-Sen, as I worry that trimming outliers is a rather subjective approach, better suited to cases where the data is inaccurate than to cases where the data is accurate but skews the results.
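For concreteness, here is roughly what I mean by the two trimming rules, as a minimal numpy sketch (the weekly returns below are made up, the 1.5×IQR and z = 3 cutoffs are just the conventional choices, and for a regression you would drop the corresponding (x, y) pairs rather than single values):

```python
import numpy as np

def trim_iqr(x, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]

def trim_zscore(x, z_max=3.0):
    """Keep values whose z-score (distance from the mean in standard deviations) is at most z_max."""
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) <= z_max]

# Made-up weekly excess returns with one extreme week (the "1600%" case)
returns = np.array([0.01, -0.02, 0.03, 0.00, -0.01, 0.02, 16.0])
print(trim_iqr(returns))     # the 16.0 week is dropped
print(trim_zscore(returns))  # the outlier inflates the standard deviation, so here it survives the z-score cut
```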

Would anyone who knows more about stats than I do care to offer their two cents?

+1 on median absolute error (MAD) over the mean absolute error (|Error|) because: a) it is easy; and b) “the population geometric mean is the population median for the lognormal distribution” (ref: http://www.tandfonline.com/doi/abs/10.1080/00949658708811013?journalCode=gscs20). This artifact of measurement strongly favors the use of Theil-Sen estimators. However, it is problematic to compare such an estimate with normative metrics for dispersion, unless, of course, you define the median absolute deviation. Also problematic: annualizing either of these.
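For reference, the median absolute deviation is trivial to compute; the 1.4826 factor below is the usual constant that makes it comparable to a standard deviation under normality (a quick numpy sketch, nothing specific to the estimators above):

```python
import numpy as np

def mad(x, scale=1.4826):
    """Median absolute deviation; scale = 1.4826 makes it consistent with the standard deviation for normal data."""
    return scale * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(0)
x = rng.normal(0, 2, 10_000)
print(mad(x), x.std())   # both close to 2 for normal data; they diverge once outliers show up
```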

On the other hand, the sampling properties of squared errors (E^2) are much more commonly studied and understood. Also, most cases of ordinary least squares (OLS) regression converge with the sampling moments inferred by maximum-likelihood estimators (MLEs). If you stop to think about it, it’s quite amazing that almost the entire field of normative statistics is built around the expectations of taking the standard error (SE) of the difference between two data points. Why did statistics evolve this way? Is SE an artifice built around convenience, or does it reflect something fundamental about the laws of probability? Are there better ways to do things? Moreover, does any of this really matter?

I gave up on trying to answer these questions because it is almost impossible to know whether the convenient answers provided by SE are a result of natural phenomena or a measurement bias (i.e., lens) which has been shaped by cultural indoctrination.

Anyway, apologies for waxing existential. Also worth checking out is the class of robust estimators known as M-estimators (OLS is a special case): https://en.wikipedia.org/wiki/M-estimator

Now that I’ve tried Theil-Sen estimation a little bit, I really like it. It seems to be much more accurate than OLS, and you don’t get thrown by the outliers, which were making my results too rosy for my liking. In the illustration below, the blue line is OLS for the 15 blue points, and the red line is the Theil-Sen estimation for those same points.


Very interesting, Yuval.

Do you know how you would/could implement such an estimator using native P123 syntax? I can only conceptualize doing so with nested loops (which are not native), but I tend to lack creativity.

Interesting…

I have a question. If you removed the outlier and performed OLS on the remaining 14 points, would the coefficients come close to the Theil-Sen estimation?

There are tests (e.g., Cook’s distance) that can be used to determine which points are outliers. But there is a debate (with no real conclusion) as to whether removing outliers creates an invalid regression/projection. I would agree with your conclusion that Theil-Sen is preferable, as it eliminates having to determine whether an outlier can be discarded.

No, I’m doing it in Excel, which is what I use all the time. I download the data and paste it into my spreadsheets. This is cumbersome. The Theil Sen estimate would be easier in R, but I haven’t taught myself that yet.

It’s such an elegant little formula that I wonder why it’s not used more often. For those who don’t know it: you take every possible pair of points, draw straight lines between them, and find the median slope. To get the intercept, you draw lines with that slope through every point and take the median of all the intercepts. Algebraically, the slope is the median of (y(i)-y(j))/(x(i)-x(j)), and the intercept is the median of y(i)-beta*x(i).
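For anyone who wants to check their spreadsheet against something, here is a minimal Python sketch of that formula; scipy also has a built-in version (scipy.stats.theilslopes), though note that by default it defines the intercept differently. The returns below are made up.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def theil_sen(x, y):
    """Median of all pairwise slopes, then the median of the implied intercepts."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j]]
    beta = np.median(slopes)
    alpha = np.median(y - beta * x)   # intercept: median of y(i) - beta * x(i)
    return beta, alpha

# Made-up benchmark (x) and portfolio (y) weekly returns, with one 1600% week
x = np.array([-0.03, -0.01, 0.00, 0.01, 0.02, 0.04, 0.05])
y = np.array([-0.02, -0.01, 0.01, 0.02, 0.03, 0.05, 16.00])

print(theil_sen(x, y))
ts_slope, ts_intercept, _, _ = stats.theilslopes(y, x)
print(ts_slope, ts_intercept)   # scipy's default intercept is median(y) - slope * median(x)
print(np.polyfit(x, y, 1))      # OLS [slope, intercept], dragged by the outlier
```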

simondul -

To answer your question, no. The OLS slope then drops to 0.6392 and the intercept goes down to 0.0306. So it’s not really any closer to the Theil-Sen estimation than it was; it’s just off in a different direction. On the other hand, if you remove that point, the Theil-Sen estimate changes too, to 0.6629x + 0.033. A less dramatic change.

Yuval,

I have little experience with some of your tests.

When I can, I use OLS or standard linear regression. ANOVA when things are not linear or when using nominal independent variables. A t-test when there is a large difference in the variances. Nonparametric tests when the distribution is not normal. And if I cannot meet the assumption of independence or stationarity, I do nothing.

I keep to the basic stuff.

To your main point about outliers: I just remove the very worst outliers (as few as possible) in a non-scientific way. I have played with Cook’s Distance. It highlights the points that have a lot of “leverage” and a large residual. An example would be an outlier far to the right on the x-axis. Being far to the right on the axis gives it a lot of “leverage.” Other points may have a large residual but do not “tilt” or move the line of best fit as much, so keeping or removing them has less effect.
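Here is a rough sketch of how the distance puts the residual and the leverage together, in Python rather than the R I actually use; the points and the 4/n cutoff are just illustrative:

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for simple OLS: combines each squared residual with its leverage h_ii."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
    resid = y - X @ beta
    n, p = X.shape
    mse = resid @ resid / (n - p)
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage = diagonal of the hat matrix
    return resid**2 / (p * mse) * h / (1 - h)**2

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])      # last point sits far to the right (lots of leverage)
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 40.0])      # and is well off the trend of the other points
d = cooks_distance(x, y)
print(d)
print(np.where(d > 4 / len(x))[0])                 # a common rule-of-thumb cutoff flags only that point
```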

My personal problem with this method is that I have a tendency to keep removing points until I get what I want. R gives me new points to remove each time I run it. I will need to learn a formal way to use Cook’s distance that avoids this problem. There probably is a formal way, but I may need to just remove one or two points the first time and declare it done. In any case, I am not using it seriously yet.

Cook’s Distance should probably aid a person in removing fewer outliers (only the ones with a lot of leverage); it is not intended to aid me in overfitting, getting a good-looking scatterplot, or a high R value. That is the opposite of my natural tendency. But, obviously, removing some spurious outliers can be helpful in getting a model that may perform better going forward; if it is an art, I have not perfected it yet.

Attached Cook’s distance from R.

So I might remove the point labeled “2” in this attachment as it has both a large residual and a lot of leverage.

Just what I am doing for now.

-Jim


I did a correlation study to see which better correlates with OOS results, alpha calculated by OLS methods or calculated by Theil Sen estimation, and OLS won. (If you want to know my methodology, I invent thirty different ranking systems, find the alpha of the 100 top stocks with weekly rebalancing and some rank tolerance over 9 staggered 8-year periods, and correlate the results with the returns of the top 20 stocks over the subsequent 3-year periods; I then check this with rank correlation.)

One thing I noticed was that when you calculate alpha by Theil Sen estimation, the result is quite close to the median of the excess returns. So I tested that too, and the result was better than the Theil Sen alpha and just about as good as OLS alpha. (I also tested alpha divided by standard deviation for good measure, and the results were worse than the other three measures.) And it’s a lot easier to calculate.
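To make concrete what the three numbers are, here is a toy sketch with made-up weekly returns (not my actual P123 data); “excess return” here is simply portfolio return minus benchmark return:

```python
import numpy as np
from scipy import stats

# Made-up weekly benchmark and portfolio returns -- placeholders, not P123 output
rng = np.random.default_rng(1)
bench = rng.normal(0.002, 0.02, 520)                       # roughly ten years of weeks
port = 0.003 + 1.1 * bench + rng.normal(0, 0.01, 520)      # portfolio tracks the benchmark...
port[100] = 16.0                                           # ...except for one 1600% week

ols_beta, ols_alpha = np.polyfit(bench, port, 1)           # OLS beta and alpha (intercept)
ts_beta, ts_alpha, _, _ = stats.theilslopes(port, bench)   # Theil-Sen beta and alpha
median_excess = np.median(port - bench)                    # median of the weekly excess returns

print(ols_alpha, ts_alpha, median_excess)
```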

My conclusion is that Theil Sen estimation may be better at calculating slope than OLS methods, but its calculation of the intercept is weaker, at least for my purposes. Historically, the focus of Theil Sen estimation has always been slope; intercept is more or less an afterthought, and there are at least three different ways to calculate it.

So in the future I’m going to be looking at not only the OLS alpha of my results but also the median of the excess returns, something that would never have occurred to me had I not investigated Theil-Sen estimation.

It’s a bit of a relief, in a way. Setting up Theil Sen estimation in Excel isn’t easy, and the files are HUGE. But the investigation has been great fun.

One disadvantage of Theil-Sen estimation compared with OLS regression is that OLS makes a distinction between the X values (observed) and the Y values (predicted) while Theil-Sen estimation treats them the same. In other words, OLS minimizes the vertical distance between the points and the line while Theil-Sen estimation minimizes the diagonal distance. It makes more sense to minimize the vertical distance, since the X values remain constant from one observation to another: only the Y values change, so those should be the basis for comparison.

I’m going to try using LAD (least absolute deviation) next, but it’s much less efficient. A correlation study will take a lot longer.
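For what it’s worth, LAD can be set up as median (0.5-quantile) regression, so a rough sketch with made-up returns might look like this (using statsmodels here, which is not how I’ll be doing it in Excel):

```python
import numpy as np
import statsmodels.api as sm

# Made-up weekly benchmark (x) and portfolio (y) returns with a few extreme weeks
rng = np.random.default_rng(0)
x = rng.normal(0.002, 0.02, 200)
y = 0.004 + 1.2 * x + rng.normal(0, 0.01, 200)
y[:3] = [5.0, 8.0, 16.0]                      # a handful of huge outlier weeks

X = sm.add_constant(x)                        # design matrix with an intercept column
lad = sm.QuantReg(y, X).fit(q=0.5)            # median (0.5-quantile) regression == least absolute deviation
ols = sm.OLS(y, X).fit()

print(lad.params)   # [intercept, slope] -- barely moved by the outlier weeks
print(ols.params)   # pulled around by them
```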

Good point! I think this is true of most alternative statistics. I was going to say that in my first post, but I am not quite knowledgeable enough to defend it with a mathematical proof, or even quantify it, if I were questioned. I just know (or think) this is absolutely true.

And this is true about statistics that use the median, too, I think. And one other thing that is intuitively true: the mean means more (no pun intended). If I know the median price of new homes and the number of new homes sold in a year, I do not know much about the total paid for new homes that year, and really just cannot calculate it. With additional information about the distribution of prices I could finally get it, sort of. With the mean and the number of homes sold, I get it with one simple multiplication.
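A tiny made-up example of the point:

```python
# Hypothetical new-home prices (in $000s) -- made-up numbers
prices = [210, 250, 260, 300, 1900]            # one very expensive home skews the mean upward

n = len(prices)
mean, median = sum(prices) / n, sorted(prices)[n // 2]
print(mean * n)      # 2920.0: exactly the total spent -- the mean recovers it with one multiplication
print(median * n)    # 1300: the median tells you almost nothing about the total
```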

I get why people who work for HUD and social justice warriors are so concerned about Median Housing Prices.

But this goes double with, say, slippage or trading costs. Why would I even care about the median slippage of my trades?

Mainly just saying: good point.

-Jim

Check out the diagram below, Jim. For the blue diamonds, the blue line is OLS regression, the red line is LAD regression. The x-intercept for OLS regression is 8; the x-intercept for LAD regression is 0; the mean excess return is 0; and the median excess return is 4.

Now do you see why I like using the median excess return?


OLS vs LAD final.png

Right.

I am not sure what you conclude from that (there are a lot of good points there).

Do you want 0 for your alpha calculations?

I will have to think about it and/or look it up but I think it is possible you will be getting 0 a lot with your LAD. I could be wrong on that.

-Jim

deleted - see below.

Cool.

Do you like that better? I have no basis for judging this. (intended for post below)

-Jim

I made a mistake with the LAD line. Here’s the correct version. The LAD intercept is 5.97 (according to computer calculations using successive approximations) but should actually be 6.0.


OLS vs LAD final 2.png

Cool.

Do you like that better? I have no basis for judging this. If you do, you should keep doing it that way.

At this point I can just support the point you made in your post: you can get an answer with fewer data points.

-Jim

Just for kicks I added the Theil Sen regression line too, in green. The intercept is 5.5.

For what it’s worth, it seems to me, intuitively, that the correct intercept should be 4. That’s what I would predict given this data set. And that’s what I get using the median excess return rather than any of the linear regression options.

Now, obviously, linear regression shouldn’t be applied to data that looks like this in the first place. But I just wanted to make a point about using medians.


OLS vs LAD final 3.png

I see what you are saying. I would just have said to myself that I should not be doing an OLS on this. And actually, a mean or median of 5.5 looks pretty good.

I defer to you on why you did a linear regression to get this (in this hypothetical example).

But it seems clear why you should not be doing an OLS. Is it linear? Is it from a normally distributed population (the data in this sample does not look normally distributed)? If not, are there enough data points to satisfy the central limit theorem? Constant variance? And, edited for David’s comment below: stationary?

I am not sure about some of these assumptions on these hypothetical points. But I can see why you might need to use something else. I see your point and agree with it.

I would only add that, with a large amount of data that takes advantage of the central limit theorem, I try to make the other assumptions true when I can.

-Jim

I wanted to play with Theil-Sen in order to implement a more robust regression in some DCF models I use in P123. For example, the gross margins in some industries can be incredibly noisy because impairments are highest when revenues are lowest. Therefore, most regression techniques are biased to the upside, with some even going asymptotic. I was hoping to get my head around some better, more robust estimators that can be implemented using P123 syntax.
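To make the use case concrete, here is a rough sketch of the kind of thing I have in mind, with made-up margin numbers (Python/scipy here, not P123 syntax):

```python
import numpy as np
from scipy import stats

# Hypothetical annual gross margins (%) for a cyclical industry -- made-up numbers
years = np.arange(2008, 2018).astype(float)
margins = np.array([32.0, 12.0, 28.0, 30.0, 31.0, 9.0, 29.0, 33.0, 30.0, 34.0])  # impairment years crater the margin

ts_slope, ts_intercept, lo, hi = stats.theilslopes(margins, years)   # 10 points -> 45 pairwise slopes (n*(n-1)/2)
ols_slope, ols_intercept = np.polyfit(years, margins, 1)

print(ts_slope, (lo, hi))   # pairwise-median trend, plus a confidence band on the slope
print(ols_slope)            # OLS trend, distorted by the impairment years
```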

Based on your discussion, I anecdotally think that Theil-Sen might not be well suited to time series and/or non-stationary processes. However, if you compare it to something in which the “x” and “y” are concurrent and/or stationary, then the pairwise sampling should increase the statistical significance by increasing the effective sample size to n*(n-1)/2. This is in effect re-sampling, or in another’s parlance, “bootstrapping.”

Am I interpreting this right?