Ranking Annualized Returns Vs Performance

I’ve come looking for opinions on evaluating ranking data.

In the past I have almost exclusively used Annualized Returns to evaluate ranking weights and factors. However, I recently dove back into looking more closely at Performance because I came across an interesting scenario. When I use my standard weighting for bucket evaluation, which takes into account things like correlation weight, slope, R-squared, and of course bucket return, I get a total score that shows there is some level of correlation between the buckets and the ranks. Nothing new here, and I’m sure lots of us do it this way.
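(Purely for illustration, here is a hypothetical sketch of what a composite bucket score along those lines might look like. The component choices and weights below are made up, not the actual formula:)

```python
import numpy as np
from scipy import stats

def bucket_score(bucket_returns, w_corr=0.4, w_slope=0.3, w_r2=0.2, w_top=0.1):
    """Hypothetical composite score for a rank-performance bucket test.

    bucket_returns: annualized return per bucket, ordered from worst rank to best.
    The weights and components are illustrative placeholders only.
    """
    bucket_returns = np.asarray(bucket_returns, dtype=float)
    ranks = np.arange(1, len(bucket_returns) + 1)
    fit = stats.linregress(ranks, bucket_returns)               # slope and R^2 of the bucket trend
    rho = stats.spearmanr(ranks, bucket_returns).correlation    # rank correlation of buckets
    # In practice each component would be normalized to a comparable scale first.
    return (w_corr * rho
            + w_slope * fit.slope
            + w_r2 * fit.rvalue ** 2
            + w_top * bucket_returns[-1])
```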

However, this time I was checking the performance of two scores that were very close even though their factor weights were very different, and what I saw was the following:

Weight A scored high because the top bucket performed very well.
Weight B scored high because the overall performance showed some level of graduated progression with rank.

Normally in these cases I go with Weight B, because many times I’ve run into situations where a top-bucket-only score is due to a few lucky holdings, and that’s apparent in the Performance graphs with poor R^2 values.

In this case R^2 is better and the return value is much greater (~4x).

So which weighting would you choose, A or B, and why?


I think the answer to that depends a lot on your universe size. If it’s small (500 stocks or fewer), I’d probably go with B. If it’s large (over 1,000), I’d probably go with weight A. Notice that the outperformance of weight B over the other buckets only really happens in the last four years. Also notice that if you look at the top bucket out of ten, rather than out of twenty, the performance of the top bucket is almost exactly the same.

The universe is small in this case. At the moment it’s about 80 stocks, but that’s a sliding scale, which is why there was nothing in the universe in 2008 (flat line).

I set up the bucket size to be 4-5 stocks on average since that was the target diversification number I wanted for this port. However, your point is valid nonetheless.

This is only a few value factors that I stopped at, not an entire ranking system, so the weighting of this would be part of a bigger picture. I am leaning towards B, but it’s hard not to look at A since that alone was enough for me to make a very nice sim of 3-5 stocks. I like the idea of making small ports that are target-focused (5 ports of 5 stks vs 1 port of 25 stks).

Hi Tony,

First let me say you are doing some VERY interesting stuff.

Short answer for what I would do: use weight B.

Can you use a non-stationary regression like that to get valid statistics (R^2 here)? However you decide to answer that, I personally would stick with a scatterplot of THE RANK PERFORMANCE BUCKETS. You can probably regress this scatterplot, or do a LOESS fit of it, or a nonlinear regression as you have done. A Spearman’s rank correlation or Kendall’s Tau of the rank performance buckets may also work.

Yuval likes Kendall’s Tau I think and he is probably in a good position to compare Kendall’s Tau rank correlation and Spearman’s rank correlation.

I think the significance of Weight B will be much greater with any of these techniques.

I think that Weight A will lose any statistical significance with any of the techniques I have outlined. It looks random except for the last bucket. It will certainly be less significant (I think).

A little more detail. The rank performance BUCKETS are differenced and probably stationary, which is often desirable, if not necessary, for any statistical tests.

I am not sure of the original source of all of P123’s techniques, but there are good reasons for the way the rank performance is done, including making non-stationary data stationary by differencing.

If you rebalance weekly, then the buckets are an aggregation of weekly returns (ignoring the annualization). Cutting the series into weekly returns is the differencing, which removes the trend.
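(To illustrate the point with simulated data and the statsmodels package, nothing P123-specific: a trending price series fails an augmented Dickey-Fuller stationarity test, while its differenced returns pass.)

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)

# Simulated log-price series: a random walk with drift (non-stationary, trending).
log_price = np.cumsum(0.002 + 0.02 * rng.standard_normal(1000))

# Weekly returns are just the first differences of log price -- the "differencing".
weekly_returns = np.diff(log_price)

print("ADF p-value, prices: ", adfuller(log_price)[1])       # high -> cannot call it stationary
print("ADF p-value, returns:", adfuller(weekly_returns)[1])  # ~0   -> stationary
```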

I do not want to sound like a broken record, but de Prado discussed the need for (as well as the frustration of) making things stationary in his book Advances in Financial Machine Learning. He hates it and has some LIMITED workarounds, as he thinks it is necessary.

Hope this helps some.

-Jim

Very interesting, Tony! I have essentially ignored the performance report capability of rank testing, but your example gives me valid reasons to use it more. Thanks! And the fact that your analysis essentially made use of market timing (via the custom universe being empty) showed me that I could add a market timer to a custom universe for the purpose of including it in a ranking test. That is a capability I had not considered! I have only used timing in buy and sell rules, so my rank performance tests to date have not included them. Much to consider!

I would personally be very hesitant to rely on such small buckets, but I don’t create models with so few stocks. With very few in each bucket, I would much prefer to see the bucket relationship shown with B. Mentally at least, that would make me feel that the factor relationships are mapped with more significance.

I would also repeat your analysis beginning around 1/1/2010, since it appears to me that A outperforms B prior to that but that the comparison will likely be much closer starting then, and perhaps even give B an edge. If so, you should consider how that might relate to future results. But that’s just me.

B

I would use B. However this entire exercise is not very useful if your universe is indeed that small. That would make each bucket <2 stocks. This is either extremely curve fitted or just random.

I agree with Nisser, for what it’s worth. - Yuval

Based on the looks of the histograms, most of my models end up looking more like B than A. I don’t know if it’s just a result of my build process, but I’m suspicious of a spike that only works on the top 5%, and prefer a monotonically increasing slope if I can get it. I’ve seen some posts by other designers for systems that do have a notable spike at the top 5%, so it’s possible, perhaps likely, that there are some tricks I’m not aware of, so I’m open minded to both possibilities. My experience leads me to expect results that look more like B, though.

Like some of the above, I’d be hesitant to use a model if the universe is 80 stocks or even sometimes less. I have something I use for utilities that is a small universe, 73 stocks presently, and I do currently select investments from that system, but in that case I test more like the top 10-15% to give myself a chance. Building that model felt “iffy” to me because even the decile plots resulted in buckets representing maybe 7-8 stocks. I didn’t know how else to work on it though, other than looking at many different time periods and trying to make sure there was some consistency over windows of time, and that it wasn’t just selecting for characteristics that happened to match a few hot stocks that drove the entire performance.

Why does the universe size matter? I could have a uni of 2 stocks or ETFs, and as long as I create a ranking system that places the highest reward/risk at the top, then that CAN be enough. If 80 stocks isn’t enough, then what is the magic number and why? Also, I’m not sure how you come to a number of <2 when I have 80 stocks divided by 20 buckets. However, I will say that I was wrong by saying 80. After further analysis over the 20-year period there was an average of 55 stocks held in the uni. 55/20 = ~3, so I changed my bucket number to 11 to reflect the average and get myself back to a 5-stock average bucket size. This is the result of looking at how a few factors impact each other; it’s nowhere near a final ranking system. There was no curve fitting, but the high return % with such a good R^2 is what caught my attention and made me stop and think.

Don’t get me wrong, I do appreciate a larger sample size, but I don’t like to lose focus on the objective of what I am ranking.

Hi Jim,

Kendall and Spearman are similar, but I don’t see how either really helps me make a correlation. If this were a linear regression then OK, but compound growth is exponential, so I use exponential regression. If I’m wrong, then maybe you can teach me something about how to use those methods of correlation in this environment. For me, exponential regression and exponential slope are what I look at (hence the log-lin scale).
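(For reference, a minimal sketch of the kind of exponential, log-lin fit being described, run on a simulated weekly equity curve; the growth rate and noise below are made up:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated weekly equity curve with ~15% annualized growth plus random-walk noise.
weeks = np.arange(520)
equity = 100 * np.exp(0.15 / 52 * weeks + np.cumsum(0.02 * rng.standard_normal(520)))

# Exponential regression = linear regression on the log of the equity curve (log-lin).
fit = stats.linregress(weeks, np.log(equity))
annualized_growth = np.exp(fit.slope * 52) - 1

print(f"fitted annualized growth ~ {annualized_growth:.1%}, R^2 = {fit.rvalue**2:.4f}")
```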

If I were to use a scatter plot, then which two axes would you use?

I do agree that looking at non-stationary data is dangerous, but it works both ways. Taking a high-return, high-R^2 plot is just as dangerous as taking a well-correlated bar graph. Both could be subject to random outcomes. Hence why I wanted to look at it both ways and see if I should “trust” the data enough to move on.

So after running further testing with time frames from 2010 → present and changing bucket sizing I no longer have confidence in either. Back to the drawing board as they say.

Nevertheless, I can’t help but want to make a screen of the top bucket’s high return and R^2 and keep an eye on that. Who knows, maybe the magic to this uni is only based on a couple of factors. There’s no rule that says money flow is linear, and I think that’s an elephant in the room that gets forgotten about sometimes. Meaning we like seeing progressive bucket growth, but I think it becomes less and less relevant the smaller the universe. Institutions aren’t necessarily weighting a uni for investment the same way that I am, so why should I expect the growth to be linear or progressive? It could be that investment firms are primarily interested in the companies that display a few key factors as the leaders and put the majority of their money into those. If my uni has a few of those companies and I stumble upon a statistical outlier that historically makes consistent growth, then does that mean it has merit or doesn’t have merit?

On the topic of curve fitting, we all do it to some degree. Nobody decides to invest money in something that hasn’t been shown to perform positively in the past. History is all we have, and we base the future on it. Beyond that, we typically think of curve fitting as finding the most optimal scenario for the greatest AR%. That’s why I think exponential R^2 is so interesting. If it’s been so consistent in the past, there is more tendency for it to be consistent in the future, and it really doesn’t matter if it’s 2 stocks or 200 stocks. A port is only part of a book.

For the same reason people do experiments with thousands of people and not hundreds. With such small numbers you leave yourself really vulnerable to random chance. What do the histograms look like with 5 buckets? I bet they’d look very unimpressive.
I think you may have something useful in your system, as there’s a clear upward trend, but there is no way I would expect future returns in the top 3-5 stocks to be as good as in that histogram. You can probably expect some alpha if you were to buy the top 20% of stocks, though.

Randomness typically has poor R^2 even with high AR, but a case can be made that it’s possible. For instance, it’s possible that the factors would have always held AMZN in the top bucket the entire time and it was the reason for the constant growth. Then I question: was that luck or design? So what is a reasonable number per bucket and why?

Putting the most contributing stock on a restricted list is something I’ve explored, and there’s some merit to it. However, when you do that you are removing the money flow that actually existed as well as the influence it has within the group. Meaning if your system predicts the winners of a group and leaves the dogs … then what are you left with when you remove the winners? Again, this is less of an issue in a bigger uni, but that doesn’t mean your system is flawed.

This is all hypothetical of course, but I think there’s some validity in exploring different ways to test the data that we have available. Ranking is about comparing, and I don’t compare apples to oranges when I buy apples, so why do I compare apples to oranges when I buy stocks? My method of picking a good apple may not have any correlation with picking a good orange, or worse, it may have a negative correlation. What I end up with could be a basket of really good apples and really bad oranges.

Hi Tony,

First, there are lots of good ways to do things, so I will not try to sell you on MY way. In fact, I am most likely to use cross-validation nowadays (none of the above methods).

I fully agree about the problems of nonlinearity in what we do. Kendall and Spearman are most useful for finding nonlinear (monotonic) correlation and are nonparametric methods. I mention these methods because I agree with you about the problems of linearity…

Cool method! My only recommendation would be to keep that in your tool box.

So I have (in the past) downloaded the rank performance BUCKETS and put them in an Excel spreadsheet. The scatterplot would be x-axis = rank and y-axis = return for the bucket corresponding to the rank on the x-axis.

And I would just run a Spearman or a Kendall on this: the x-axis against the y-axis. Again, this is nonlinear. But like I said, use your own method.
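(A minimal sketch of that bucket scatterplot test using scipy; the bucket returns below are made-up numbers standing in for a downloaded rank-performance report:)

```python
from scipy import stats

# Made-up annualized returns (%) for 20 rank buckets, worst rank first, best last.
bucket_returns = [2.1, 3.0, 1.8, 4.2, 3.9, 5.1, 4.8, 6.0, 5.5, 7.2,
                  6.9, 8.1, 7.7, 9.0, 8.6, 10.2, 9.8, 11.5, 12.1, 13.4]
ranks = list(range(1, len(bucket_returns) + 1))

rho, p_rho = stats.spearmanr(ranks, bucket_returns)   # Spearman rank correlation
tau, p_tau = stats.kendalltau(ranks, bucket_returns)  # Kendall's Tau

print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.4f})")
```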

I ACTUALLY THINK MARC’S METHOD OF DOING STATISTICS ON THE TOP BUCKET MINUS THE BOTTOM BUCKET (USUALLY FOR 5 BUCKETS) IS ONE OF THE MOST USED AND ACCEPTED STATISTICAL METHODS. I WILL LET MARC EXPAND ON THAT.

Marc is holding out (sandbagging us) and knows more about machine learning and statistics than most.

My only concern with what you have done is the R^2 you obtain. THAT IS EXTREMELY HIGH. WOW!!!

For sure the R^2 (the coefficient of determination) does not mean what you and I would hope it means. Wouldn’t it be nice if R^2 = 0.9481 meant 94.81% of the market activity (variance) is accounted for by your system!!!

So what does it mean? I believe that some of the high correlation is due to “spurious correlation” caused by the non-stationary nature of your time series data. You could run a RANDOM Monte Carlo simulation WITH TREND and get a high correlation. This would be an example of “spurious correlation.”
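(A quick way to see that effect for yourself: regress one trending random walk on another, independent, trending random walk. A sketch with simulated data only:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two INDEPENDENT random walks, each with an upward drift (non-stationary, trending).
a = np.cumsum(0.05 + rng.standard_normal(1000))
b = np.cumsum(0.05 + rng.standard_normal(1000))

# Levels: both series trend up, so the regression R^2 is misleadingly high.
r_levels = stats.linregress(a, b).rvalue
# Differences (stationary): the true relationship -- essentially none -- shows up.
r_diffs = stats.linregress(np.diff(a), np.diff(b)).rvalue

print(f"R^2 on levels: {r_levels**2:.3f}   R^2 on differences: {r_diffs**2:.4f}")
```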

Sadly, in my case, once data is made stationary I am happy when I get an R^2 of 0.0006, or R of 0.0245 (over all of the data for individual stocks). This is not as bad as it seems when one looks at the mean and variance of the top-ranked 5-10 stocks. But there is a lot of noise no matter how you look at it and I have given up hope on a ranking system that gives me a coefficient of determination of 94% plus—even in my dreams.

The only recommendation I would have would be to look into the stationary thing at your leisure. See what you think.

You may find the time series correlation (non stationary) R^2 useful. Cool: then use it. Just be clear on what it is telling you.

As I said you are doing some VERY cool stuff and thanks for sharing!!!

-Jim

Thanks for your input Jim,

I’m curious what you mean by Marc’s method of differencing the top and bottom buckets versus looking at the succession of them. I suppose this method is dependent on how many buckets you are using?

As for R^2, it actually holds my lowest weighting when evaluating histogram returns. However, I do like to use it as a comparative measure when looking at performance returns. I’ve learned that the histogram returns don’t always tell the whole story. This is along the lines of you referring to a result as stationary. Oh, wouldn’t that be the golden goose if we found a stationary relationship with time? However, I’ve yet to find that. Evaluating the covariance of a ranking system through time to find the most stationary relationship would take a long, long time if you were to do it on every combination of weights. At least that’s how I see it, but I’m not a statistician.

With all of the various ways to analyse the data, I always have to bring myself back to reality and remind myself what I am actually looking at and what conditions may have contributed to the result. I don’t believe that when creating a small “unique” universe you should entirely rely on linear distribution. To me, that should be saved for much larger universes where you have a higher signal-to-noise ratio, for lack of a better term.

Tony,

I think Marc uses “quintiles” or five buckets. I think he usually gives the difference in returns for the top minus the bottom bucket. I like his method and mention it for that reason. I welcome any corrections.

Here is a link to a well-known (peer-reviewed) paper that uses “deciles” or 10 buckets: HERE

He does generate a t-score with his method.

I will mention that his method probably gets around your (correct in my estimation) concerns about the shape of the distribution through the central limit theorem. He does not have the additional assumptions that are part of a linear regression with this method.

Probably one of the methods that could be used should any of us wish to publish in a peer-reviewed journal.
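(For anyone who wants to try it, a minimal sketch of that top-minus-bottom test in Python; the weekly bucket returns below are simulated placeholders, and scipy’s one-sample t-test supplies the t-score:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Placeholder weekly returns for the top and bottom quintile buckets (10 years).
top_bucket = 0.004 + 0.03 * rng.standard_normal(520)
bottom_bucket = 0.001 + 0.03 * rng.standard_normal(520)

# Week-by-week long-short spread: top bucket minus bottom bucket.
spread = top_bucket - bottom_bucket

# One-sample t-test of the null hypothesis that the mean spread is zero.
t_stat, p_value = stats.ttest_1samp(spread, 0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```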

-Jim

Here’s why universe size matters.

Let’s say I narrow down my universe to the stocks in the NYSE that begin with the letter N and create a ranking system that works well for those stocks over the last ten years.

Let’s say I then see if that ranking system works on the stocks in the NASDAQ that begin with the letter N. What do you think the chances are that it would?

I would guess it would be far less than 50%.

Now let’s say I create a ranking system that works on the Russell 3000 universe (excluding stocks in the NASDAQ that begin with the letter N). What do you think are the chances this ranking system will also work on stocks in the NASDAQ that do begin with the letter N?

I’m almost 100% certain it’ll work better than the first.

As Yuval says, size does matter.

I like the way the Bayesians think about this (just another way).

A typical Bayesian question is this: I have a coin. Is it a 2 headed coin or a fair coin (heads and tails)?

One reason I like this is you may have a PRIOR belief about the coin. Did you get the coin as change at the grocery store or did you buy it at the Magic Shop? Truly a frequent problem/exercise in the Bayesian texts. Another—more nuanced—question might be: Is a lung tumor cancer, or not cancer, in a 70 year old smoker?

This is like a question on a stock: is it guaranteed to go up or is it 50-50? You probably cannot help but have a prior belief on this too. Maybe the stock problem is like the coin or more nuanced like the lung cancer question—but the principle is the same.

So if you flip that coin 2 times and get heads both times, you might not be too sure. But if you got it at the grocery store, you should bet that it is a fair coin, as there are not too many 2-headed coins at the grocery store. This is despite the fact that 2 heads in a row is (weak) evidence that it is a 2-headed coin.

Flip it 1,000 times and get heads every time, and it is a 2-headed coin. It doesn’t matter where you got the coin. You cannot argue with the evidence here.

The size (of the sample) has an effect on your belief about the coin (or about a stock).
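(A small sketch of that update in code, with a made-up prior, say 1 coin in 1,000 from the grocery store is two-headed:)

```python
def posterior_two_headed(prior, n_heads_in_a_row):
    """Posterior probability the coin is two-headed after n heads in a row.

    Bayes' rule with two hypotheses:
      P(2-headed | data) = 1.0 * prior / (1.0 * prior + 0.5**n * (1 - prior))
    since a two-headed coin always shows heads and a fair coin does so with
    probability 0.5 per flip.
    """
    like_fair = 0.5 ** n_heads_in_a_row
    return prior / (prior + like_fair * (1 - prior))

# Made-up grocery-store prior: 1 coin in 1,000 is two-headed.
print(posterior_two_headed(0.001, 2))     # ~0.004 -- two heads barely moves the needle
print(posterior_two_headed(0.001, 1000))  # ~1.0   -- the evidence overwhelms the prior
```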

Marc and others have a lot of problems with the use of frequentist statistics for stocks. They have a point. A Bayesian worries less about predicting the future. Rather, a Bayesian just wants to know if their belief about the coin, or the stock, or the lung tumor is correct.

Marc is a good Bayesian Statistician, IMHO. He does not have to call himself that—he just does it well. Perhaps this is because Marc frequently uses conditionals (also a Bayesian tool) in a logical manner.

-Jim

That example isn’t what I was proposing. What if the universe consisted only of banks within a certain size constraint and with historical profitability that also paid a div? Hardly companies that start with N. There are many cases where you can narrow the focus of stocks to better compare an apple to an apple. I fail to see why that’s a bad idea.

Tony,

You are probably right. Just flip the coin enough times to get some support from the evidence (which you may have already done to your satisfaction).

In the real world it is like a 24-sided die (singular of dice), where a favorable die (strategy) has 13 good sides (days) and 11 bad sides (days), and most dice (strategies) have 12 good and 12 bad. It takes a good number of rolls of the die to find out if it has 12 or 13 sides in your favor. The 24-sided die with 13 good sides approximates 54:46 odds (54 good days and 46 bad), which is maybe what could be expected for a good strategy in the market. But most strategies will end up having 12 good and 12 bad, breaking even with the benchmark.

How many rolls of the die do you want before you place a bet on what kind of die it is? As many as you can get. But you can also calculate how many rolls of the die give you what kind of certainty.
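(For anyone who wants the arithmetic, here is a sketch of that calculation using a standard normal-approximation sample-size formula; the 95% confidence and 80% power settings are just illustrative choices:)

```python
from math import sqrt
from scipy.stats import norm

# How many rolls to tell a die with 13 good sides (p1) from a fair one (p0)?
p0, p1 = 12 / 24, 13 / 24      # break-even strategy vs. slightly favorable strategy
alpha, power = 0.05, 0.80      # one-sided significance level and desired power

z_a = norm.ppf(1 - alpha)
z_b = norm.ppf(power)
n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2

print(f"~{n:.0f} rolls needed")  # on the order of 900 rolls (trading days)
```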

We all have good stories about whether the true value of a stock is already baked into price, or not. But get some evidence too.

You do have evidence. Perhaps enough but size does matter.

-Jim