Better Than Random Stock Picks?

I have a Port that I have been trading since 8/5/13. It has done well but I want to know whether it has outperformed a random selection of stocks from the same universe.

So I ran the same port with a ranking system that selects stocks randomly. My port is all edited real trades. The port of randomly selected stocks uses no slippage or commissions and uses “Next Average of Hi and Low.” Results:

My Port. Total Return: 94.61% (AR 57.02%)

Random Selection: Total Return: 0.14% (AR 0.09%)

Good. My port managed to outperform this random selection of stocks. But was it my wonderful port or just dumb luck? After all, it is a 5 stock port that has been running for a relatively short time. Is there any statistical significance?

I wanted an answer to this, so I ran a paired t-test (two-tailed). This paired t-test matches the weekly returns of my port against the weekly returns of the port of randomly selected stocks. The data is all downloaded from P123’s Performance>Graphs>“Weekly returns since 08/05/13” into an Excel spreadsheet, and the statistics are computed by Excel.
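
For anyone who wants to reproduce this outside of Excel, here is a minimal sketch of the same paired t-test in Python with SciPy. The two return series below are randomly generated stand-ins; the real ones would come from the P123 weekly-returns download.

```python
# Sketch of the paired two-tailed t-test described above, using
# hypothetical weekly return series (the real numbers come from
# P123's weekly-returns download).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
port_weekly = rng.normal(0.010, 0.03, 78)    # my port: 78 weekly returns
random_weekly = rng.normal(0.000, 0.03, 78)  # random-selection port

# Paired (dependent) t-test on the matched weekly returns.
t_stat, p_value = stats.ttest_rel(port_weekly, random_weekly)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

The paired form tests the mean of the weekly *differences*, which is exactly what matching the two return series week by week in Excel computes.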

Briefly, the null hypothesis is that the population mean for the weekly returns of my port is the same as the population mean for the weekly returns of randomly selected stocks. That is the same as saying that following my port is really no different than picking stocks at random. Any apparent difference is just due to chance.

The alternative hypothesis is that picking stocks with my port is different from picking stocks at random even if it is just because of the negative effect of transaction costs.

Results: p = 0.16

There is a 16% chance of observing a difference this large (or larger) even if the population means are identical: the null hypothesis.

Based on just the out-of-sample performance, there is a 16% chance that using this port is no different than picking stocks at random!

Jim,

Very good performance. I would suggest running the random ranking system 100 times. It’s fairly fast. Then you can count how many times ‘random’ beat yours. I’d be curious if you get the same result.

Best,
Tom

Tom,

I did the same thing yesterday and got p = 0.14. I have also done this before with a slightly different null hypothesis: are my returns different from the benchmark? The results are a little more significant for this question, presumably because the benchmark has less variance. I forget the exact numbers, but it was significant for a one-tailed t-test. Pretty similar though.

Jim,
Use the optimizer. Enter the full date range 50 times per run without changing anything else. You can do it in 2 runs.
Best,
Tom

Will do, and I will get exact numbers on the comparison to the benchmark.

Tom,

So I ran 100 sims with my random ranking system. The AR ranged from -15.64% to 40.83%. None of the sims beat my port.

Of course, if I could run my real port 100 times there would be some variance there too. The paired t-test gives us insight into this variability.

I also took my port and compared the weekly returns to the SP500. So the null hypothesis would be: Investing in my port is no different than investing in the SP500 (through SPY say). Technically the null hypothesis is: the population means are the same.

I got p = 0.07 for the 2-tailed paired t-test.

So, just based on the out-of-sample returns there is a 7% chance that investing in my port is no different than putting my money into the SPY ETF.

Jim,

The outperformance in your ranking / system is likely ‘real.’ The 100 runs with the ‘random’ ranking are likely a more accurate assessment of the probabilities for a variety of reasons; for example, that approach makes no assumptions about the shape of the underlying distribution (the t-test assumes returns are normal, when they actually have fat tails). This doesn’t mean the port will continue to outperform, but it’s very unlikely the outperformance you have had was due to chance. It would be worthwhile to ‘decompose’ the actual source of outperformance, for example sector, market cap, or ‘style’ tilts.

Best,
Tom

Tom,

Thanks. It is looking pretty good for the port.

A few random thoughts:

There is survivorship bias. I did not bother to look at the ports I am no longer trading, and this is (probably) the best port that I am trading now.

T-tests can generally be done with n > 30, even if the population is not normally distributed, because of the central limit theorem. That is why I used weekly returns: I have 78 weeks of data.
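
Since this argument leans on the central limit theorem, here is an illustrative sketch (with an arbitrary skewed distribution standing in for stock returns) of why means of 78 i.i.d. draws look far more normal than the draws themselves:

```python
# Illustrative check of the central-limit-theorem argument: means of
# n = 78 draws from a heavily skewed distribution look far more
# normal (much lower skew) than the raw draws. Assumes i.i.d. draws;
# the lognormal here is an arbitrary stand-in, not real returns.
import numpy as np

rng = np.random.default_rng(1)
parent = rng.lognormal(0, 1, size=100_000)               # skewed "population"
means = rng.lognormal(0, 1, size=(10_000, 78)).mean(axis=1)

def skew(x):
    """Standardized third moment, as a rough normality gauge."""
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(f"parent skew: {skew(parent):.2f}, mean-of-78 skew: {skew(means):.2f}")
```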

I continue to think about what is meant by “population.” Notice I did not really define it, and I have yet to see it well defined in any books or on the net. There is a lot that can be said, but most of the problems can be simply stated: the only population that I really care about is the future returns of stocks. I do not know how to get a sample of this. Clearly, I do not even have a random sample of the past. Perhaps it is best termed a convenience sample, akin to taking a sample from your volleyball class in order to determine what the people at a university are majoring in. Many previous posts (including Oliver’s and Marc’s) have correctly pointed this out.

Problems of random sampling aside (it is probably a convenience sample), I think the comparison to the random selection of stocks is pretty correct. I would certainly like to be made aware of any other problems.

It may be conservative but I think the p = 0.07 for the comparison to the SP500 rings pretty true to me.

This is easy to do: maybe eight clicks of the mouse including the download into Excel. I think I will probably use it in the future.

Thank you for your ideas and comments.

Jim,

Rushed answer below:

I hear what you are saying, but I still think comparing to a bunch of actual samples from the population is closer to ‘true’.

And all of these tests still don’t tell you very much about what to expect the following year, which is likely what you care most about.

So, I still would focus more on building a diversified basket of conceptually very different systems (say a hedge overlay system, a bond system and whatever else).

Beyond that, I think doing 2 things can help: a) figuring out the total trading and slippage costs across various levels of market volatility, and b) plotting the dispersion in ‘hypothetical returns’ from the ‘random ranking’ results. Doing these can start to give you a better real sense of what levels of variance you can likely expect on a going-forward basis for the actual system out-of-sample. The dispersion is likely fairly high. Say the ranking system has an ‘average’ yearly return of 40% for the top 5% of stocks, but portfolios of this size and turnover have a yearly random ‘style drift’ of ±40%, and the fixed costs of trading are 10% a year; put those together and you start to see more clearly the range of likely annual outcomes.
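
Those back-of-envelope numbers can be combined directly; this tiny sketch just does the arithmetic (all three inputs are Tom’s illustrative figures, not measured values):

```python
# Back-of-envelope range of annual outcomes from the numbers above.
# All three inputs are illustrative figures from the discussion,
# not measured values.
avg_return = 0.40      # 'average' yearly return of the top-ranked bucket
style_drift = 0.40     # random yearly 'style drift', plus or minus
trading_costs = 0.10   # fixed yearly cost of trading the port

low = avg_return - style_drift - trading_costs
high = avg_return + style_drift - trading_costs
print(f"likely annual range: {low:+.0%} to {high:+.0%}")
```

With these inputs the “likely” annual outcome spans roughly -10% to +70%, which makes the point about dispersion concrete.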

You can also look at varying the weights slightly to heavily and seeing the ‘stability’ of the ranking.

Beyond that, all factors and styles can have varying performance with huge (and fairly long) cycles. If you can find the ‘style’ and ‘factor’ reasons you think the system worked, that can help you monitor the specific ‘index’ and see when it starts to reverse or fail.

You can also, if you can program, take the full period backtest results and then randomize the order in which the various weekly returns occur and then look at rolling 12 month results. This is another way to test, but won’t necessarily give great insights on an overly optimized system.

My best performing ‘portfolios’ are systems I have little or no money in, because I have little confidence in their year to year stability.

You also DEFINITELY want to analyze your real money, total port results, not just look at your winners / best system.

There is survivorship bias.

No, there isn’t. A “bias” here is the tendency of a given statistic measured on a “random” sample to differ from that statistic measured on the population. Suppose the question were “How well did professional fund managers perform over the last 10 years?” The population is “All of the professional fund managers who managed funds at any time during the last 10 years”. However, the data that is readily available for manager performance only includes managers/funds that are around today. The funds/managers that had a losing year sometime in the last decade were fired and replaced. So a sample of this readily available data will have a “survivorship bias”: statistics measured on samples will not trend towards the statistics of the true population as the number/size of samples increases. The performance of other systems you have built and discarded is not part of the population whose average you are estimating, thus no survivorship bias.

Most statistical tests have biases. For example, suppose I have 1000 people who are normally distributed for height. Those 1000 people are the population I’m interested in. I can measure the standard deviation of the whole population using the equation SQRT( SUM_FOR_ALL_Y( (y - ybar)^2 ) / NUM_Y). Or I could estimate the population’s SD by taking a random sample of 10 people from the population and measuring the standard deviation of their heights. However, I have to modify my equation for standard deviation because otherwise it will have a bias: it will under/over-estimate the population’s standard deviation. So I use: SQRT( SUM_FOR_ALL_Y( (y - ybar)^2 ) / (NUM_Y-1)) or SQRT( SUM_FOR_ALL_Y( (y - ybar)^2 ) / (NUM_Y-1.5)) depending upon the application. Excel provides STDEV and STDEVP for standard deviations of a sample vs population.
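
The height example above can be checked directly in a few lines of Python; this is just a simulation of the bias correction, with made-up heights:

```python
# Demonstration of the bias described above: dividing by n (Excel's
# STDEVP) systematically underestimates the population SD when
# applied to small samples; dividing by n - 1 (Excel's STDEV) is the
# usual correction. The 1000 "heights" are simulated.
import numpy as np

rng = np.random.default_rng(2)
population = rng.normal(170, 10, size=1000)   # the population of interest
true_sd = np.std(population)                  # divide by n: population SD

# Average the naive (n) and corrected (n - 1) estimates over many
# random samples of 10 people each.
naive, corrected = [], []
for _ in range(5000):
    sample = rng.choice(population, size=10, replace=False)
    naive.append(np.std(sample, ddof=0))       # STDEVP-style
    corrected.append(np.std(sample, ddof=1))   # STDEV-style

print(f"true SD {true_sd:.2f}, naive avg {np.mean(naive):.2f}, "
      f"corrected avg {np.mean(corrected):.2f}")
```

Even the n - 1 version slightly underestimates the SD on average (the correction makes the *variance* unbiased, not the SD), which is presumably why the n - 1.5 variant gets mentioned.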

I continue to think about what is meant by “population.” Notice I did not really define it, and I have yet to see it well defined in any books or on the net.

Continuing the thought: yes, defining “population” has to be the starting point, and it is where your analysis got off track. The math that you are currently doing assumes that consecutive weeks are independent, but they are not; they are dependent in two ways: 1) the performance of a week correlates to some extent with the performance of the prior week (the relative success of any price-based technical indicator tells you this); 2) the performance of the portfolio over those weeks is the product of its weekly returns, not their sum. So asking about the “average” weekly return gives a meaningless answer. You could reasonably ask about the geometric average (which is related to what p123 calls Annualized Return), but that still doesn’t get around the interdependence of consecutive weeks.
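
A tiny example of the product-versus-sum point, with made-up weekly returns: a series that averages to zero arithmetically still loses money when compounded.

```python
# Why the arithmetic 'average' weekly return is misleading: total
# return is the product of weekly (1 + r) factors, not their sum.
# The four weekly returns are hypothetical.
import numpy as np

weekly = np.array([0.10, -0.10, 0.10, -0.10])

arith_mean = weekly.mean()                       # 0.0 -- looks flat
total = np.prod(1 + weekly) - 1                  # compounded: a loss
geo_mean = (1 + total) ** (1 / len(weekly)) - 1  # geometric average

print(f"arithmetic {arith_mean:.4f}, total {total:.4f}, geometric {geo_mean:.4f}")
```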

Perhaps more helpfully, you could ask about your portfolio’s performance on a given week versus a random portfolio (of equal size) on that week. So the population is “All Possible 5 Stock portfolios on Week N”. The null hypothesis is “My 5 stock portfolio on Week N was chosen at random”. The universe I use is basically the top 3000 or so stocks according to liquidity. On a given week, there are [3000 choose 5] possible 5 stock portfolios (Google says that’s 2,000,000,000,000,000). This population is far from normally distributed. I know, because I have measured, that for individual stocks in this universe, a little over 50% of the stocks lose money on any given week. 60% do worse than the average for that week. 20% do very much worse and 20% do very much better. Doing random samples of 5 stocks from this universe narrows the range of possibilities, but the distribution remains very similar. Asking “How much better than random am I” can’t be easily answered even with all of that data. There are weeks when my 5 stock port is in the top 10% of possible 5 stock portfolios (it did better than 90+% of the possible combinations of 5 stocks that could have been held that week). There are weeks when my 5 stock port is in the bottom 10%. It averages in the top 35%. Does that tell me how much it will make next week? No.
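
A rough sketch of this in Python, with a hypothetical skewed distribution of one-week stock returns standing in for the real universe: averaging 5 random stocks narrows the spread, but the shape of the distribution changes far less.

```python
# Sketch of the point above: sample random 5-stock portfolios from a
# skewed universe of one-week stock returns and compare their spread
# with single stocks. The universe distribution is hypothetical, not
# measured data.
import numpy as np

rng = np.random.default_rng(3)
# Skewed universe: most stocks slightly down, a fat right tail.
universe = rng.lognormal(mean=-0.01, sigma=0.08, size=3000) - 1

# 100,000 random 5-stock equal-weight portfolios for one week.
ports = rng.choice(universe, size=(100_000, 5)).mean(axis=1)

pct_losing_stocks = (universe < 0).mean()
print(f"losing stocks: {pct_losing_stocks:.0%}")
print(f"stock spread (5th-95th pct): "
      f"{np.percentile(universe, 5):+.3f} to {np.percentile(universe, 95):+.3f}")
print(f"5-stock port spread (5th-95th pct): "
      f"{np.percentile(ports, 5):+.3f} to {np.percentile(ports, 95):+.3f}")
```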

There is yet another way to try to answer the question “How much better than random am I”. Instead of looking at a single week or averages of weeks, I can look at performance over the whole time period. So I can say my 5 stock port made X1% from week 1 to week N, and here are the total returns of 200 random (sell rule true, ranking system of “Random”, port size 5) portfolios over that same N week period. You’ll notice two big things about the random returns: 1) they are much closer to normally distributed than the individual weeks (though they are still fat tailed); 2) the average of the total returns is much lower than the universe’s average total return over that time period. In any case, at this point you can start applying some statistical tests to answer the question “How likely is it that a 5 stock portfolio of randomly selected stocks would perform as well as my 5 stock port?” Maybe you calculate the mean and standard deviation of the random ports, measure how many standard deviations out your port’s performance is, and then use a cumulative distribution function to calculate the odds. Maybe you use that same process but with a log-normal distribution, or with formulas that permit you to customize a distribution’s skew and kurtosis. Say you get a number you’re happy with. Say it’s 9 standard deviations. At 4.25 SDs, the odds are 1 in 100,000. What do you know now? Does this tell you what your return will be next week? Does it tell you how “robust” your system is? It is not particularly difficult to beat a portfolio of 5 random stocks over any reasonable period of time. On my universe a random 500 stock portfolio will beat a random 5 stock portfolio almost 60% of the time in terms of total return over 15 years.
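
Here is a hedged sketch of the standard-deviation/CDF approach just described. The random-port totals are simulated stand-ins for the 200 random-sim results, and the normal tail probability is only approximate given the fat tails mentioned above.

```python
# Sketch of the SD / CDF approach: measure how far the port's total
# return sits from the mean of many random ports, then convert to a
# normal-tail probability. The random totals are simulated stand-ins;
# only my_total (94.61%) comes from the thread. The normal assumption
# understates the odds when the tails are fat.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
random_totals = rng.normal(0.05, 0.25, size=200)  # 200 random-port returns
my_total = 0.9461                                  # my port's total return

z = (my_total - random_totals.mean()) / random_totals.std(ddof=1)
odds = norm.sf(z)  # upper-tail probability under the normal assumption
print(f"z = {z:.2f}, P(random >= mine) ~ {odds:.2e}")
```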

I did not do it in excel; however, all of the data crunching I did could have been done using p123 and excel. The tedious work is getting the data for your universe’s performance each week for the period you’re interested in. Create a ranking system with “Random” as the only factor. Create a single stock sim with that as the ranking system, and a sell rule of true. Run the sim a few hundred times and download the weekly performance data each run. Crunch data until you arrive at a lie you like.

Tom,

Agree. For example, I definitely look at my total real returns at Folio Investing and can compare my results to the S&P 500 and the Russell 2000 easily.

My ultimate question remains for each port: Is it really better than just investing in SPY or IWM? There are many good ways to get answers to this.

SUpirate1081,

I want to take the time to look at all of your post. As far as survivorship bias goes, that probably is not the correct term. I was only trying to say that if I had 100 ports and presented one that looked pretty good, it may not really be significant no matter what the p-value or AR is. It could just be that one of my 100 ports did well by chance.

Great post. I will keep studying it.

Jim,

I just read this thread and have some questions about how you ran the random sims. First, I assume that you used the public Random ranking system. For the t-test to have useful results, the random test sims’ buy and sell rules would have to match your Port’s rules. The random tests should also hold the bought stocks for about the same length of time as your Port does.

Do you have a rank sell rule? If so, there is a good chance that you may be comparing apples and oranges. I assume that your Port holds its stocks significantly longer than 1 week. However, if you have a rank sell rule, the random tests that you ran probably hold the stocks for a much shorter length of time, on average probably only a little longer than 1 week, since the rank values change randomly every rebalance. A comparison of stocks held for a little over 1 week doesn’t have a realistic statistical relationship with one that holds them for months, and your Port may have other sell rules to sell laggards and/or losers while letting the winners run. If a rank sell rule is selling the holdings after only about 1 week, the effects of the other sell rules can’t be evaluated in the test.

To put it in a way I know you will be familiar with: it is like comparing the tumor size change in a group of cancer patients given a new drug for 3 months against a control group given the same drug but tested after only 1 week. There is no valid comparison.

Denny and all,

Thanks guys. Good points for me to consider over the next few days. Some of my approach was copied from the CFA study manual at Investopedia. They have an example where they compare mutual funds using a paired t-test. They use quarterly returns to get an n = 40 rather than annual returns: similar to what I did with weekly returns.

As for the most (potentially) useful idea, I’m most interested in comparing to a benchmark. Link for anyone interested: here.

Denny,

I appreciate your feedback and wanted to respond more fully. I used random < 0.8 as the only rule in my rank for the random picks. My sell rule was Rank < 101. My port used RankPos > 5 as my only buy/sell rule, mostly. I recently added Eval(GainPct > 25, .25, 0) to the sell rule. Also I was using some timing rules for a while.

I would love to match each real trade with a random stock held over exactly the same time period, but I am just too lazy. I was thinking that buying a random stock for a week, selling it, then buying a different random stock (with no transaction costs) might be a close approximation of holding a random stock for 2 weeks. Similar argument for 3 weeks.

Whether this is an adequate approximation or not, I think comparing to a benchmark rather than random stock picks seems more potentially useful as random stock picks could be a little bit…well random. So, I don’t want to waste too much forum space unless people are really more interested in this than comparing to a benchmark. Comparing to the benchmark may be worth discussing in detail.

As for sampling problems (e.g., SUpirate1081): I agree. Problem is it’s the only sample I got. I’m not going to be able to get away from that. I will continue to look for any reasonable inferences.

Finally, SUpirate1081, compounding is a moot point if the null hypothesis is true. A zero return is still 0 no matter how many times you compound it. If I reject the null hypothesis, I can start to count my money or calculate how much it will be.

If useful at all, this is very easy to do. All your ideas for improvement are much appreciated. I remain open to scrapping it if there is a fatal flaw.

Jim,

A t-test depends on a few assumptions that probably do not hold in this case. The population (the returns in this case) should either be distributed normally OR they should be i.i.d. (independent and identically distributed). If they are i.i.d. then the average (of the returns) is approximately normally distributed when you have many weeks (large N, central limit theorem), which ensures the normality assumption is satisfied anyway.

Everybody agrees that returns are not normally distributed, so you need to have many i.i.d. returns. SUpirate already pointed out the issue with independence, but the “identically” part also likely does not hold, because statistical properties of stock returns like the (true) average and standard deviation change over time (this is called “non-stationary”). So the expected average stock return in one week is not the “same” as in another week, violating the “identically” assumption.

You probably got fairly high p-values (0.14 - 0.16) because the weekly variance of returns is quite large compared to the weekly outperformance of your port. Even with i.i.d. samples you’d need quite a lot of samples to get decent p-values. Many more than 30.

To work with compounding returns you can take the natural logarithm of the returns before applying the analysis. The logarithm transforms something multiplicative in nature to something additive. You’d take log(1+r) where r is the simple return (e.g. 0.05 = 5% return). However, for small r, log(1+r) is approximately equal to r so it actually doesn’t matter. Small is generally considered less than 20% or so. For weekly returns that is no problem.
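
A quick sketch of the log transform with a few made-up weekly returns; note how close log(1 + r) is to r at these magnitudes.

```python
# Log returns turn compounding (multiplicative) into addition, and
# log(1 + r) ~ r for small r, as noted above. The weekly returns
# here are made up for illustration.
import numpy as np

r = np.array([0.02, -0.01, 0.03])   # small weekly simple returns
log_r = np.log1p(r)                 # log(1 + r), numerically stable

# Sum of log returns equals the log of the compounded total return.
total = np.prod(1 + r) - 1
assert np.isclose(log_r.sum(), np.log1p(total))

print(f"r = {r}, log(1+r) = {np.round(log_r, 4)}")  # nearly identical
```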

You were asking what the population actually is. When returns are stationary, that would be “all returns in the past, present, and future”. Note the “and future”. Any inference you do on the past implicitly assumes that the future will be like the past. But returns are non-stationary! That does not mean that the population changes, but it does mean that the past may not be very informative about the future.

Classical (frequentist) statistics assumes that you can repeat an experiment any number of times, for example sample an unlimited number of stocks. That conflicts with the fact that there is a limited set of stocks and a limited time frame. Maybe this is why you’re not sure what the population is. Btw, Bayesian statistics does not make this particular assumption, but it is much more complicated, needs other assumptions (the “prior”), and also does not solve your non-stationarity problem.

Personally, I would do what Tom suggested in the second post in this thread. To get more precise estimates, run many more than 100 random sims (under conditions similar to the port’s, as has been pointed out). This makes fewer assumptions than a t-test, and the fraction of random sim runs that beat your port has exactly the same interpretation as a p-value from a t-test.
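
As a sketch of this counting approach (the random-sim ARs below are simulated stand-ins, not actual P123 results; only my_ar, 57.02%, comes from the thread):

```python
# The fraction of random sims that beat the port is itself an
# empirical p-value, as described above. The random-sim ARs here are
# simulated stand-ins for the actual sim results.
import numpy as np

rng = np.random.default_rng(5)
random_ars = rng.normal(0.02, 0.15, size=1000)  # ARs of 1000 random sims
my_ar = 0.5702                                   # my port's AR

# One-sided empirical p-value, with the +1 correction so the
# estimate is never exactly zero.
beat = (random_ars >= my_ar).sum()
p_empirical = (beat + 1) / (len(random_ars) + 1)
print(f"{beat} of {len(random_ars)} random sims beat the port; "
      f"p <= {p_empirical:.4f}")
```

The +1 correction reflects that with N sims you can only bound the p-value at about 1/(N+1), which is one reason to run many more than 100 sims.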

Regards,
Peter

Peter,

Edited: I re-read your post. You really know this. I’m going to research what you have written.

I use Tom’s method a lot (he has had similar posts before). Guessing for now, I think it is too optimistic, as it includes no information about the variance of my port. However, I can see how it is like taking the Z-score for a single measurement. I will look into this further also. In any case, I agree it is useful: probably more than I realized.

Thank you!

Jim,

My biggest concern with how you set up the random tests is the one week random hold time vs. the longer time of your port. I think you can reduce that error by changing your sell rules for the random test. Instead of using Rank < 101, use NoDays > xx, where xx is the Avg Days Held value from your Port. You need to adjust xx to get about the same # of days as the Port. (Be sure to turn off any Rank < yy rule).

By using NoDays > xx you allow the other sell rules to have a similar effect on selling holdings as they do in your Port. This still won’t be quite the same, since all the random holdings that are not sold by the other sell rules will be sold after xx days, and none of them will be held as long as the longest-held stocks in your Port.

Denny,

I like that idea! I will do that.

Thanks.

Peter,

Yes. Just as you said, i.i.d. is an assumption of the central limit theorem.

Thank you so much!