backtesting, data-mining, bootstrapping, and Jim O'Shaughnessy

I found this blog post by Jim O’Shaughnessy quite inspirational:

http://jimoshaughnessy.tumblr.com/post/98397869279/the-power-of-back-testing-investment-strategies

In particular, I was taken with his bootstrapping technique. Essentially, he splits his universe randomly into two universes and splits his time period randomly into two time periods. This provides four different data sets. He does this 100 times. In something else he wrote, which I can’t find right now, he takes the results of his bootstrapped tests and averages them to get an optimal strategy.

There are various ways to split your universe on P123. The easiest is to use EvenID = 0 and EvenID = 1. Another idea I came up with is this one: Trunc(100*Mod(Subindustry/2793,0.02)) = 0 or = 1. (The number 2793 is just a number I chose at random; others will work just as well.) This will give you two roughly equal universes, split fairly randomly by subindustry rather than by ticker. Using different numbers in place of 2793 will give you different splits. (Don’t use the Random command to split your universe, because every week you’ll end up with different stocks.)
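
If it helps to see what that expression is doing, here is a rough sketch of the same arithmetic in R rather than P123 syntax (the subindustry codes below are made up, purely for illustration):

subindustry <- c(10101010, 10102010, 15103010, 20201050, 25301020, 30202030)  # made-up codes
bucket <- trunc(100 * ((subindustry / 2793) %% 0.02))   # deterministic 0/1 label per subindustry
split(subindustry, bucket)                              # groups the codes into bucket 0 and bucket 1, stable week to week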

I did a test of optimizing four strategies for four different half-universes and then averaging the results, then compared that to a strategy optimized for the whole universe. I did this over several different six-year periods and then tested the subsequent two-year performance of each. The difference was quite remarkable–the averaged strategies outperformed the single-universe optimized strategy by a lot.

So I wanted to share this idea. I know we’ve talked about bootstrapping before, but I never understood how to make it work very well until I read O’Shaughnessy.

Thanks Yuval!

For those who want to do some bootstrapping in R:

  1. Download the “boot” package.

  2. Load your data into R from an Excel spreadsheet, with the column heading lnxs, and attach the data frame so the column is available by name (see the loading sketch after the script below). Or you can change the script instead (change lnxs in the script to your column heading). It is named lnxs because I tend to use the log of excess returns. Excess returns can be downloaded from “Statistics” in a P123 port. But you can bootstrap almost anything.

  3. Run this script (which will also give you t-test results):

library(boot)                                   # load the boot package
simx <- function(lnxs, d) { mean(lnxs[d]) }     # statistic to bootstrap: the mean of the resampled returns
simxx <- boot(lnxs, simx, R = 100000)           # 100,000 bootstrap resamples
plot(simxx)                                     # histogram and Q-Q plot of the bootstrap distribution
ci <- boot.ci(simxx, type = c("bca", "perc", "norm", "basic"), conf = 0.95)
ci                                              # print the confidence intervals
t.test(lnxs, conf.level = 0.95)                 # t-test for comparison
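
For step 2 above, here is one way to get the data in before running the script, assuming you saved the spreadsheet as a CSV (the file name is just a placeholder):

excess <- read.csv("excess_returns.csv")   # placeholder name; any CSV with an lnxs column works
attach(excess)                             # makes the lnxs column visible by name to the script
head(lnxs)                                 # quick sanity check that the column loaded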

I looked at a couple of texts and bca is the way to go; “perc”, “norm”, etc. are included for completeness here (they will all be in the output and labelled). The t-test and bca tend to agree, in my limited experience.

FWIW. Maybe not exactly what O’Shaughnessy is doing, but you are free to modify this, of course. Then again, I am not entirely sure I know every type of bootstrapping O’Shaughnessy is doing from this very short article. There are limits to what one can do and still call it bootstrapping. So, for sure, he is doing a little more, mathematically, than the brief description in the article. And again, feel free to modify.

I think this is better than any rolling backtest and I am pretty sure that anyone who can do much with Excel can do this.

Yuval, I know you like the median over the mean for many applications you have posted about. You can also bootstrap the median, which may be useful for skewed distributions. Or better: it can be useful for symmetrical distributions with asymmetrical outliers in the sample due to a small sample size. The median does seem to work, but it can give a rather funny-looking (not so continuous) plot.
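
For example, the only change needed to the script above is the statistic function (same lnxs column as before):

simmed <- function(lnxs, d) { median(lnxs[d]) }   # resampled median instead of the mean
simmedx <- boot(lnxs, simmed, R = 100000)
plot(simmedx)                                     # this is the plot that can look step-like
boot.ci(simmedx, type = "bca", conf = 0.95)       # bca interval for the median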

-Jim

Thank you for sharing this, Yuval.

Some good ideas here.

Just a quick question: when we do EvenID = 0 and EvenID = 1, we are basically splitting the universe to test IS and OOS. However, doesn’t the rolling backtest serve as an even stricter OOS test? Meaning that if you start your test every week, then you have 51 OOS data points per year.

ivillalongabarreiro,

That sounds about right if you have holding periods of twelve months and run it forward weekly.

On the other hand, if you have weekly rebalancing/holding, then there is nothing held OOS.

//dpa

Hi Primus,

Yeah, that is true: if you do weekly rebalancing, then there would not be any OOS. On the other hand, if all you use is fundamental TTM data, you would not have OOS either, right? Those data only come out once a year.

Thanks,

Thanks for the tip on this, Yuval. I’ll have to take time to figure out what’s going on with the way the subindustry calc is divvying up the universe.

I’m curious, do you use the average monthly returns and the standard deviation of those returns from the rolling backtests as part of your analysis? I realize average monthly returns are not the same as compounded returns and could be deceptive, but at the same time they “seem” to provide useful and helpful aggregate data.

Maybe you could take advantage of the fact that the geometric mean of a lognormal variable differs from the arithmetic mean by one-half the variance of its logarithm. This of course assumes that the increments of logarithmic returns are both independent and normally distributed. But it is an elegant modeling assumption and easy to implement.

See: Log-normal distribution - Wikipedia
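
A quick numeric check of that relationship, just a simulation sketch with made-up parameters:

set.seed(1)
z <- rnorm(1e6, mean = 0.008, sd = 0.05)   # made-up monthly log returns
gross <- exp(z)                            # lognormal gross returns
log(mean(gross)) - mean(z)                 # log of arithmetic mean minus log of geometric mean...
var(z) / 2                                 # ...is roughly one-half the variance of the logarithm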

Concerning what is defined as in-sample and what is out-of-sample, my understanding is that whatever P123 provides for data is in-sample.
Whatever happened before their data begins (1/1/1999) or after today is out-of-sample.

So running rolling (or really any kind of) tests using P123 data is by definition in-sample, not out-of-sample.

Is my understanding of the definitions correct?

Really appreciate the conversation here guys… I had run across the O’Shaughnessy article as well and found it interesting.

Along these lines, it would be great if P123 could improve the Optimizer functionality to allow for more efficient bootstrapping of Ranking Systems. In particular, when I run the Optimizer on a Ranking System to test its performance across multiple time periods and subsections of the market, I should be able to pull out the performance of each decile of the ranking system, not just the slope, etc., of the deciles.

Not quite.

In sample refers to the testing period(s) you use. Out of sample refers to periods not tested.

If you run a max test on p123, then you can’t do an out-of-sample test until some time lapses, a few months, a year or so, which may or may not involve use of real money. This is why it can be so incredibly important to not rely solely on the test and make sure you know your strategy should work, even without testing to confirm.

To do an out of sample test on p123, you would have to carve out a subset of the available time periods, develop your model based on that, and then test it on other periods to see if it works. In other words, you might test from 1/1/2004 to 1/1/2014 and do whatever you need to do in order to get the model into finished form. You could then try it out from 1/1/2014 to the present and treat that as an out-of-sample test.

In a purely statistical sense, you could reverse it. Build the model based on testing from 1/1/2014 - present, and then examine 1/1/2004-1/1/2014 as the out of sample period. Fundamentally speaking, though, I would not recommend this since you would increase the risk of structural changes in the market giving you a set of out of sample results that could differ very much from, say, 5/1/2018-5/1/2022.
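
In R terms, just as a sketch with toy data (the dates and returns below are made up), the carve-out amounts to something like this:

set.seed(1)
dates <- seq(as.Date("2004-01-02"), as.Date("2022-04-29"), by = "week")
rets  <- data.frame(Date = dates, ret = rnorm(length(dates), 0.002, 0.03))  # toy weekly returns, illustration only
dev   <- subset(rets, Date <  as.Date("2014-01-01"))   # develop and tune the model here
hold  <- subset(rets, Date >= as.Date("2014-01-01"))   # evaluate once here and treat it as out of sample
c(dev = mean(dev$ret), hold = mean(hold$ret))          # compare the two periods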

I couldn’t derive the relationship between arithmetic and geometric returns myself, so I found this paper useful: “On the Relationship between Arithmetic and Geometric Returns.”

Walter

Walter,

Thanks.

Edit: The formula in the image is commonly used. It is the one I have used. I posted the image thinking it was pretty well accepted (before I thoroughly read the paper). As you know the author of the paper is not a big fan of this approximation and I apologize for being overly simple in my post.

Interesting!

-Jim


Thanks Walter and Jim,

Very interesting, indeed.

It is also interesting that the paper didn’t even mention Ito’s lemma (or the Stratonovich integral), whence the canonical convexity adjustment (i.e., one-half the variance) is derived. The convexity adjustment is often seen as a consequence of Jensen’s inequality, since when you take the exponential of a linear function, you turn it into one that is convex up. The convexity adjustment is then really just a correction factor which, while not 100% accurate, is often close enough.

Anyhoo, I know this is off topic from the OOS vs in-sample discussion, but it seems to me that if one is that concerned about geometric means, one could more easily calculate those directly. It’s not like there’s a shortage of computing power…
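
For instance, given a vector of period returns, the geometric average is a one-liner (made-up numbers, just a sketch):

r <- c(0.04, -0.02, 0.07, -0.05, 0.03)        # made-up period returns
arith <- mean(r)                              # arithmetic average return
geom  <- prod(1 + r)^(1 / length(r)) - 1      # geometric (compounded) average, computed directly
c(arith = arith, geom = geom, approx = arith - var(r) / 2)   # vs. the one-half-variance approximation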

Agreed, it’s a bit off topic but still useful I think.

If my memory is correct, when I first started seriously using p123, a model’s average daily return was bandied about as a quality metric. The higher the ADR, the better. That appears to have fallen out of favor. Perhaps that’s because it didn’t account for the effect of volatility on geometric average return - the metric we really care about.

Walter

LOL. After trying to put the other formulas into Excel, I think I understand why the simpler formula might be used. I promise I tried, but I never could get A4 in the paper to calc properly to match the data in the paper. Math dummies like me will have to settle :wink: Based on the paper, it looks like A4 is the best way to adjust the average return in rolling backtests when comparing strategies with higher volatility, like small caps.

Edit: I just got the calc to match! V = StdDev^2. This is useful to me. Thanks, wwasilev, for the link to the paper!

Walter,

Excellent point! As you know, another name for this is volatility drag. But these formulas also illustrate why there must be some “volatility harvesting” going on in our ports. Using the simplified formula:

Geometric mean = Arithmetic mean - (standard deviation^2)/2

If you assume each of the stocks in a 5-stock model has about the same return (on average), then the arithmetic mean return for all the stocks combined in a port is about the same as the return of an individual stock in the port (on average over a long period). But by combining stocks that are not fully correlated, the standard deviation for the port is reduced compared to the standard deviation for individual stocks.

So in this equation, the 5 stocks combined will have the same arithmetic mean return as an individual stock would, but the combined stocks are not 100% correlated and the standard deviation is reduced.

Or put simply, combining the 5 stocks will reduce the standard deviation compared to a single stock, and the geometric return will increase in this equation.

I believe this fits the definition of volatility harvesting, and it would be hard for it to be the case that we are not doing a little harvesting in our ports. How much volatility harvesting we are actually doing depends on how correlated our stocks are and how volatile the individual stocks are. And as explicitly stated in this simplified proof, it does assume the stocks ranked 1 thru 5 have about the same returns and that the assumptions in the derivation of the equation are correct (e.g., lognormality).
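
Here is a toy simulation of that argument (just a sketch, with made-up return parameters, uncorrelated stocks for simplicity, and an equal-weight port rebalanced every period):

set.seed(1)
n <- 240                                              # made-up: 240 monthly periods
stocks <- matrix(rnorm(n * 5, 0.01, 0.10), n, 5)      # 5 uncorrelated stocks, same mean and volatility
port <- rowMeans(stocks)                              # equal-weight port, rebalanced each period
geo <- function(r) exp(mean(log(1 + r))) - 1          # compounded (geometric) mean return
c(arith_stock = mean(stocks[, 1]), arith_port = mean(port))   # similar arithmetic means
c(geo_stock = geo(stocks[, 1]), geo_port = geo(port))         # the port tends to have the higher geometric mean
c(sd_stock = sd(stocks[, 1]), sd_port = sd(port))             # because its standard deviation is lower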

-Jim