Overfitting: A visual example

ANN is an artificial neural network. She (Ann) is like I was when I first started at P123. She has one purpose: to continuously optimize and improve her fit to the in-sample data she is given.

She does this well. The top graph shows continuous improvement as she works on optimizing her fit to the data (lower is better).

For a while her out-of-sample results improve too (second image). But after a while (about epoch 30 on the x-axis) she starts to overfit. Her in-sample results continue to improve but the out-of-sample results relentlessly and continuously worsen.

This is just the best graphical example of overfitting I have encountered. The reference to Ann is incidental. Ann just keeps track of her results better than I have. This is just something that helps me understand my own overfitting a little better.

-Jim



What does the y axis represent?

Philip,

Jargon alert (DEFINITELY recommend skipping this): This is the loss or cost function. In this case, the loss function is logcosh which deals well with outliers and non-normal distributions.

Less jargon, but you may have had this in class: it can be thought of as no different from a linear regression, which minimizes (makes as small as possible) the root mean squared error.

So it is a measure of the error of the predicted returns compared to the actual returns of a stock; there are many error functions available.

To the point and accurate: it tells you how far off the model is in its predictions of the best stocks to buy. And it is therefore, I think, applicable to what we do.
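If it helps, here is a minimal sketch of what logcosh computes, in plain NumPy with made-up numbers (nothing here comes from Ann or from P123 data):

```python
import numpy as np

def logcosh_loss(predicted_returns, actual_returns):
    # Roughly squared error for small errors, roughly absolute error for
    # large ones, so outliers are penalized less harshly than with RMSE.
    error = predicted_returns - actual_returns
    return np.mean(np.log(np.cosh(error)))

# Toy example: lower is better.
predicted = np.array([0.02, -0.01, 0.05])
actual = np.array([0.03, -0.02, 0.20])
print(logcosh_loss(predicted, actual))
```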

-Jim

Thanks for trying to enlighten us.

As an ignoramus in neural nets, can you please help me understand whether 30 refers to data points or variables? (Variables would be equivalent to factors; data points would be performance.)

LOL;-)

Each epoch is directly equivalent to a rank performance test you run. This is the simple answer to your question.

So to continue the example: initially, when you add factors and adjust the weights, your (and Ann’s) in-sample and out-of-sample results both improve. They improve with each epoch for her and with each rank performance test you run.

Some time before you get to adding data on the business school the CEO graduated from, it is only the in-sample results that are still improving. You are unaware that you are harming your out-of-sample results.

But ANN can tell exactly when this harm starts happening with her holdout data. We have to wait a few years for our out-of-sample data (assuming we have kept good enough records).
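For the curious, this is roughly how a holdout set can flag that point automatically in a Keras-style workflow. It is only a sketch, not Ann’s actual setup; the layer sizes are placeholders and the training arrays are assumed to already exist:

```python
import tensorflow as tf

# Assumed to exist: X_train, y_train (in-sample factors and returns)
#                   X_holdout, y_holdout (out-of-sample holdout data)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="logcosh")

# Stop training once the holdout loss has not improved for 5 epochs and
# roll back to the best weights (roughly "epoch 30" in the graphs above).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_holdout, y_holdout),
          epochs=100, callbacks=[early_stop])
```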

-Jim

thanks Jim

I saw an example where you have scatter plots and you try to fit a line of best fit. As you add more variables the line turns into a curve and then a squiggle. When you overfit, the contorted line eventually intersects each point. But the problem is that by that point you have tweaked the best-fit line to also fit the noise.
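For what it’s worth, that squiggly-line picture is easy to reproduce with a few lines of toy code (random data, nothing from the market):

```python
import numpy as np

# Toy data: a straight line plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

# A low-degree fit captures the trend; a degree-14 polynomial passes
# through every point, which means it has also fit the noise.
for degree in (1, 3, 14):
    coeffs = np.polyfit(x, y, degree)
    in_sample_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(in_sample_error, 4))
```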

The market is apparently noisy. That doesn’t mean the noise is necessarily random, but this noise can appear random. We need to first develop sound models in theory and then test those. However, there is always some refinement; it’s inevitable. The point is that we actually want to ignore that noise because we can’t explain it. If it can’t be explained, it is dangerous to model it. So the key is understanding what actually improves our model through causality and not just correlation. Correlation without grounding in causal theory is noise fitting. I inquired about noise injection because it is one tool to break a model. If a model is curve-fit to noise, it will break very easily when noise is added to or subtracted from the mix. Is this an end-all? Absolutely not. Models have to be grounded in a sound theory of why the securities being picked are mispriced and have potential to outperform.
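The kind of noise-injection check I have in mind might look roughly like this. It is purely a sketch: the model and test arrays are assumed to exist, the model is assumed to have an sklearn-style score() method, and the noise scale is arbitrary:

```python
import numpy as np

def noise_robustness(model, X_test, y_test, scale=0.05, trials=20, seed=0):
    # Re-score the model on inputs perturbed by small Gaussian noise.
    # A model that is curve-fit to noise tends to degrade sharply here.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        X_noisy = X_test + rng.normal(scale=scale, size=X_test.shape)
        scores.append(model.score(X_noisy, y_test))
    return np.mean(scores), np.std(scores)
```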

Jeff

As far as ANN goes, I question how helpful it might be, though. If you are fitting in-sample data and keep optimizing, and then ANN tells you that you are now overfitting, that is relative. How do you know you weren’t overfit to the in-sample and out-of-sample data before ANN flagged that there was a problem? You don’t. What prevents you from then trying to fit more after ANN tells you a change was bad? Perhaps I am not understanding something, or am misunderstanding.

I actually think that out-of-sample testing doesn’t really work, for a few reasons. The first problem is that most models, even good models, won’t outperform every year. If you choose a really good in-sample period and a really bad out-of-sample period, you might think you have a bad model when you actually have a good model. The reverse is also true. The second problem is that if your out-of-sample data stinks, do you just forget your model? No. Usually people will try to improve it to get better out-of-sample results. Now your out-of-sample data is in-sample. The last problem is that even if you arrive at both good in-sample and good out-of-sample results, how do you know your model is not overfit? Again, you don’t.

Marc is right. It comes down to sound financial theory. Even a sound model can stop outperforming or even underperform for a period of time. What worries me most about a model is not a lack of outperformance but it simply falling apart when I start trading it.

Jeff

Jeff,

You are EXACTLY RIGHT about this. In fact, IMHO, you could not be more right.

That is why I defended you so strongly on your ideas about adding noise to the data.

If you are interested, I will show you Ann 2.0, who uses some of your ideas. Your ideas have been studied by numerous Ph.D.s for decades now and have been expanded upon just a bit.

Anyway, Ann 2.0 does not overfit. Ann 2.0 does not overfit because you were right.

-Jim

Jim,

Yeah I’d be curious to see it.

Thanks,

Jeff

Running now.

I will attach it in the morning.

-Jim

Jeff,

There is an incredible amount of literature about your ideas on randomization. It is often called regularization. The only debate in the literature is about which technique to use and whether some of the techniques can be used together.

“Dropout” is used in Ann 2.0 and was turned off in Ann. Nothing else is different or changed. Both Ann and Ann 2.0 use batch normalization, which also serves as a regularizer.

The first image shows the fitting of the in-sample data. Notice that the ability to fit the data levels off. Ann 2.0 is not able to fit as much noise in the in-sample data as Ann was able to.

The x-axis is epochs. Each epoch is a new attempt by Ann to improve the fit of the data (analogous to each of our attempts to improve a rank performance test). The y-axis is probably best described as how much error there is in her attempt. Less error is better.

The second image clearly shows reduced overfitting of the out-of-sample data compared to Ann 1.0 above. The error goes to a minimum and does not rise again. Rising again–which was present with Ann 1.0–would be the hallmark of overfitting.

Dropout was introduced by Geoffrey Hinton in 2012, so it is pretty new. Dropout randomly turns off some of the neurons in Ann 2.0 (different neurons are randomly selected in each epoch). This is the randomization method. This is the source of the noise that is being introduced. This is the method that, somewhat paradoxically, reduces the fitting of noise in the data. This is just the tip of the iceberg as far as effective methods go. Dropout is one of the most used now. L2 regularization is used everywhere, including with ridge linear regression, and works well in artificial neural networks.
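To make the difference concrete, this is roughly what toggling dropout looks like in Keras. It is a sketch only; the layer sizes and the 0.5 dropout rate are placeholders, not Ann’s actual settings:

```python
import tensorflow as tf

def build_ann(n_features, use_dropout):
    # Ann (use_dropout=False) vs. Ann 2.0 (use_dropout=True).
    # Both use batch normalization; only dropout is toggled.
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(64, activation="relu",
                                    input_shape=(n_features,)))
    model.add(tf.keras.layers.BatchNormalization())
    if use_dropout:
        # Randomly silences half the neurons on each training pass.
        model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(64, activation="relu"))
    model.add(tf.keras.layers.BatchNormalization())
    if use_dropout:
        model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="logcosh")
    return model

ann = build_ann(n_features=30, use_dropout=False)    # Ann
ann_2 = build_ann(n_features=30, use_dropout=True)   # Ann 2.0
```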

Geoffrey Hinton’s paper can be found: HERE

The point is: you were clearly right. The literature supports your idea. Your idea is frequently used and works in practice. This is just one example of how well your method works.

Thank you for your input. P123 benefits from an influx of new ideas.

And as Lights recommends in her aptly named Little Machines album: “Turn up the noise.”

-Jim



I really like what you wrote here. But there may be a solution to this conundrum, if I’m not very mistaken, and it’s one I’ve been using for years.

Find a large number of different strategies that were created OUTSIDE of your own testing. These are strategies that you can adapt from a book, from a website, from a series of research papers, from other users, from random variables, whatever. They should resemble the strategies you plan to use yourself in some small measure (i.e. number of stocks held, period held, kinds of variables used, etc.), but it’s probably best to avoid strategies or elements of strategies that you’ve created and backtested yourself. They can be entirely different from one another, but an important thing is not to include any strategies that strike you as complete garbage, as doing so will bias the results. (The more garbage you include, the better the results will be, and you don’t want that.) These strategies can vary in a lot of different ways: number of stocks held, universe rules, screening rules, ranking systems, rebalancing periods and techniques, etc. Or you can keep them all quite similar to each other if you want to know how that affects the results.

Run backtests of those strategies for the longest period possible and collect the results. I think you need at least 40 or 50 strategies, but the more, the better. The backtests should be as granular as possible–i.e. rolling backtests if you’re testing long holding periods, weekly rebalancing if you’re testing short holding periods.

Now chop up your backtests into in-sample and adjoining out-of-sample periods. I like using ten-year in-sample periods with three-year out-of-sample periods immediately before and after the in-sample periods. But be flexible. Do this chopping up as much as you can, even if some of the periods overlap. For example, if you can do a twenty-year test and use eight-year in-sample and three-year out-of-sample periods, you could have 12 different overlapping in-sample periods starting each year, with some of the early and late ones only matching with one out-of-sample period but some of the middle ones having two. You might want to have a six-month or year-long buffer between in-sample and out-of-sample periods so that there isn’t any overlap.
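To sketch the chopping-up step (just one way to generate the window pairs; the helper below is hypothetical and only generates the trailing out-of-sample window, not the leading one):

```python
def window_pairs(start_year, end_year, is_len=10, oos_len=3, buffer_years=0):
    # Yield (in-sample, out-of-sample) year ranges, with the OOS period
    # following the in-sample period after an optional buffer.
    pairs = []
    for is_start in range(start_year, end_year + 1):
        is_end = is_start + is_len - 1
        oos_start = is_end + buffer_years + 1
        oos_end = oos_start + oos_len - 1
        if oos_end <= end_year:
            pairs.append(((is_start, is_end), (oos_start, oos_end)))
    return pairs

# e.g. a 2000-2019 history with ten-year in-sample and three-year out-of-sample windows
for is_range, oos_range in window_pairs(2000, 2019):
    print("IS", is_range, "-> OOS", oos_range)
```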

Now you can do correlation tests, using different measures–CAGR, alpha, Sharpe ratio, whatever floats your boat. Compare the results of the 50 strategies in the in-sample period with the results in an out-of-sample period. What’s the correlation? If it’s significantly above zero, then that tells you that there is a relationship between in-sample and out-of-sample returns. If it’s close to zero or below, then that tells you that there’s no relationship at all.
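And the correlation test itself might look something like this (again a sketch; the table of per-strategy results is assumed to already exist in a CSV with hypothetical column names):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per strategy per window pair, with the chosen
# performance measure computed over the in-sample and out-of-sample periods.
# Columns: strategy, window, is_cagr, oos_cagr
results = pd.read_csv("strategy_window_results.csv")

for window, grp in results.groupby("window"):
    rho, p = spearmanr(grp["is_cagr"], grp["oos_cagr"])
    print(f"{window}: rank correlation {rho:.2f} (p = {p:.2f}) across {len(grp)} strategies")
```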

You can use this information in various ways. For example, if the fifty strategies are all based on fundamentals, or all based on technicals, that will give you an indication of whether strategies based all on fundamentals or all on technicals tend to persist out-of-sample. Another example: you can compare different measurements: you might notice that out of the following five measurements–Sharpe ratio, CAGR, information ratio, alpha, and median monthly excess return–one of them tends to correlate better with out-of-sample unadorned returns than the others. A third example: you could try to find out the length of the in-sample period that correlates best with the returns of the out-of-sample period. A fourth example: you could run correlation tests keeping the number of holdings in the out-of-sample period fixed but running the in-sample periods using twice as many, three times as many, or five times as many holdings and see what’s most correlative. Perhaps a model that holds five stocks will have absolutely no out-of-sample correlation but a model that holds fifty will.

Basically, this is an exercise that allows you to see what METHOD of backtesting and what performance measure to be applied to that method is most correlative with out-of-sample performance. It sets your mind at ease and gives a value to your backtesting. You no longer have to guess as to whether your backtests will correlate with out-of-sample results–you actually have some evidence that they will or won’t.

Of course, this is tons of work and is hardly foolproof, and it’s possible that the last twenty years may produce more or less correlative results than older or future periods. And there may be flaws in this methodology that I’ve overlooked. Also, correlations are very rough measures, and you might get very different results from one set of fifty strategies than you would from another. Certainly a difference in correlation of 0.025 between two options is completely insignificant, and the threshold for significance may be a lot higher. These correlation tests won’t tell you anything about overfitting unless you use 50 similar strategies with varying degrees of overfitting to each in-sample period, which would be so much work to produce that I can’t even contemplate it. Still, I find that this exercise helps with those times in which you lose all faith in backtests because your out-of-sample strategy is (temporarily, I hope) underperforming.

IMHO everything in this entire thread can be called regularization, validation or an ensemble technique.

All good ideas I believe. Each one could be expanded upon and made a little easier or more usable in some way.

Each idea makes sense to some of us just as the ideas made sense to the people who originally developed them and later perfected them.

Few or none are entirely new.

I think this should be evident with Jeff’s idea of randomization already. Jeff has a great idea that is already supported by a large body of published literature, with a number of ways to implement it already available.

There are textbooks with regularization in their titles.

For sure I am not the first nor the best with my ideas. Ann is a product of the men and women at Google through TensorFlow. Google made TensorFlow open source not too long ago. A lot in TensorFlow was first developed and published by people with Ph.D. behind their names, as is the case with the Geoffrey Hinton paper above.

Nope. Nothing new in my case.

And at least in Jeff’s case there was never anything about his idea that was the least bit controversial.

-Jim

Jim, Are the Epochs independent variables?

Often when we do a design of experiments, we try to use what we think are independent variables knowing that there will be some interactions.

But in designing models, I often see designers use highly correlated variables (e.g., almost all “value” measures and almost all “return” metrics).

It seems to me the financial domain is highly “noisy.” The reason a lot of designer models are failing is that they are mistaking noise for signal. Any technique that can draw this out will help us all.

Yuval,

Are you suggesting this exercise to establish good in sample and out of sample intervals and to gain confidence in those sample intervals, and then use those for future backtesting?

Jeff

Hi RT. BTW, I am enjoying your posts.

I do not think epochs are an independent variable for Ann here. Epochs are kind of like every time I run a rank performance test. I do not think each run of a rank performance test is independent. Not when I do it anyway. Each of my runs is related to the other runs, since I make only a few changes each time. I think this is also true with Ann’s methods. Each run (or epoch) is related and not independent.

As far as reducing overfitting goes, I want to stay general and show you 3 situations where people use the same general method, I think.

  1. Ann is not something I can use here at P123, but I think seeing a graph of her data may be helpful (it is for me). Notice that I divided her data into two samples. Also notice that I changed Ann into Ann 2.0 so that she had better results taking both samples into consideration. I changed her so that she had good results with both the in-sample data and a “holdout sample.” I made it so that Ann 2.0 had fewer errors taking both samples into consideration.

If you asked a professor about this, I know she would say two things. It does not eliminate overfitting, but it limits how much overfitting you can do. She would probably get all professorial and use some jargon. She would probably say it puts a “constraint” on the amount of overfitting.

But people in machine learning tend to divide samples. They do it a lot. This is called “validation” in machine learning. I am going to stay away from naming the next 2 methods. After all the name does not matter anyway.

  2. If I read Yuval’s post correctly he divides his samples sometimes. Yuval describes what he likes to see happen with the 2 samples when he does this in his post.

  3. I also divide my samples with what I do at P123. I usually divide them into the period before I started (in 2013) and the period after I started here at P123.

For a new system that I am considering putting money into, I certainly compare the period after 2013 (for the model under consideration) to what I am using and putting money into now.

Dividing the samples seems to be a tried-and-true method, used by machine learning professors (all of them, like forever), by me, and I think by Yuval, if I read his post right. I do not think there is just one best way to do this.
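A bare-bones version of the kind of split I use, assuming the backtest results are available as a dated series (the file and column names are placeholders):

```python
import pandas as pd

# Hypothetical monthly backtest returns with a date index.
returns = pd.read_csv("backtest_returns.csv", index_col=0, parse_dates=True)["return"]

# Period before I started at P123 vs. the period after (closer to out-of-sample for me).
before_2013 = returns[:"2012-12-31"]
after_2013 = returns["2013-01-01":]

print("pre-2013 mean:", before_2013.mean(), " post-2013 mean:", after_2013.mean())
```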

I am not going to claim it works miracles for me. My value models are not setting the world on fire at the moment.

-Jim

Yes, that’s right. I do other things with this exercise as well, but that’s the first and most important thing.

thanks Guys, nice discussion!