A Request that People Show The Scatterplot for Regressions and Correlations.

There are a lot of assumptions that must be met before a regression or correlation can be thought to have any meaning.

For linear regressions and correlations it would be nice, at least, to think we are talking about linear data: but I do not think that is always the case.

That the residuals have a normal distribution can be important and that the variances are constant is always important for regressions. Fortunately one can often see whether these assumptions are violated in a scatterplot too—along with the assumption of linearity.
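For anyone who wants to see what those checks amount to, here is a minimal Python sketch using only the standard library. The data is synthetic and the helper names (`fit_line`, `residuals`, `spread_ratio`) are made up for this example. The spread ratio is only a crude stand-in for eyeballing a residual plot, not a formal test of constant variance.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(xs, ys):
    """Observed y minus the fitted line, one value per point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, ys):
    """Std dev of residuals for the larger x values divided by that for the
    smaller x values. Near 1 is consistent with constant variance."""
    pairs = sorted(zip(xs, residuals(xs, ys)))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.stdev(high) / statistics.stdev(low)

# Synthetic data (an assumption for illustration, not anyone's real trades).
random.seed(0)
xs = [i / 10 for i in range(1, 201)]
even = [2.0 + 0.5 * x + random.gauss(0, 1.0) for x in xs]        # constant noise
fanned = [2.0 + 0.5 * x + random.gauss(0, 0.2 * x) for x in xs]  # noise grows with x

print("even spread ratio:  ", round(spread_ratio(xs, even), 2))
print("fanned spread ratio:", round(spread_ratio(xs, fanned), 2))
```

Plotting `residuals(xs, ys)` against `xs` gives the residual graph; the ratio just puts a rough number on whether the points fan out.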

It would be nice to see scatterplots for this type of data. Then people could decide whether it is worth their time to read the rest of the post. And people who might want to add their own post to the thread could have a handle on the validity of the data before they make any comments.

Thank you in advance for adding scatterplots to any posts with linear regressions or correlations—when possible.

For those who do not wish to share the information about their ports or sims that an Excel linear regression reveals, a graph of the residuals would be just as good. Maybe better.

In the end, people can post whatever they want but, personally, I won’t be reading many posts about regressions and correlations without a scatterplot. Those with a lot of free time who cannot find something productive to do may wish to continue reading. And this can be fun or interesting: when it has any meaning at all.

I think linear regressions have a lot of assumptions (and therefore a lot of problems). In truth those assumptions are almost never met. One can argue forever about what you may be able to get away with. But with a scatterplot you can make your own personal judgement and not have to seem contrary by asking the original poster, in the thread, about the validity of his or her data.

I have enough concerns with regressions that at the moment I don’t use them for much more than slippage—which, at least, is not a time series and has some independence from day to day. And you have to do something: ANOVA, buckets or something if you are looking at (Q/V). Even there I am paying close attention to the residuals for my own data. And I will throw the data away (or use a different statistical method) if it is not linear, or the variances are not constant or………

Again, thank you in advance.

-Jim

Can you provide an annotated example of what you are asking for? I think I know but am not exactly sure.

David,

Here are the scatterplots for almost all of my larger window trades for this year as of Friday (the smaller trades just add noise to the data). The ones not included were days with unusual volumes and I do not know whether they affect the data. This is for (Q/V)^0.6 which for this data had slightly better residuals than (Q/V)^0.5. I will give you the link to my post on this if you have any interest in the power used.
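One simple way to compare two exponents like these, sketched in Python with made-up numbers. Here `ratios` stands in for Q/V and `slippage` for the measured cost; both are synthetic, the function names are hypothetical, and comparing R-squared is only one rough proxy for "better residuals". It is not the procedure actually used for the plots in this post.

```python
import random
import statistics

def r_squared(xs, ys):
    """R-squared of a straight-line least squares fit of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def compare_powers(ratios, slippage, powers=(0.5, 0.6)):
    """Fit slippage against ratio**p for each p and report R-squared."""
    return {p: r_squared([q ** p for q in ratios], slippage) for p in powers}

# Synthetic example data (assumption: slippage roughly follows a power of Q/V).
random.seed(1)
ratios = [random.uniform(0.001, 0.2) for _ in range(300)]
slippage = [0.03 * q ** 0.6 + random.gauss(0, 0.002) for q in ratios]

for power, r2 in compare_powers(ratios, slippage).items():
    print(f"(Q/V)^{power}: R-squared = {r2:.4f}")
```

With noisy data the two exponents can score almost identically, so the residual plot for each is still worth a look before preferring one.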

This is the only active data that I have to share, as an example. Like I said, I do not use linear regressions for much else (I have played with a couple of things in R but nothing serious). I don’t want to get in a debate about trading techniques so I cut off the left side of the linear regression: email me if you have any interest in information specific to window trades. I’ll send you a different image.

You can make your own judgement about the residuals and whether I should be doing linear regressions on this data at all. I would be interested in your opinion if you have one.

But to consider this data useful you should, at a minimum, think it looks kind of linear. And that the residuals (at the right) are kind of evenly spaced—throughout—around the horizontal line.

As an aside: you can almost imagine a line of blue dots trailing diagonally (down and to the right) in a linear fashion in the residuals. These are nickel spread stocks that happened to have zero slippage for the trade (much more on other days/trades).

Edit: I added the R and p-value in the second image for those who are interested. The lower p-value is for the slope of the regression line and the upper p-value is for the Y-intercept as per usual for Excel.

And as another aside: there are 2 pretty large outliers. I have a good R and good p-value without removing them. So why would I want to remove them? And I double and triple check any outliers greater than 2%, so they are real. Removing them would just mislead me about my total trading costs: they are large enough to impact this calculation. I think I will just keep all of my outliers and avoid fooling myself with tailored data, assuming I don’t get a slippage of -50% because of a flash crash or something.

Best regards,

-Jim



Every time you do a simulation or a screen, P123 gives you linear regression data: alpha (intercept) and beta (slope). Here are scatterplots of the monthly returns of P123’s ONeil and Prudent Yield Hog screens. From these you can see that the monthly alpha and beta of the ONeil screen are both higher than those of the Prudent Yield Hog screen. Is this what you’re looking for? If so, why?

My correlation studies consist of tables of over 3,000 points of data arranged in sixty to eighty rows. It’s impossible to produce a scatterplot picture of that.


[Attached image: ONeil screen.png]

[Attached image: Prudent Yield Hog screen.png]

Thanks Yuval.

It speaks to the quality of the data. And I have already said a lot about the potential problems with linear regressions and correlations. I doubt that anyone cares about any more opinions that I may have on this: anyone with any deep interest in this already knows more than what I have posted. I do think that those who know statistics will be interested in the scatterplots. I know for sure that I am.

I appreciate the opportunity to draw any conclusions I may have from these—without imposing my opinion on you. I did not post further in your thread as I think we have a different understanding on some of this and there is no obligation for you to be interested in what I have to say.

I appreciate it a lot, in fact.

You are using this data?

Best regards,

Jim

Jim, thanks for the explanation. I understand a little better now what you were looking for. I can see why your request for a simple scatter plot provides a quick way to assess how dispersed the data is. I appreciate your comments on what to look for.

I know the basics of statistical analysis, but frankly, not to the level you and some others do. I am not sure I can add much to your discussions. I have read books and taken college level classes on probability and stat. (I just finished one. It was fun and enlightening).

I don’t worry about slippage since I trade larger cap ports that are low turnover.
I have done some rudimentary statistical analysis of my port simulations to look at returns, though. All of my ports seem to have some kind of abnormality either in their return distributions or in relation to their benchmark (excess returns). This always causes me concern but I am not sure what to do about it or think about it.

My biggest concern is whether the variance of annual historical results from my model sims is such that I can have little or great confidence in seeing similar returns in the near (less than 10 years) future.

I would find discussions about statistically analyzing models for future returns to be the most interesting. For instance, is the use of the Information Ratio useful and somewhat predictive (due to momentum maybe). This has been in some other threads. I found those discussions useful.

I use rolling tests now and evenid to show ‘variance’. I don’t use stddev or IQR or whatever on those tests, but just running those tests and seeing the dispersion of normalized results gives me some feel for how dispersed my returns can be. I also do a worst-case analysis by varying my buy/sell rules.
The one type of analysis that I don’t do, which I know might be very useful, is how my models work over different economic climates. That is, how does my models’ performance vary in different interest rate, inflation rate, GDP growth, sector significance, etc. environments? I wish I could run variance analysis for these different regimes (time frames). Some would say that the recent past is usually a good predictor of the near future, so there is no need to do that (i.e., use the last 5 years to see the next 5). That makes some sense to me as well. Maybe my fears of a major regime change are unfounded.

Anyway, I enjoy reading your threads and learning from them. Keep them up.

David,

Purely with regard to statistics, my advice would be to stick with this where possible. It is my belief that almost anyone can address any legitimate complaint that one might have with statistics and still use this. It is not much different than the Sharpe Ratio, which is almost universal.

For example, if someone says stock market data is not normally distributed you will have a reasonable answer for this. And it is pretty easy in Excel.

This is in the weeds and generally not necessary to know. But it automatically detrends and you are normally using “differenced” data. This makes the data “stationary.” There is no assumption of linearity. The benchmark and your sim can have different variances etc.

If someone you are talking to thinks the Sharpe Ratio carries any information then they should be willing to listen to you on this. Of course, there are all sorts of ideas on P123. But if you are talking to someone who thinks the Sharpe Ratio carries any information you will have a common basis of understanding.

Some of the techniques you discuss later in the thread I put in the category of avoiding overfitting. I do not claim to be able to add much in this area. Denny and others have great posts on this. It is important, I just cannot add much and I certainly have no criticisms with any of it.

Best regards,

Jim

cool. thanks

Indeed, I am. The ideas of alpha (intercept) and beta (slope) are so fundamental to my philosophy of investing that I can’t imagine ignoring them.

Scatterplot graphs can be quite misleading. For example, in those graphs I posted are 2,500 dots. But what you see when you look at them is only a few hundred of those. The bulk of them are obscured in the middle of the ellipse. The outliers have a relatively insignificant role in generating the regression line.
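The overplotting point can be made concrete by counting points per grid cell instead of drawing dots. This is a minimal Python sketch on synthetic returns; the numbers, the cell width, and the `bin_counts` helper are all assumptions for illustration, not the data behind the graphs posted above.

```python
import random
from collections import Counter

def bin_counts(xs, ys, width):
    """Count how many points fall in each width-by-width grid cell."""
    return Counter((int(x // width), int(y // width)) for x, y in zip(xs, ys))

# Synthetic benchmark and portfolio returns in percent (made up for this sketch).
random.seed(2)
n = 2500
bench = [random.gauss(0.8, 4.0) for _ in range(n)]
port = [0.3 + 1.1 * b + random.gauss(0, 3.0) for b in bench]

cells = bin_counts(bench, port, width=1.0)
busiest_cell, busiest_count = cells.most_common(1)[0]
singles = sum(1 for count in cells.values() if count == 1)

print("points plotted:              ", n)
print("occupied cells:              ", len(cells))
print("busiest cell and its count:  ", busiest_cell, busiest_count)
print("cells holding a single point:", singles)
```

If the occupied cells are far fewer than the points, most dots in the middle are hidden behind others. A density plot or semi-transparent markers shows the same scatter without that problem.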

In my correlation studies I’m trying to process about 200,000 of these data points. And when I’m developing a strategy to maximize alpha, I’m processing even more. It’s a messy business, but linear regression offers me a way to deal with it all.

As Marc Gerstein pointed out a while back, the absolute return of a strategy is less important to evaluating that strategy than its return compared with the market or with a benchmark. Statistical analysis is the only way to make that comparison, even if it’s something as basic as average excess returns. A paired t-test gives you the information ratio; a linear regression gives you alpha and beta; and I’m sure there are other ways to study that relationship. I like alpha and beta a lot because I strongly believe that switching to a low-beta strategy during a market downturn will save my skin, and sticking to a high-alpha strategy at all times will make me money. I trust those two numbers more than anything else in this messy business.
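To make the relationship between those numbers concrete, here is a minimal Python sketch. The returns are synthetic and the function names are hypothetical. The information ratio here is the plain per-period version (mean excess return over its standard deviation, not annualized), which is an assumption about the definition being used. Under that definition the paired t statistic is the information ratio times the square root of the number of periods.

```python
import math
import random
import statistics

def alpha_beta(port, bench):
    """Least squares fit of port = alpha + beta * bench."""
    mb, mp = statistics.fmean(bench), statistics.fmean(port)
    sxx = sum((b - mb) ** 2 for b in bench)
    sxy = sum((b - mb) * (p - mp) for b, p in zip(bench, port))
    beta = sxy / sxx
    return mp - beta * mb, beta

def information_ratio(port, bench):
    """Mean excess return divided by its sample standard deviation."""
    excess = [p - b for p, b in zip(port, bench)]
    return statistics.fmean(excess) / statistics.stdev(excess)

def paired_t(port, bench):
    """Paired t statistic for port versus bench: mean(d) / (sd(d) / sqrt(n))."""
    return information_ratio(port, bench) * math.sqrt(len(port))

# Synthetic monthly returns in percent (made up for this sketch).
random.seed(3)
bench = [random.gauss(0.7, 4.0) for _ in range(120)]
port = [0.4 + 0.9 * b + random.gauss(0, 2.0) for b in bench]

alpha, beta = alpha_beta(port, bench)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
print(f"information ratio = {information_ratio(port, bench):.3f}")
print(f"paired t = {paired_t(port, bench):.3f}")
```

The regression treats the benchmark as the x variable, so alpha and beta carry the linearity and constant-variance assumptions discussed earlier in the thread. The paired test only looks at the differences.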

Now it may very well be the case that I’m not supposed to do linear regression analysis on this kind of data, even though it’s been done over and over again for fifty years and everyone who studies the market refers to alpha and beta. After all, the data doesn’t look at all like a straight line to the naked eye, and the data is, as I understand the word, non-parametric. It’s possible that a different sort of regression would give us better figures for alpha and beta, and I’m all in favor of experimentation along those lines.

I have absolutely no problem with any of that.

Again, I appreciate the opportunity to look at the scatter plots.

And if you want to talk about the weather, its predictability and randomness, every time I discuss statistical data, that is fine too. None of this bothers me. I will post as long as I can learn and people like David have an interest.

It is raining here if that is where you were headed there again.

BTW, did you ever get sorted out what a p-value is? I still have to believe you already knew that. What was that about, really? A distraction from whatever point I might have been making? It can’t be possible that you have this strong of an opinion and you just learned what a p-value is.

I do think it is about time for Marc to post so we can see if he agrees with either one of us on anything. I’m certainly not going to pretend that he agrees with me, except to the extent that he would prefer that I use no statistics at all. Are you sure he likes linear regressions (that may not be linear) a lot? Maybe. I have seen people twist themselves into all sorts of positions. But I think he is an honest straight shooter who does not like statistics.

I am coming around to his position: not completely but moving there.

I would love to respond to anything Marc says, if it is appropriate. Otherwise, why don’t you take over the conversation from here: which I think is giving you what you want.

I know no one else cares about this. And I do understand the feeling that there is something magical in statistics. And I think there is actually. I just think I am not the first, or the best and there are limits.

Jim

No, I still don’t know what a p-value is. Wait, let me look it up. Now I know.

I have absolutely NO background in statistics. About half the stuff you’ve written about baffles me until I read up about it. I like it, and I want to study more. Thanks for pushing me in that direction.

Yuval,

Go back and, if you can find them, read some of my posts advocating z-values and using those for linear regressions to get rank positions. Actually, I know you have better things to do, but I think you will get my point.

I do not think it was a bad idea really. And it even gave ranking systems very similar to P123’s rank systems. But in the end not quite as good.

And I tried to use percentiles in linear regressions. I will let someone else address this. I think maybe you can do this but they may have something to say about nominal variables. Maybe they will say something about dummy variables. But that is not the point. The way I was doing it was just wrong, I think.

My attempts at ANOVA were not much better, I think. Not a bad idea but I really cannot believe how different the variances were for one of my ranking systems. There was no getting around the fact that it just was not appropriate.

But back to the Z-scores and multivariate regressions for ranking: some of the people at P123 had the patience to teach me some stuff about linear regressions. I am still learning. But more importantly they showed me some of my limitations: not that I’m not still an arrogant D**k sometimes.

But I know what it feels like to have an idea that you know is right. And like I said, the idea was not bad. And in the end I expect you to develop your ideas further and continue to share them with us. Or move to something else: it is your call, of course. Whatever direction you head, I know we will all learn something from it. And I think I said this before: I really did not understand alpha or beta until your posts on this.

Thanks.

Jim