I get different simulation results with the exact same system / timeframe

Hi P123,

I built a new trading system about 2 weeks ago. I tested the system over and over and got the same results for about 1.5 weeks.

Now, when I run the exact same system with the exact same timeframe, I get a big difference in results (e.g. the system picks different stocks).

The Systems are:

  1. https://www.portfolio123.com/port_summary.jsp?portid=1625641 (simulated and converted today to a portfolio, copied from 2))
  2. https://www.portfolio123.com/port_summary.jsp?portid=1624786 (simulated and converted 1-2 weeks ago to a portfolio)

I checked via compare function: 100% the same.
I checked the ranking system, it has not been changed for > 2 Months.

Please help, thank you

Best Regards

Andreas

update:

If I take the original sim → https://www.portfolio123.com/port_summary.jsp?portid=1624924
and copy it and run it today with the same timeframe and the exact same settings, the results differ as well → https://www.portfolio123.com/port_summary.jsp?portid=1624924

The same happens with other systems that I updated around July and have not touched since then (though the difference is not as big).

I checked without PIT (PIT Method - Prelim = off): if I disable PIT, the result matches the lower performance of the 2nd (later simulated) system almost 100%. This might help you… maybe PIT functionality is disabled???

I think these things will happen more often with FactSet. They are not a true point-in-time database. For example, they only have one version of a restatement. If a company restates, only the most recent restatement is kept, and this will alter your backtests. Another thing they do is back-fill older data for a variety of reasons. We did investigate a few cases, and it was a good thing they were doing, but it does alter the past a bit.

It does not make backtesting useless by any means, if done properly. This is something we’ll have to live with more with FactSet than with Compustat…

Wow, I am quite surprised! The differences are pretty hefty: 5% ann.!

Well then, how to backtest the right way?

Should I use the great backtest from two weeks ago or should I retire it because now it’s not so good?

It would be super if P123 would document exactly how far back FactSet updates and how often on average.

From what I saw in the backtest, they might go back pretty far to update.

Also, why does P123 not simply freeze the database at Today-1 and only update Today-0?

PIT is the heart of P123 and the biggest differentiator to zacks.com and Co.

Please help on how to backtest the right way.

Please also restate the conditions for the subscription to S&P data. If this is not fixed for me (by changing the algo of P123 or by getting an understanding of how to backtest the right way), I am interested in the S&P data.

Sorry, but this might be a game changer!

Thank you

Andreas

Also, why has this not been publicly stressed by P123? From what I read on the board, so far it seemed not to be a big deal.

  • Why are we not educated proactively on how to backtest the right way with FactSet, what is best practice and what is not? Gee, we have money on the line here!

My understanding was that there is a difference during the transition phase, but that PIT is (not 100%, but almost) restored after the transition phase.
If backtests swing like that in two weeks, I cannot see how P123 still has PIT.

Am I the only one with this problem? Where am I wrong?

Best Regards

Andreas

I too would be interested in knowing the correct way to backtest.

For one “correct way to backtest,” see https://blog.portfolio123.com/break-your-strategy-how-to-stress-test-your-quantitative-models/ This is just my view, not that of everyone here. There isn’t one “correct way to backtest,” but I hope that post might help you find some more certainty in a very uncertain world. If not, well, I tried.

The problem you’re having is not unique to FactSet. Before the transition I definitely saw differences in results for the exact same simulation run on different dates. Even P.I.T. data isn’t completely static, though it tries to be. With a portfolio with only 10 holdings and low turnover, just buying a few different stocks on a few different dates will make a huge difference in returns.

As Marco pointed out, we are no longer completely P.I.T. It’s simply impossible with the way FactSet provides us with their data. We are as close to P.I.T. as we are able to be.

If you’re interested in getting a license from Compustat and willing to spend a few thousand dollars per year, just e-mail me or Marco and we’ll put you in touch with someone there.

  • The problem is much worse with FactSet. I have done thousands of simulations each year since 2012 and my differences on Compustat were never bigger than 1%.

  • You either have pit or you do not have PIT, this is black or white and its a question of a value and culture of a data provider, that gives the impression of PIT Data.

  • And it would be easy to reach if you simply make “today - 1 read only” on the database. If FactSet tries to update the past, discard the data and only write to the database for today - 0. Every document management system does that, otherwise it will not be accepted by auditors, and every DMS vendor has to get a certification that today - 1 technically cannot be updated (read-only memory).

  • If you do not have PIT, your back test can be worthless (if you do not know how, how often, and on which line items updates are made, and if you do not know which version is the more conservative one), because you back test with hindsight. That is exactly what a systematic trader needs to avoid at all costs (and what every auditor of financial data would point out!). That the whole industry does it wrong does not make it right! And if companies change the past, they have to make it public (which usually lets the stock drop up to 50% or more!).

As long as you do not implement “today - 1 read only” (maybe in a separate database, so the user can choose), PLEASE STOP giving the impression that P123 has PIT, and point out that the only PIT available right now is running ports that build history over time (which is a strong feature!!!).
That would be an open and fair treatment of the subject; everything else is sucking in customers who think they trade on PIT, but do not.

BUT let’s look at the bright side:

  • Ports that are running from week to week are PIT (if you do not change them manually, of course), so that is a great feature!

  • Furthermore, my frozen sims from the past are better (sometimes over 5% ann.) than the recent sims, which means that the recent sim is more conservative and that your realtime results will be better. So far, I have spotted that strong tendency with FactSet; I had no recent sim that was better than the historical one.

This makes sense in terms of what Marco stated: if preliminaries are not PIT and get overwritten, I guess that the original preliminary is more optimistic (that would be normal behavior of a CFO covering his or her position) and the more recent one is less optimistic. This would be a good thing, because the market (that’s my assumption outside of bear markets) is trading on the (optimistic) data on hand. So for shorter time frame (weekly rebalance) systems, that tendency could be exploited! (If my assumption is right.)

  • Also with FactSet, my sims do not decay, they get better over time. With Compustat my results from 2000 - 2010 were much better than from 2010 - 2020; that is the other way around now for many sims I have. That hints that the sim of today is more conservative, WHICH would be a great thing (not as good as PIT, but at least not a trap!).

Please P123 confirm if you have the same impression (a good guess is enough for me).
As long as “Today - 1 = Read only” is not implemented, we need the following:

  • Give us extensive information about FactSet’s process for updating “today - 1” data in general and how P123 is handling those updates.

  • State exactly which line items are updated “today - 1” by FactSet and how often this has happened or happens in general. Put a counter on each stock for “today - 1” updates and show it in the table of the port positions. Give a possibility to drill down to the changes.

  • Consider implementing real pit by “Today - 1 = read only”, that would be the best solution and it’s not too hard to implement, just put the updates 1:n under the line item, so you got the history data protected and let the sims run on true historical data.

  • Stop giving the impression, P123 has PIT until “today - 1 = Read only”
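
To make the “put the updates 1:n under the line item” idea concrete, here is a minimal sketch of an append-only history with an “as of” lookup. It is only an illustration under my own assumptions: the class name `LineItemHistory`, its methods, and the EPS numbers are all made up, and nothing here describes how P123 or FactSet actually store data.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LineItemHistory:
    """All versions of one line item for one fiscal period (hypothetical)."""
    known_on: list = field(default_factory=list)  # date each version arrived, ascending
    values: list = field(default_factory=list)    # value stored with the matching date

    def record(self, known_on: date, value: float) -> None:
        # Append only: a later restatement never overwrites an earlier version.
        if self.known_on and known_on < self.known_on[-1]:
            raise ValueError("versions must be recorded in date order")
        self.known_on.append(known_on)
        self.values.append(value)

    def as_of(self, when: date):
        # Latest version that was already known on `when`, or None if none was.
        i = bisect_right(self.known_on, when)
        return self.values[i - 1] if i else None


eps = LineItemHistory()
eps.record(date(2020, 2, 10), 1.25)  # preliminary figure (made-up number)
eps.record(date(2020, 3, 5), 1.10)   # restated figure (made-up number)

before = eps.as_of(date(2020, 2, 20))   # a sim date between the two versions
after = eps.as_of(date(2020, 4, 1))     # a sim date after the restatement
unknown = eps.as_of(date(2020, 1, 1))   # a sim date before anything was reported

print(before, after, unknown)
```

With a layout like this, a sim dated between the two versions would always read the preliminary figure, no matter when the sim is rerun, while the latest figure stays available for anyone who wants the corrected view.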

Sorry to be that harsh and opinionated, I have been a big evangelist for P123 since 2012 and I understand it’s a family and friends business with limited resources and great price / value, but this matter goes to the heart of a systematic trader tool, PIT is the most important feature to be protected by p123!

Best Regards

Andreas

here is the difference between the sim 2 Weeks ago and the one of today. Just so that everybody knows what I am talking about.


sim two weeks ago (which got transitioned to a port)


Andreas and Yuval -

Many years ago, I wrestled with the same issues you are experiencing today, Andreas. A backtest that I created performed exceptionally well. However, when I reviewed its performance and saw the OOS results a few months later, the performance was significantly worse, without making any changes, just hitting ‘rerun.’ I was astonished that results could change so radically, seemingly always for the worse, without making a single change to the Sim.

At the time, I made a post similar to yours in which I sought out reasons for this confidence-shaking lack of consistency. Several members chimed in – confirming they had the same experience as me, and their confidence was equally shaken. However, now I believe that blaming Portfolio123 for the problem was misguided.

I discovered that substantial performance differences could occur in backtests if there is just a slight change near the beginning of the run. A 20-year backtest, with a (seemingly) inconsequential change of a few dollars in the entry price of an equity or ETF, could result in an enormous swing in total performance and a significant Annual Return difference of 10% to 30% by the end of each year.

Einstein reportedly called compound interest “The greatest force in the universe,” with its phenomenal results based on the amount of time under consideration. Many call compounding ‘magic’ because compound interest involves earning profits on the principal plus the profits added during the previous period (and the period before that, and the period before that, etc.). Unlike most things in life, with compounding, time is our best friend.

Of course, this principle also works in the opposite direction when we compound losses. An insignificant, share-price difference of a dollar or three at the start of a 20-year investment of $100,000 will result in an enormous decline in results.

This principle is the well-known-in-quant-circles ‘Butterfly Effect,’ which is the sensitive dependence on initial conditions. A minute change in one state of a deterministic nonlinear system can result in large differences in a later state.

For this reason, I always use a wide range of starting dates as one of many tests for robustness. A (for example) three-day difference in the start date of a backtest can easily mean a (for example) plus/minus $3 or more difference in the share price of a stock or ETF purchased. That minuscule $3 difference in share price (for example, 2,000 shares of SPY at $47 vs. 2,000 shares of SPY at $50) compounds to become a difference of $1,140,297 over 20 years at an average Annualized Return of 30%; i.e., $19,004,964 becomes $17,864,666 at the terminal date because of that $3 difference in the initial state.
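
The arithmetic in the SPY example can be checked with a short sketch. It assumes a single purchase of 2,000 shares, a constant 30% return compounded annually for 20 years, and no costs or dividends; the function name `terminal_value` is just for illustration.

```python
def terminal_value(initial: float, annual_return: float, years: int) -> float:
    """Value of a lump sum compounded once per year at a constant rate."""
    return initial * (1.0 + annual_return) ** years


shares = 2_000
high = terminal_value(shares * 50.0, 0.30, 20)  # start value: $100,000
low = terminal_value(shares * 47.0, 0.30, 20)   # start value: $94,000
gap = high - low
relative_gap = gap / high

print(f"high: ${high:,.0f}  low: ${low:,.0f}  gap: ${gap:,.0f}  ({relative_gap:.1%})")
```

In this simplified setup the dollar gap grows with the account, while the gap as a share of the terminal value stays at the initial 6%.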

Moreover, a difference in ‘Point-in-Time’ (PIT) data with FactSet could result in a price difference in the initial state that will compound over the years to become a substantial difference at the terminal date. Still, we don’t have enough information to know how to handle this potentiality. I agree with Andreas that we need to have the PIT issue with FactSet explained in detail.

I agree 100% with Andreas, hoping that P123 will provide us with the details of this PIT discrepancy—assuming there is a difference. At this point, it isn’t a deal killer for me because my OOS results continue to match in-sample testing. However, users need to know the details of this potential difference to adjust and accommodate a discrepancy between the prior PIT data and FactSet PIT data.

@ETF Optimize:

We are talking about 2 different things. I do not compare sims that have a different starting date, my starting dates are exactly the same.

PIT I:
If you have true PIT, e.g. “today - 1 = Read Only” (on the database), there is 0.0000000000% difference between a back test performed today and a back test performed 10 years ago for the same time frame (non OOS, only the exact same timeframe), i.e. no butterfly effect.
You can still have algo changes (e.g. you found a bug and corrected it); this can lead to a butterfly effect and there is no good solution for it. But on the database, “today - 1 = Read Only” is a must (maybe except for obvious bugs).

PIT II: Even if you simulated 10 years ago and saved it to a portfolio, and then you take this portfolio 10 years later and back test the OOS time, the results should match. PIT II can only be the same if the port behaves exactly the same way. But it varies with the slippage calculation of the sims. This is even harder to implement and I do not expect P123 to reach that.

My point is, if you back test, there should be as few moving parts as possible and almost 0 moving parts in the data. Otherwise you shoot for something that is constantly moving and you do not know how it is moving. When I hear that the new data vendor moves those parts around by default, e.g. the prelims get overwritten regularly, THAT IS NOT A GOOD concept!

My short-term solution is the following and I leave it there:

  • from my experience there is no true PIT vendor on the market, but with the current non-PIT situation at P123, others might be as good or better, so I will research! I doubt that any vendor is as good as P123, even now, so I do not want to complain, but I felt that this “less PIT situation” needed to be pointed out!

  • try to find out as much as possible about the data saving algos of P123, so that I know what parts are moving

  • stick to my stuff that worked the last 10 Years and where I have live ports (that are as close to PIT as possible).

  • making sure that my strats make sense conceptually and the concepts I use have been confirmed by the science community

  • try to convince P123 to implement “Today - 1 = Read only”.

All these points mean that I am going to be more of a discretionary trader than a system trader (kind of a mix), or find something more true to PIT, or convince P123. That seems to be the way P123 goes, when I read the posts from Marc (not Marco).

I understand that there is no true PIT, but my assumption is that P123 was much closer to PIT with Compustat, and I think there is an argument that it could be possible to get closer to the PIT we had. I think that was (and still is, just less so) one of the best selling points of P123. P123 decides!

Best Regards

Andreas

I don’t find meaningful differences between sims run on Compustat and new FactSet data.

One of my old sims 01/02/00 - 03/18/17 with stocks from S&P500 and Compustat data shows an annualized return of 20.81%.
Re-running with FactSet data (PIT Method - PrelimExclude) the return becomes 21.52%. With PIT Method - PrelimUse the return shown is 21.40%.

Perhaps models using S&P500 large caps are less impacted by the change of data vendor.

All,

PIT is a good thing. More PIT is always better than less PIT. Andreas certainly has a point. And no arguing with it.

But what is a backtest for? Not useful for predicting out-of-sample returns. That is for sure.

Look at the below, somewhat randomly selected, Designer Model for proof. I am pretty sure the backtest had better returns. In fact, it is obvious that every model this Designer created is not performing as well as the backtest did.

This is generally true of all backtests and not a reflection on the Designer. But I think I can make the point that backtests are NOT useful for predicting out-of-sample returns with more examples.

Then what is a backtest for? It COULD be helpful for picking a model (out of many models) along with other considerations. Useful with a lot of other methods, IMHO. Marc is an example of a person who found the usefulness of backtests to be limited. I think he is right about that much at least.

So again, Andreas has a concern. But the only important question is: Would you have chosen a different model?

I suspect ETFOptimize also has a point (as does Andreas): that the changes to the sim are somewhat chaotic, and that a careful look using the buckets, larger sims, statistical tests, discretion, “domain knowledge”, etc. (whatever one prefers) would not have led to a change (or error) in which model to choose.

I suspect the level of information leakage (not PIT-ness) at P123 would rarely lead me to the wrong model. But admittedly, I have no proof of this—other than the worst example we have of this shows just a 5% difference and just for the sim.

For me, it is still an open question as to how efficient the market is and how much we are fooling ourselves with all of our backtests. I think I will not speculate here.

But if we are fooling ourselves with backtests, we only have ourselves to blame. It would not be a reflection on P123.

So, Andreas is not wrong. We should try to make things as PIT as possible. But I am not ready to spend $20,000 a year on Compustat and CapitalIQ. Not that I would try to stop someone else from doing so.

BTW, CapitalIQ is not perfectly PIT either.

Best,

Jim


Jim, you are right, that is my conclusion as well: would I trade another model if I am convinced by the construction of it? I am, so I will trade it.

Today I tested my top 30 models. Only one later sim was much better than the historical one; most were less good, so my thesis that later simulated sims are conservative is kind of confirmed.

Also, all cap curves still looked very tradable.

Best Regards

Andreas

Hi all,
I am with Andreas on this one. I have noticed the same large differences as he has with FactSet that I did not see to the same extent with Compustat. Since the changeover I thought I was the only one experiencing this, and I have combed over my sims and outputs hundreds of times to see if I was doing something wrong.
PIT has to be as close as possible to PIT or our backtesting work becomes close to meaningless.
Don’t get me wrong, P123 has helped me significantly in my pursuit of Alpha and I am grateful, but this is not small potatoes. This is at the core of what a lot of us do.
What is wrong with his suggestion of “today-1= read only”? Am I missing something? Otherwise it is hard to have confidence in any result with backtesting. This could be a deal breaker for me.

Like Brett and Andreas say, more PIT is ALWAYS better.

Is it time that P123 look at the option of PIT FactSet earnings estimates?

I have brought this up many times and will not bring it up again as Marco (finally) understood that this option exists and I assume he remembers.

I agree that it probably is not for P123–possibly for price considerations but we do not actually know the price of FactSet’s PIT earnings estimates. But why wouldn’t one look into this and get back to us?

Nothing wrong with looking into the option of a lag either (if I understand Brett’s post). I think I will stick with P123 no matter how this evolves.

But P123 needs to respond to posts seriously.

BTW, Quantopian lags its earnings estimates data (from FactSet), so Brett’s idea is not radical in any way.
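
As a rough illustration of what a lag means for a backtest, here is a minimal sketch. It only shows the general idea under my own assumptions: the function `visible_estimates`, the record layout, the ticker, and the numbers are all hypothetical and do not describe how Quantopian or P123 actually implement anything.

```python
from datetime import date, timedelta


def visible_estimates(estimates, as_of: date, lag_days: int):
    """Return the estimates a backtest may use on `as_of`, given a lag in days."""
    cutoff = as_of - timedelta(days=lag_days)
    # An estimate becomes usable only once it is at least `lag_days` old.
    return [e for e in estimates if e["published"] <= cutoff]


# Made-up records for a made-up ticker.
estimates = [
    {"ticker": "AAA", "published": date(2020, 6, 1), "eps": 2.0},
    {"ticker": "AAA", "published": date(2020, 6, 10), "eps": 2.2},
]

no_lag = visible_estimates(estimates, date(2020, 6, 12), lag_days=0)
with_lag = visible_estimates(estimates, date(2020, 6, 12), lag_days=7)

print(len(no_lag), len(with_lag))
```

The trade-off is that the sim gives up a few days of fresh data in exchange for being less exposed to values that were revised shortly after they first appeared.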

Best,

Jim

If we were to implement “today - 1 = read only” we’d never be able to fix any bugs! And lord knows, we’re always finding bugs . . .

PIT is one thing. Correct data is another. Which is more important? If FactSet or Compustat corrects their data, should we ignore that? If a company corrects their data, should we ignore that? If we find out that we’ve been calculating IntExpTTM so that the results had a lot more zeros than they should have had, should we ignore that and only correct the data going forward?

We’re trying to be as PIT as possible. It’s always our goal, and we spend a huge amount of time trying to accomplish it, often stumbling over large roadblocks on the way. But we have to balance our concern for presenting correct data and fixing bugs with our concern for being PIT. Maybe the balance that we’ve found isn’t satisfactory to some of you. Personally, I would be happier to rerun a backtest, get different results, and know that the results were different because the data was now more accurate, than to rerun a backtest and get the same results, knowing that there was something wrong in the data the first time.

Yuval,

Some do. It is called a “SnapShot”, which P123 has done before, for the purpose of keeping things PIT.

As Compustat does. And Quantopian lags some of its data. I actually defer to P123 as to what is best on the lag. But it is not an idea that should be dismissed out-of-hand.

I get that you probably CANNOT do this with FactSet data but you should ignore changes in data used for backtests when you can.

I think P123 is probably doing the best that it can with the data. I actually have no complaints.

But the discussion should be somewhat factual. Brett and Andreas have valid points that do not need to be dismissed with every single argument one can think of for not doing anything.

Best,

Jim