Upcoming rebuilt USA data & Canada

Dear All,

This post is only about non-financial-statement data: estimates, insider, and institutional data.

As explained in this post, we have rebuilt all the USA data for the upcoming Canadian data. Compustat/CapitalIQ allows this, and it’s how we populated point-in-time values back in July 2012 when we switched to Compustat. After the switch we built the historical weekend data by snapshotting current values.

When we rebuilt USA for the Canadian project, some slight differences showed up. The image below shows the kind of differences by comparing rebuilt USA values with the current snapshotted values. Here’s what we’ve found that could be causing them:

  • Starting in July 2012 we were snapshotting on Saturday. When we found out that quite a few updates were coming in “late” on Sunday morning, we started snapshotting on Sunday.

  • For certain seemingly startling differences, like the # of analysts for the CurrY for IBM going from 28->23 or the CurFYEPSMean changing on 10/19/13, we asked Compustat. Their answer was:

8 of the brokers’ estimates were excluded from consensus because they were on a different basis. These 8 “Include Workforce Rebalancing Charge” while the consensus excludes them. This is the reason the analyst count decreased by 8 from the consensus. Estimate data, unlike financials, uses a “what the market knew” approach. Since the data comes in bits and pieces it is possible for current values to be updated.

In other words, taking a snapshot and re-building can generate different values. The # of analysts seems to have been a mistake, but it was only a Compustat mistake, not “what the market in general knew” (there are other data providers).
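To illustrate why a snapshot and a rebuild can disagree, here is a minimal Python sketch. The broker names, basis labels, and EPS numbers are all made up for illustration; they are not Compustat data or IBM's actual estimates. It only shows how dropping estimates that are on a different basis changes both the analyst count and the mean.

```python
from statistics import mean

def consensus(estimates, basis=None):
    """Return (analyst_count, mean_eps).

    estimates: list of (broker, basis, eps) tuples (hypothetical layout).
    If basis is given, only estimates on that basis are kept.
    """
    kept = [eps for _, est_basis, eps in estimates
            if basis is None or est_basis == basis]
    return len(kept), mean(kept)

# Hypothetical estimates: one broker includes a charge the others exclude.
estimates = [
    ("broker_1", "ex_charge", 4.00),
    ("broker_2", "ex_charge", 4.20),
    ("broker_3", "incl_charge", 3.50),
    ("broker_4", "ex_charge", 4.10),
]

# Consensus over everything received vs. consensus on a single basis.
all_count, all_mean = consensus(estimates)
same_basis_count, same_basis_mean = consensus(estimates, basis="ex_charge")

print(all_count, round(all_mean, 2))
print(same_basis_count, round(same_basis_mean, 2))
```

Which of the two corresponds to the snapshot and which to the rebuild depends on when the exclusion was applied; the sketch does not assume either.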

  • In the image below, the # of differences is greater after July 2012 than in prior years, like 10/10/2009. That’s because 2009 was built from scratch, as I mentioned before, not snapshotted. There are still differences because we bumped the point-in-time up to Sunday 3AM when recreating the data, to approximately match what we do live.

  • The rebuilt data also includes a few fixes we learned about in the past 2 years with Compustat. For example, their point-in-time active & inactive security list can be missing some fields in the past, resulting in a stock going missing and then re-appearing the next week. There’s a simple screen backtest that will show this problem. It mostly affects companies with no fundamentals, i.e. penny stocks, but I’ve spotted it with a few reasonable stocks. The new USA data corrects this.
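The missing-then-reappearing problem can be checked for mechanically. The sketch below is a hedged Python illustration, not our actual tooling: the function name, the input layout (a chronological list of weekly ticker sets), and the tickers are all assumptions. It reports weeks where a stock is absent between two weeks in which it is present.

```python
from collections import defaultdict

def find_membership_gaps(weekly_universe):
    """Find tickers that vanish and later reappear.

    weekly_universe: list of (week_label, set_of_tickers), oldest first.
    Returns {ticker: [week_labels where the ticker is missing even though
    it appears both earlier and later]}.
    """
    first_seen, last_seen = {}, {}
    for i, (_, tickers) in enumerate(weekly_universe):
        for t in tickers:
            first_seen.setdefault(t, i)  # index of first appearance
            last_seen[t] = i             # index of latest appearance

    gaps = defaultdict(list)
    for i, (week, tickers) in enumerate(weekly_universe):
        for t, start in first_seen.items():
            # Only weeks strictly inside the ticker's lifetime count as gaps.
            if start < i < last_seen[t] and t not in tickers:
                gaps[t].append(week)
    return dict(gaps)

# Hypothetical weekly universes: BBB drops out for one week, CCC simply ends.
weeks = [
    ("2013-01-05", {"AAA", "BBB", "CCC"}),
    ("2013-01-12", {"AAA", "CCC"}),
    ("2013-01-19", {"AAA", "BBB", "CCC"}),
    ("2013-01-26", {"AAA", "BBB"}),
]
gaps = find_membership_gaps(weeks)
print(gaps)
```

A stock that leaves for good (CCC here) is not reported, since only absences bracketed by appearances are treated as gaps.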

In conclusion, our analysis shows slight differences. There are pros & cons to keeping the current USA data or replacing it with the rebuilt one. We are inclined to replace it. Some of the values we’ve snapshotted do not even exist in Compustat anymore, since current values are adjusted as data comes in. We would also like to add data points, like EPS Median, but to do that we need to rebuild all the estimate numbers so they compare apples to apples. We do not feel that any of these variations should invalidate any robust system. It will backtest differently, but overall the change should not be noticeable.

The BETA server has Canada & the new USA. Please let us know what you think. We’re planning to launch and replace the USA data early next week.

NOTE: BETA is not updated daily. It has data as of this past Saturday. Do not use it for current screens or rebalances.

Thank You


Yes, it is better to have the new data points even if it means slight changes. Don’t launch Canada before providing price benchmarks / ETFs. Also, what is the issue with having Total Return benchmarks? All you need is S&P 500 TR, S&P TSX TR and maybe Russell 2000 TR to begin with. These three are a must-have.

Thanks

Marco, thanks. I’m all for accuracy improvements even if they change the backtests.

But did I get that correctly? Is the PIT weekly snapshot taken from Sunday’s data and not from the MON morning update? That might explain some of the differences that I am seeing between live ports and sims. Also, if so then the “Server Status” page is confusing. It doesn’t say anything about a Sunday update.

Thanks.

Wow, I think I just noticed Russell 1000, 2000 and 3000 are TR on the beta server! Now we are almost in business: just the S&P 500 and S&P TSX and we are all set with great benchmarks! :)

Marco, Thank you for being open and clear about these issues. My vote is to replace USA with rebuilt data. A suggestion: create a script that runs a top-to-bottom comparison and flags any data change from one build to the next. Then publish change counts to show where the concentrations are. For instance, you mention that the PIT active/inactive list shows firms disappearing/reappearing, and that this is concentrated in penny stocks. When you have time, it would be helpful to see a systematic analysis of this sort that covers any change in the rebuild. This way, users can better interpret any resulting impact on backtests/simulations and help monitor any data quality issues. All the best, D
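A minimal sketch of what such a comparison script could look like, in Python. Everything here is an assumption for illustration: each build is modeled as a dict keyed by (date, ticker, field), the tickers and values are invented, and "NumAnalysts" is a placeholder field name. It is not how the site's data is actually stored.

```python
from collections import Counter

def diff_builds(old_build, new_build):
    """Compare two builds keyed by (date, ticker, field).

    Returns a list of (key, old_value, new_value) for every data point
    that differs. None means the point is absent from that build.
    """
    changes = []
    for key in sorted(set(old_build) | set(new_build)):
        old_val = old_build.get(key)
        new_val = new_build.get(key)
        if old_val != new_val:
            changes.append((key, old_val, new_val))
    return changes

def change_counts(changes):
    """Tally changes per field and per date to show where they concentrate."""
    by_field = Counter(key[2] for key, _, _ in changes)
    by_date = Counter(key[0] for key, _, _ in changes)
    return by_field, by_date

# Hypothetical snapshot build vs. rebuilt build.
old = {
    ("2013-10-19", "AAA", "CurFYEPSMean"): 1.50,
    ("2013-10-19", "AAA", "NumAnalysts"): 20,
    ("2009-10-10", "AAA", "CurFYEPSMean"): 1.10,
}
new = {
    ("2013-10-19", "AAA", "CurFYEPSMean"): 1.52,
    ("2013-10-19", "AAA", "NumAnalysts"): 18,
    ("2009-10-10", "AAA", "CurFYEPSMean"): 1.10,
    ("2013-10-19", "BBB", "CurFYEPSMean"): 0.40,  # only in the rebuild
}

changes = diff_builds(old, new)
by_field, by_date = change_counts(changes)
print(by_field.most_common())
print(by_date.most_common())
```

The comparison is exact; for floating-point fields a real script would probably want a tolerance so that rounding noise is not counted as a change.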

I have also noticed differences in the universes. I have a simple universe that uses
Close(0)>1 AND AVGDailyTot(50)>200000 AND MktCap<2000 and Country("CHN")=False and Universe(OTC)=0
and on the normal server I get different numbers than on the beta server for the same date, and also, of course, slightly different ranks in my systems. I would not have expected these very basic variables to be affected, though.
If it is part of this, it would certainly be important, as Durandus said, to have a complete understanding of the changes.

iavanti, how can we reproduce this? Run a screen with your rules on which date? It should be easy to figure out why. Thanks

One quick note on our architecture: historically, some data points come from:

a) our snapshots (estimates, revisions)
b) on-the-fly calculations from data retrieved from the Compustat database (MktCap, PE)

Snapshots are necessary because it takes a whole day to query this data point-in-time (15Y * 52Wks * 10000 stocks * 100 data points = ~1 billion data points), which would make reloading the sim servers impossible. Our database is our fastest, most expensive machine, but we would need something 10x faster to load on the fly, maybe a $200K super-computer.
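For reference, the back-of-the-envelope count above works out as follows. This is just the arithmetic from the paragraph written in Python; the variable names are mine.

```python
# Rough size of a full point-in-time reload, using the figures quoted above.
years = 15
weeks_per_year = 52
stocks = 10_000
data_points_per_stock = 100

total = years * weeks_per_year * stocks * data_points_per_stock
print(f"{total:,}")  # 780,000,000
```

So the exact product is 780 million, which is the "~1 billion" order of magnitude quoted.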

There are drawbacks to each. Snapshots never change, but we could snapshot errors. If Compustat fixes something, the snapshots need to be rebuilt in order to pick up the change. Reloading point-in-time values causes simulations to change if Compustat fixed something (it happens all the time).

Our take on these variations in historical data and the different results of sim reruns is to view them as a positive. If a sim falls apart, it’s a sign of over-fitting.

Here’s an example of how a seemingly small error could cause huge changes. A few weeks ago a stock was no longer loaded by our servers. It turned out that the company went inactive and its sector classification was cleared out. Our servers do not load a stock if it’s not classified in a sector/industry.

Now, if your 5 stock sim was picking up this stock in 2001 and made 100% on the trade, and now it’s not, the ripple effects could appear to be huge.

I would also add that it’s human nature to continue tweaking a system until the best returns are achieved. Therefore any change is almost guaranteed to show worse returns.

Hi Marco,
I think my post was not clear. I meant that, with those restrictions, my universe on the normal server has a different size (number of stocks) than on the beta server. Even if you rebuilt the data from Compustat, I wasn’t expecting a different-size universe, as it uses very basic variables.
The date I used was 3/3/2014 in case it is relevant.

One of the stocks that appears on the beta server is Barnes Group (B); it is missing in production. That’s because MktCap is 2,028M in production and 1,999M in BETA.

BETA is not being updated daily; it’s using fundamentals from Saturday. (B) reported last week. I bet the # of shares changed on Sunday, as the filings are processed over the weekend.

BETA should not be used for current screens, just backtesting.
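The Barnes Group case is a plain threshold effect of the MktCap<2000 rule. A tiny Python sketch, using the two market-cap figures quoted above (the function name is hypothetical):

```python
def passes_mktcap_rule(mktcap_millions, limit=2000):
    """Mirror of the universe rule MktCap<2000 (values in $ millions)."""
    return mktcap_millions < limit

in_production = passes_mktcap_rule(2028)  # production MktCap for (B)
in_beta = passes_mktcap_rule(1999)        # BETA MktCap for (B)
print(in_production, in_beta)
```

A difference of about 1.4% in market cap is enough to move the stock in or out of the universe when it sits next to the cutoff.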

Thanks

Thanks!

Hi,
Thanks for this Canadian data. I just built a good screen using it. Is the data in a condition where you can invest real money in a screen that uses it? I mean, are there still notable changes coming to it? What about the sentiment data: short interest, insider buying and selling?
Yours Ari

Ari, we don’t have CAN short interest. You’ll see “USA only” in those factors and CAN stocks will return N/A. Insider info yes, but I have to check how far back it goes (you can do it too with a simple screen backtest) :)

Looks like insider for CAN is from ~2004 onwards; however, it’s very, very sparse until 2012. Basically useless for backtesting. We’ll make a note in the docs.