Nice ML integration on FactSet

Here’s a webinar about DataRobot on FactSet

Very slick ML integration. Looks like the “AutoPilot” mode just goes through all 87 (gasp!) models to find the best (isn’t this curve fitting??). Nice presentation of feature impact. Obviously running the data through all 87 models is excessive, but I’m guessing that’s how DataRobot makes money: by enticing you to test more models.

Another good webinar is this one: Machine Learning for Quant Investing with DataRobot on FactSet

A good primer is this one: Five Lessons On Machine Learning-based Investment Strategies

We will be investigating our own ML integration, so let us know what you think. In simple terms, initially I see it as an alternative to the ranking system. If we rely on an ML cloud service like DataRobot, the integration will be much faster. Most of what you see in the video is made by DataRobot. But there are big advantages to hosting our own ML system. For example, you could use the data points directly without a data license, since the data would never leave our servers and no downloading would be needed. BTW, DataRobot offers a $500 credit with registration.

So FactSet seems to be embracing ML. Not sure what the costs are. What are others doing? Have you seen any other integrations of ML with investing?

Marco,

Thank you for the link.

Notice DataRobot uses XGBoost! “Enterprise-grade open source,” as the video puts it. It is one of only six programs (and computer languages) included. Looks like Steve is onto something.

And XGBoost is one of the models that he uses in his example.

BTW, is P123 now the premier AI/ML site for retail investors? I think so, but I would not stop here. I do not know exactly where to go but I am sure there are some great ideas at P123 and in the community.

And then there is marketing. Maybe P123 could have a video of FactSet’s data with XGBoost for retail investors if DataRobot can do it for institutions. Maybe Steve and Hem could moderate it.

Jim

Marco,

I think you need to get beyond this.

It is okay if people posting on P123 do not get it and do not want to use P123’s full machine-learning potential—including the methods to reduce overfitting. P123 has other things to offer.

But done right, machine learning reduces overfitting, unlike the methods used to create most of the designer models.

The video talks about “out-of-sample validation.” And he uses a hold-out test set in the video. As the owner of the premier machine-learning site for retail investors in the world, you might consider promoting P123’s ability to reduce overfitting.
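For what it’s worth, here is a minimal sketch of what that looks like in code: hold out the most recent rows and let early stopping halt boosting when the hold-out error stops improving. The data is synthetic and the parameter values are just illustrative, not anything from the video.

```python
# Minimal sketch: hold-out validation with early stopping in XGBoost.
# Synthetic data; parameter values are illustrative only.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(2000, 20), np.random.rand(2000)
dtrain = xgb.DMatrix(X[:1600], label=y[:1600])    # older rows for training
dholdout = xgb.DMatrix(X[1600:], label=y[1600:])  # most recent rows held out

booster = xgb.train({"eta": 0.05, "max_depth": 4}, dtrain,
                    num_boost_round=1000,
                    evals=[(dholdout, "holdout")],
                    early_stopping_rounds=50,   # stop when hold-out error stalls
                    verbose_eval=False)
print(booster.best_iteration)
```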

Best,

Jim

BTW, I find what Steve has done over at Colab to be a little more user-friendly in many ways. As you would expect from the premier machine-learning platform (for retail investors) in the world. Although what DataRobot has done is both incredible and pretty intimidating if we actually want to keep up.

I started to promote using XGBoost and machine learning here at P123 years ago, when I saw that FactSet was already using XGBoost. I realized that keeping it a secret at P123 was not going to give me any advantage.

The smaller user-base at P123 is not my competition. It is the large institutions with their magnitudes greater money and resources that are our real competition. The only way for any of us to compete is easier access to the data.

Easier access than we have now should be a goal, I would think. The API is getting better almost daily, and I hope the improvements keep coming.

Jim

Wow!!!

Just focus on XGBoost for now. And your strength: fundamentals. You have some work to do.

Jim

I didn’t have an in-depth look at what they are doing, but while they separated in-sample from out-of-sample, they did not mention anything about data leakage. They should be separating the training set and validation set by at least the prediction period. And the same goes for in-sample versus out-of-sample. They are not doing the latter, at least. It becomes an expensive toy.
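To illustrate (a sketch with made-up file and column names, not their actual pipeline): if the target is a 3-month forward return, the training and validation sets should be separated by at least those 3 months.

```python
# Sketch of "purging": leave a gap of one prediction period between
# the training and validation sets so no training label overlaps the
# validation window. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("factors.csv", parse_dates=["date"])

horizon = pd.DateOffset(months=3)       # the prediction period
train_end = pd.Timestamp("2015-12-31")

train = df[df["date"] <= train_end]
valid = df[df["date"] >= train_end + horizon]  # gap = one full horizon
```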

Steve and Marco,

So I want to go with this. I think they did mention leakage at one point. And we do not know what options there may be in the program for controlling the train, validation, and test sets.

BTW, they did show a walk-forward validation, which I think is adequate. P123 can add an “embargo” if Steve and P123 feel a need.
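For anyone who wants to try it, scikit-learn’s TimeSeriesSplit already supports this kind of walk-forward validation with an embargo via its gap argument (a sketch on synthetic data):

```python
# Walk-forward validation with an embargo: gap skips that many samples
# between each training fold and its validation fold. Synthetic data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

tscv = TimeSeriesSplit(n_splits=5, gap=63)   # ~3 months of daily rows
for train_idx, valid_idx in tscv.split(X):
    print(f"train ends at row {train_idx[-1]}, validation starts at row {valid_idx[0]}")
```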

But I AM GOOD WITH THE IDEA that we can do better.

Marco, Steve is talking about things here that mitigate and can eliminate overfitting.

And again, agreeing with Steve, let’s beat our flawed competition! Or seek to provide the equivalent with XGBoost.

Steve can show all of us that XGBoost is not really that hard of a programming problem, and how easy it is to optimize with a grid search.
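To give a sense of how little code that takes (a sketch on synthetic data and an illustrative grid, not Steve’s actual setup):

```python
# Sketch: grid search over a few XGBoost hyperparameters with
# time-ordered cross-validation. Synthetic data; illustrative grid.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

X, y = np.random.rand(500, 10), np.random.rand(500)

grid = {"max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [200, 500]}

search = GridSearchCV(XGBRegressor(), grid,
                      cv=TimeSeriesSplit(n_splits=3),
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```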

Best,

Jim

Just FYI, I haven’t been able to use XGBoost to improve any of my strategies :frowning:

Steve, pretty sure I saw a demo where they handle leakage, or have ways to do it. DataRobot is very well funded and the real deal.

Jim, not sure we should aspire to “be better”. Just different, with a subset of the features for sure, and more useful and affordable for investors.

In other words our goal should be to add ML to our arsenal in a similar way that P123 is doing for mechanical, factor & rule based investing.

Marco,

I agree completely. I do think that XGBoost is a known thing and not really that complex. I do think you could duplicate what they do with XGBoost (and add some additional methods), especially if you focus on fundamentals. Or Steve could, over at Colab. Or whatever direction is best with that. But getting XGBoost to produce good, solid models with fundamental data from P123 can be done.

Jim

Marco,

This sounds interesting to me. Thank you for exploring ML options for P123.

Mark

Just to be clear, the in-sample/out-of-sample example that they give didn’t handle leakage, but maybe there is an explanation of why not if I were to dig deeper. In any case, there are advantages to keeping the data on P123. That part I like. But surely there is a cost for DataRobot that has to be offloaded onto P123 subscribers. And that is probably where you will have a stumbling block.

I have aspirations that go beyond XGBoost. As I mentioned in another thread, I may want to look at creating an AI-based ranking-system designer. Think of the possibility of designing ranking systems based on least-squares error instead of the ranking buckets that we currently have, and also employing some of the anti-overfitting techniques from XGBoost, just not decision trees. In other words, I want to keep my options open for how to work with ML/AI and not get locked into a platform that only does some stuff well.
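To make that concrete, here is one way to read “least-squares error instead of ranking buckets” (a rough sketch with made-up data, not the actual design):

```python
# Hypothetical sketch: learn factor weights by regularized least
# squares on cross-sectional factor ranks, instead of hand-assigned
# rank-bucket weights. All data here is made up.
import numpy as np
from sklearn.linear_model import Ridge

n_stocks, n_factors = 300, 8
ranks = np.random.rand(n_stocks, n_factors)   # 0-1 factor ranks
fwd_ret = 0.05 * np.random.randn(n_stocks)    # forward returns

model = Ridge(alpha=1.0)          # L2 penalty, analogous to reg_lambda
model.fit(ranks, fwd_ret)
score = ranks @ model.coef_       # composite score for ranking stocks
```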

Steve

Steve,

I think this is a small thing that FactSet can worry about. There was a portion of the video where the training data preceded the validation data (in time).

This is “causal” for sure. Clearly no look-ahead bias. I get that with a time-series they could do more. Exactly as you say.

I am all for you fixing this over at Colab to ensure there is no “data leakage.” I understand the issue.

I know you can take care of this over at Colab. It is just a matter of where you slice the data. In fact, I think you have already addressed this.

Now if you and P123 can work on seamless downloads of data to Google Drive, we can be up and running with one implementation of this by the weekend ;-)
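(For anyone following along, the Colab side of that is already trivial; the file path below is hypothetical:)

```python
# In a Colab notebook: mount Google Drive and read a downloaded file.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/p123_data.csv')  # hypothetical path
```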

Jim

I actually have no problem using DataRobot or other software that Marco digs up. Just be wary that trendiness (FactSet using ML) doesn’t imply success. A case in point: Quantopian. It was very popular with almost everybody, but they couldn’t make it work after years of trying. The fact that everybody was using it didn’t mean they were headed in the right direction. DataRobot may be very popular and very professional for the most part. But all it takes is one tiny aspect not handled correctly and you’ve got nothing.

Steve and Marco,

If we can afford DataRobot or P123 can get us access to it then I am ALL FOR IT.

But Steve, I think Marco is just saying he likes what you are doing with XGBoost over at Colab and is getting different ideas on where to go with this (Colab and/or elsewhere).

Marco can clarify but I cannot imagine DataRobot is an option for us. Actually, I hope I am 100%, 180 degrees wrong on whether we could get access to DataRobot.

But Steve, you might declare a victory on this and keep going over there at Colab if I am right. I do not think we can afford DataRobot.

Jim

DataRobot is just a cloud service for ML. They help companies use ML, have a seemingly easier, no-code front end, and rent out their ML instances for big data. There are many others, and many are getting unicorn valuations. DataRobot’s valuation is $2.7B, or around 25% of FactSet’s! This last fact alone is reason enough for P123 to get involved in this.

I doubt DataRobot would even speak to us to do a proper integration (JV). We’re too small. But you can certainly sign up as a user to rent their ML instances using the data from P123 and take advantage of their slick interface and pre-built model blueprints. They offer a $500 credit when you sign up. I bet you can do a lot with a $500 credit. We don’t generate terabytes of data; perhaps 1 gigabyte at most. Training a model with a gigabyte of data can’t be that expensive (they have to compete with the many other ML cloud services out there).

DataRobot can probably charge a premium because of their front end and their “model blueprints”. For example, their blueprint “Light Gradient Boosting on ElasticNet Prediction” has about 30 settings: things like subsample_for_bin, min_child_weight, min_split_gain, reg_lambda, max_delta_step, and so on. You need a data scientist to understand these. But judging from their demo, they have reasonable defaults set for FactSet users. So, at the very least, we can use DataRobot to test a lot of these models and default values and just focus on the ones that work well for us. My guess is that each of these model blueprints is based on open-source libraries and is easily reproducible elsewhere.
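As a sanity check, the setting names above are ordinary LightGBM parameters, so a similar model can be set up with the open-source library directly. A sketch on synthetic data; the values shown are the library defaults, not DataRobot’s tuned values:

```python
# Sketch: the blueprint settings named above are standard LightGBM
# parameters, passed straight to the open-source library.
import numpy as np
from lightgbm import LGBMRegressor

X, y = np.random.rand(500, 10), np.random.rand(500)

model = LGBMRegressor(subsample_for_bin=200000,
                      min_child_weight=1e-3,
                      min_split_gain=0.0,
                      reg_lambda=0.0,
                      max_delta_step=0)
model.fit(X, y)
```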

Marco,

I was impressed with DataRobot.

I got some data and spent a few years doing much of that myself, e.g., XGBoost, ridge regression, etc.

I am now funding a model using factor analysis and am going to convert it to XGBoost over the next week or two.

To see several years of work, run in real time on multiple processors, completed in minutes at most… I am still finding the words.

Anyway, I encourage you to keep looking into this and finding the best business model for P123. Not sure I can add anything to the business part of this.

Jim

Jim - I think the business model is for P123 to sell data to users and have them run the models with DataRobot. And maybe the next unicorn will be me (not likely). I could use a few extra $billion.

Steve,

We will see what can be worked out with DataRobot. I will say you do not need it to do XGBoost, and adding ridge regression or a random forest will not help you much.

I like what you are doing a lot.

Jim

"But judging from their demo they have reasonable defaults set for FactSet users. So, at the very least, we can use DataRobot to test a lot of these models and default values and just focus on the ones that work well for us. "

Marco - I just want you to know that my tulip Python library takes care of all of the (major) parameters for XGBoost. It is not a problem, and it saves a tremendous amount of evaluation time. I think that the big problem with the code that I gave out was that it was at too high a level, and most people can’t appreciate it until they discover how things work at a low level first.
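To illustrate the idea only (this is not the tulip API, which isn’t shown in this thread, just a sketch of what “takes care of the major parameters” can look like; all values are assumptions):

```python
# Illustration only, not the tulip library: freeze the handful of
# XGBoost parameters that matter most at sensible values so users
# rarely need to touch them.
from xgboost import XGBRegressor

MAJOR_DEFAULTS = dict(max_depth=4, learning_rate=0.05,
                      n_estimators=500, subsample=0.8,
                      colsample_bytree=0.8, reg_lambda=1.0)

def make_model(**overrides):
    """Return an XGBRegressor with the major parameters pre-set."""
    return XGBRegressor(**{**MAJOR_DEFAULTS, **overrides})

model = make_model(max_depth=6)   # override one setting if needed
```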