ML Integration Update

An update on our ML features and how we plan to integrate ML on P123. There are several threads on the topic in case you want to read more. If I missed some important posts let me know.

Python code for calling 123 API - https://www.portfolio123.com/mvnforum/viewthread_thread,12580
Nice ML integration on FactSet - https://www.portfolio123.com/mvnforum/viewthread_thread,12693
Machine Learning for Factor Investing - https://www.portfolio123.com/mvnforum/viewthread_thread,12733

We are currently deciding on an approach that is user-friendly enough for most, maintains flexibility for power users, won't require a huge effort from us, and won't cost the end user too much to use.

Cost

DataRobot was helpful for showing our data scientist an example with real data and for getting a sense of the costs. He uploaded the S&P 500 stocks with about 60 features for 10 years and burned through a $500 credit pretty fast. A back-of-the-envelope estimate to train a model for the Russell 3000 is maybe 18x that (6x for 3,000 stocks and 3x for different feature combinations). That's ~$10K, which seems expensive since, as far as I know, we only used CPU-intensive algorithms; no GPUs or TPUs. We're going to try different clouds to compare; DataRobot is likely the most expensive. And if these algorithms only require CPUs, then running them on our own servers will be the best way to make this affordable. We would just use a cloud for peak usage.

How it will be cohesive and easy to use …

There are too many ways to screw up if you are downloading data from P123, uploading it to train, then downloading predictions to use. So we need this to be seamless. The bare-minimum integration involves these components:

  1. A front end to create a feature set and target with some simple tools to examine data and transform it
  2. A front end to kick off the learning: universe selection, periods, models, # of “cores” used
  3. A front end to examine results
  4. A way to use the trained model in P123 systems which will be just another function

Notice that nowhere in the use case above is the API mentioned, because it's all behind the scenes. The learning part will (initially) run in a cloud service like AWS under a P123 account. We would simply charge the user the cloud cost plus some profit factor. The great thing about this integration is that you will be able to use the actual values if you want, not just the ranked ones, since there's no data-license issue (the data is never downloaded).

For advanced users that want to use the API, we will add a way to import the results back into P123 so they can run backtests with predictions or run a live model.

That’s the current direction. Let us know your thoughts.

Wow!

I found this interesting and pretty cool.

Jim

Thank you for providing this cost estimate. Personally, I think it is on the light side based on my own experience developing systems. Generally, most factors strike out, and I have to go through hundreds of iterations before I'm happy.

When you implement this, please make sure you have a solution that generates indicators, not just buy/sell trade signals. For example, I have a couple of use cases: one is enhanced analysts' estimates (I have mentioned this previously); another that I am extremely excited about right now is predicting whether an income stock will cut or increase its dividend in the coming year.

Down the road, you might also want to consider introducing properties for indicators, so that users can distinguish between in-sample and out-of-sample for a given indicator.

So, just adding to what Steve is saying: people like Steve will hopefully have a lot of under-the-hood access to Python and be able to munge the data, creating their own validation samples and out-of-sample test sets, for example.

Steve really understands validation and overfitting. The more access he has, the fewer problems he will have with overfitting.

But P123 will want to create some default methods to perform validation and generally prevent overfitting by newbies. I don't actually think that would be too hard to implement. There are plenty of templates for this over at scikit-learn. Even without the templates, it is really just a little slicing followed by some for-loops, df.append() and then df.mean() with Pandas. P123 could probably improve on what scikit-learn does (as a default for newbies).
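Something along these lines, I think (a minimal sketch; the column names 'date' and 'target' and the random-forest choice are just placeholders, not P123's actual layout):

```python
# Minimal walk-forward validation sketch for a long panel DataFrame `df`
# with a 'date' column, feature columns, and a 'target' column (names assumed).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_scores(df, feature_cols, target_col="target", n_splits=5):
    """Train on earlier dates, score on later dates, collect one row per fold."""
    dates = sorted(df["date"].unique())
    fold_results = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(dates):
        train = df[df["date"].isin([dates[i] for i in train_idx])]
        test = df[df["date"].isin([dates[i] for i in test_idx])]
        model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
        model.fit(train[feature_cols], train[target_col])
        fold_results.append({"last_train_date": dates[train_idx[-1]],
                             "oos_r2": model.score(test[feature_cols], test[target_col])})
    # pd.DataFrame from records instead of the deprecated df.append()
    return pd.DataFrame(fold_results)

# e.g. walk_forward_scores(df, ["factor_1", "factor_2"]).mean(numeric_only=True)
```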

And actually if you could work with Steve and others to get a conservative default that they could use most of the time that would be incredible! And possible, I think.

As you know, Pandas was created by Wes McKinney while he was at AQR Capital Management for this very purpose. I definitely think P123 has the skills and the tools to do this.

Jim

Hi Marco – I’ve been following this topic with great interest over the past several months and it seems that you guys are getting close to something that can be really cool.
From my perspective I am interested in this comment:

Here is my use case. Let's say I have a model pipeline where I download data via the API and build a DNN model in TensorFlow. For this model to be usable, I would need to be able to upload it into P123 and then have P123 functionality to create a ranking system that uses my TensorFlow model to score my universe. Furthermore, I should be able to use it in Simulations/Screens/Live portfolios, etc. Then my models would be (could be) fully ML driven. This is the same as what you have today, except that the model weights would not be visible to the user because they are nonsensical.

For this to happen on your end, you would need TensorFlow installed and implemented (or some other ML library that you decide to support) to read models and use them for scoring. This should not be difficult to implement. I've done something similar with a RESTful API that returns scored observations to a client application. One thing I spent time debugging was ensuring that my scoring library was the same version as the model-building library.
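Roughly, the scoring side could look something like this (a sketch only; the path and function name are made up to illustrate the idea, not an actual P123 interface):

```python
# Hypothetical scoring-side sketch: load a user-supplied Keras model saved in
# TensorFlow's SavedModel format and score a universe. Names are illustrative.
import numpy as np
import tensorflow as tf

MODEL_DIR = "uploads/user_123/dnn_model"   # uploaded, already-trained model

def score_universe(features: np.ndarray) -> np.ndarray:
    """One score per stock; rows = stocks, columns = features in training order."""
    model = tf.keras.models.load_model(MODEL_DIR)
    # Version mismatches between training and scoring libraries are a common
    # failure mode, so it is worth logging/checking the version before scoring.
    print("Scoring with TensorFlow", tf.__version__)
    return model.predict(features).ravel()
```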

I hope this makes sense. Would love to see this functionality. The only thing missing in P123 right now is the ML scoring module for Ranking Systems.

enisbe, having our own ML libraries in our system is still being evaluated and will take a bit. There's lots of "wiring" to do, and NNs require specialized hardware which we do not have right now.

But you can use your results now anywhere in P123 by importing the scores/predictions from your model. Did you try it? The import is under Research -> Imported Stock Factors. The factors you import will be visible in the reference, in the MY FACTORS folder.

You can import historical values for backtesting as well as current factors for live portfolios. The big pain right now is that you have to import using CSV files with three columns: date, value, and one of these identifiers: id/ticker/gvkey/cusip/cik. We should have the import-factor API soon, so ML results can flow in much more seamlessly with very little effort once programmed.
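Something like this, for example (the ids and values below are made up; see the import page for the exact header names expected):

```python
# Sketch: write model predictions into the three-column CSV layout described above
# (date, value, and one identifier column; 'id' is recommended per the PS below).
import pandas as pd

preds = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-04"],   # as-of / rebalance date
    "value": [0.82, -0.15],                 # the model's score or prediction
    "id": [46212, 30381],                   # hypothetical P123 stock ids
})
preds.to_csv("ml_factor_upload.csv", index=False)
```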

Let us know, thanks

PS. We recommend using ‘id’ (our own unique integer identifier for a stock issue). Using tickers is ok as long as they were freshly downloaded.

Marco,

I assume you have definitive information on this, as that is your area of expertise.

But just in case you are listening to hype from people who have GPUs and are hyping them (or trying to sell them to you), you might think about where you got that information and quantify how much better GPUs really are. P123's data seems like a lot to us, but I do not think it is compared to, say, Google's. I think CPUs will probably be fine for P123.

There are multiple lines of information that would lead me to think that GPUs have a lot of hype behind them—assuming you are referring to GPUs as the limiting problem.

Again, assuming you are talking about CPUs vs. GPUs, you might look at this: Intel Deep Learning Boost. But there is even hype in this from Intel. Nearly every modern processor has added instruction sets to improve performance with neural nets.

I took a Coursera course on neural nets where a very famous Google engineer (working on speech recognition and self-driving cars) talked about the hype around GPUs. Perhaps that is my most definitive source. I would get his exact quote if it mattered.

He does seem to be right in my experience. I have run a LOT of neural nets, deep nets with lots of layers, on my OLD MacBook Pro. And to put it quite simply, it is FAST. As fast as boosting.

There is a lot of additional specific and general information on the internet to support this as not being just my perception (or the Google engineer's). For example, MacBooks have long been preferred hardware in Kaggle competitions (and competitors always at least try a neural-net model now).

Especially with the potential use of AWS for peak demand, I think you might look at just how good CPUs really are with regard to neural nets. I think you will find they are quite adequate once you get past the hype.

According to the Google engineer on Coursera, the main advantage of GPUs (as you probably know) is faster matrix multiplication (with numpy). Of course, all of his code used numpy matrix multiplication (and few loops) for what is called backpropagation. But matrix multiplication works on any processor, and newer CPUs have instruction sets that help speed it up.
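A toy illustration of that point, timing the same matrix product written with explicit Python loops versus numpy's vectorized multiply on an ordinary CPU:

```python
# Toy timing: explicit loops vs numpy's BLAS-backed matrix multiply on a CPU.
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 64))   # e.g., ~2000 stocks x 64 features
W = rng.standard_normal((64, 32))     # one hidden layer's weights

start = time.perf_counter()
out_loops = np.zeros((X.shape[0], W.shape[1]))
for i in range(X.shape[0]):
    for j in range(W.shape[1]):
        out_loops[i, j] = np.dot(X[i, :], W[:, j])
print("loops:     ", time.perf_counter() - start, "seconds")

start = time.perf_counter()
out_vec = X @ W                       # uses the CPU's vector instructions via BLAS
print("vectorized:", time.perf_counter() - start, "seconds")
print("same result:", np.allclose(out_loops, out_vec))
```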

FWIW, as you continue to look at what is practical as a business model: I can use my MacBook for my purposes, and I do not have a personal need for everyone at P123 to be using neural nets right away.

I will say that one needs a lot of control to build a good neural net (number of layers? use dropout? other types of regularization? batch normalization? standardization or normalization? etc.). Boosting, ridge regression, random forests, etc. will be a lot better for newbies.
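To make the "lot of control" point concrete, here is a sketch of the kind of choices involved in even a small Keras net (all of the values are illustrative, not a recommendation):

```python
# Illustrative only: the knobs a neural-net user has to turn, versus the handful
# of parameters in, say, a ridge regression or a default random forest.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),  # how many layers? how wide?
    tf.keras.layers.BatchNormalization(),          # batch normalization or not?
    tf.keras.layers.Dropout(0.2),                  # dropout rate? other regularization?
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
# Plus: standardize or normalize the inputs? batch size? early-stopping patience?
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
```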

Jim

Jim, I'm a newbie in ML; I rely on others since I have very little first-hand experience. We conducted a study last year with a data scientist using a relatively small data set, and the NN training was taking days. I think he told me it would have taken a week-plus on our hardware, which was not that bad. We do have newer machines, so it might be a different story now. But the learning time has to come down by orders of magnitude, so I don't know. We'll see. I will show him your post.

You can borrow my MacBook if you want.

I think they did not have TensorFlow then. All these programs now do a lot to speed things up in many different ways, parallel processing being the most obvious. That is actually XGBoost's main claim to fame, along with regularization and other things; it is just a great program. KD-trees, sorting algorithms, and a bunch of other things can be important for different machine-learning programs.

I do not know what he was doing, but I will compare a useful boosting model you (or Steve Auger) develop against a neural-net model whenever you want. I will bet a year's membership at P123 that it will take about the same amount of time.

Ask Steve Auger. He too has run both. He likes boosting for a lot of practical reasons, but speed was not one of them.

For practical member models, I think what you describe is TOTALLY OFF. I cannot explain why, but something is off.

In fact, because I generally use a slow learning rate with boosting (e.g., eta = 0.001), I have found neural nets to be faster when running the same model. I marvel at how fast both of them are, however.

I have had kernel regressions run for a weekend before I shut them off (and support vector machines do not run on my Mac either), so I know the problem. But I just have not seen it with neural nets, even on my Mac.

At worst you would just have to ask people to consolidate some of their factors into nodes for a while until you eventually got the processing power you need through AWS or your own processors.

Jim

Marco, do you have a reason to prefer AWS over Azure? I took a two-week course a few months ago on Azure ML Studio; it seemed ahead on modeling pipelines, fast prototyping, and managing the full app life cycle. They also have an auto-ML feature with a GUI that takes a set of algos with ranges of parameters as inputs, to compare them on your datasets when you don't have a clear idea where to start. I don't know about cost/performance compared with AWS, but cost seems easier to control on Azure. I have read a few bad stories from AWS users (individuals and small businesses) who received unexpectedly large bills because there was no way to set a hard spending limit in AWS (only alerts). Most of the time they had a way to negotiate the bill, but it was time- and stress-consuming.

Fred,

I wonder if you might expand on your knowledge of Azure. I went to sign up; it looks like a free account is possible, but they want a credit card, which I do not think I will do today.

Are AWS and Azure different products? Does Azure use Python? Does Azure have pre-packaged solutions, like drop-down menus? Or is it pretty much like Colab? Colab, for me, was just like Jupyter notebooks but with different ways to upload files.

Any comparisons to Python (e.g., Jupyter Notebook or Colab) and your experience with Azure (and/or AWS) would be informative for many I think.

Best,

Jim

Jim,
Azure ML Studio is a visual layer above the code. It looks like a professional integrated development environment (IDE), where you place and link boxes to model data sources, algos, and data pre-processing. A pipeline is modeled as a chart where you drag and drop components from a big library, copy-paste them, enter parameters in contextual windows depending on the type of component, and build, train, and deploy a model without writing a line of code (it is possible to write Python code too). MSFT has three decades of experience in IDEs and software life-cycle management, and they have used it to make a tool for developing and deploying apps quickly in a quite seamless way. The drawback, as with all IDEs: if you start a project in ML Studio, it will not be seamless to port it elsewhere (executable trained models can work outside the platform, but maintenance and iterative development would be complicated without it).

Several years ago, when Azure was in its early stages, there were complaints that credit cards were billed automatically once the free trial was over, without informing the user. The free trial was consumption-based, not time-based, so you didn't really know when it was up. I think that problem was rectified, but the billing concerns are real, and you want to make sure you are able to set limits on what can be spent. I can see an algorithm unwittingly burning a lot of CPU time by accident, or your account getting hacked. Just a thought.

Jim - speed is one of the reasons. Right now I am running XGBoost against the entire universe of dividend-paying stocks, with monthly history back to 2003 and 11 inputs. I can do about 250 complete training runs in ~20 minutes. It would probably take at least a day for one training run using TensorFlow.
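The pattern is roughly this (a sketch only; the data and parameters below are stand-ins, not my actual setup; the repeated-training loop is the point):

```python
# Sketch of many quick XGBoost training runs over a tabular panel (stand-in data).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 11))   # monthly rows since 2003-ish, 11 inputs
y = rng.standard_normal(200_000)
dtrain = xgb.DMatrix(X, label=y)

params = {"max_depth": 4, "eta": 0.1, "objective": "reg:squarederror"}
for run in range(250):                   # e.g., different seeds / parameter tweaks
    params["seed"] = run
    booster = xgb.train(params, dtrain, num_boost_round=100)
```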

Thank you Fred,

I am for whatever Marco and users like you and Steve think will work. I am happy to discuss my experience with Python, for informational purposes, to help the decision process.

Taking neural nets as an example, it seems like Azure might be an easier solution. Not necessarily better, if P123 has a roadmap for that.

Thank you for the information.

Jim

Thanks for the information.

My experience has been different. I wonder why? As you know, I used a lot of layers in my neural-net when I shared my code. You commented on that.

My universe was pretty big: ~1,400 stocks per week, rebalanced weekly (somewhere over 1,000,000 rows in the neural-net array).

Also true, as you know: I do not tend to use a lot of factors/nodes (6 or 7), so there are not a lot of columns in my array (DataFrame). I considered that as a possible limiting factor when I mentioned that people might need to consolidate factors into nodes.

As you know I use early stopping.

Should you want to try TensorFlow again in the future: did you start by normalizing or standardizing your data? What batch size?

Anyway, interesting and great information for P123 to consider.

Best,

Jim

As I wrote above, in 2020 I read about more billing issues with AWS because the expense limit is not a hard limit but only triggers custom alerts. If I remember correctly, the worst (and unverified) horror story was a guy who had put AWS login info in a private GitHub team directory, which was hacked by a rogue bitcoin miner. He probably received alerts while sleeping and found a five-digit bill in the morning. Azure has hard limits. Maybe AWS has implemented some since last year.

I have no cloud preference since we have not used them much. Whichever is better suited for our use case

We want to kick off learning & predictions from P123, and suck the data back in. We would also like to be able to get a cost estimate before kicking off a learning process. This will be very useful once we use our own company account to run users' workloads (and pass on the cost). Doing it all under a single company account should also benefit from discounts.

I recently used AWS, and they update the number of credits you have used sometime the next day. So if I exceeded my available credits, I wouldn't know it until the next day. I mildly exceeded my credits, and that was permitted.

I haven't tried importing these yet. What I had in mind does not actually require anything special. I can provide a trained model saved in a persistent state, which I would upload. All that is needed is for P123 to "hook" the model and score my universe with a ranking system. Technically, this is just replacing the weights you have in your current ranking systems with the model's weights. P123 wouldn't do any training of the models, only scoring. The hook I am referring to is just the TensorFlow/scikit package that can read my model.
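For a scikit-learn model, for example, the hook could be as simple as this (file names and layout are illustrative):

```python
# Hypothetical "hook": load a user's persisted model and score the universe.
import joblib
import numpy as np

model = joblib.load("user_model.joblib")              # trained offline, uploaded as-is
universe_features = np.load("universe_features.npy")  # rows = stocks, cols = factors
scores = model.predict(universe_features)             # scores feed the ranking system
```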

It might be too big to chew at this time but we’ll get there.