A tale of woe

Suppose you work at a company. You want to use machine learning to improve marketing or operations or the product. Everyone agrees that this is a good idea because machine learning is the future. There are data scientists and engineers. The data scientists say that they know machine learning and will build you a model. They go away for a couple of weeks and then come back with a strange 100 MB file. They say that if you simply feed it these thirteen pieces of data it will make the prediction you care about. You thank the data scientists.

Next, you go to the engineers with the model in hand and relay what the data scientists told you. You task them with wiring the model into your “production systems”. This being a highly complex and intricate process, the data that is fed in is in a different form than what was intended. Now the model is outputting garbage. Oops.

Consider another scenario. The engineers figure out how to wire up the model and it’s doing great, but your data scientists come out with a model that is going to make the company twice as much money. The engineers don’t even need to change anything on their end except swap out the old model with the new one. Despite all of the excitement the model is a flop and is performing well below expectations. It is clear that the data scientists introduced a bug when updating the model. They say that they can fix the bug and in a few days give you another model that they say will work. You’re not fully convinced that the new model doesn’t contain yet another bug.

Another possibility. The model generated some bad predictions six months ago that cost the company a lot of money. It has only now come to light, and the CEO wants an explanation of what happened. The model is an inscrutable black box. The data scientists are asked to explain how the model was constructed. The data scientists have changed their model a hundred and two times since then and did not foresee that anyone would want to audit their predictions. Despite their best efforts, they are not able to reproduce the exact same model, and they cannot answer the CEO’s questions.

Or, suppose your engineers are able to execute this complex handoff regularly and flawlessly. Nevertheless they are unhappy. Instead of building awesome things that they can take pride in, the engineers are spending a lot of their time working on someone else’s project. They find themselves debugging an edge case of a complex data transformation that plugs in as some model’s thirteenth feature. They complain to you that they feel disempowered and demoralized. On the flip side, data scientists wonder why the engineers are so slow to implement their models. Resentment between the two teams grows and then festers.

With all of the different things that have gone wrong, you never feel completely confident that your machine learning model is performing as it should. Given all of the friction with the people involved, you’re very hesitant to try and modify or improve it.

The path to victory

To avoid these problems, a key thing to realize is that data munging code and model training code are production code (production in the sense that bugs in this code will lose your company a ton of money). This code needs to be held to the same standards of quality as any server code. This kind of code may not seem like production code, but that’s only because it’s typically written by data scientists, and data scientists do not tend to approach coding the same way that software engineers do.

Building quality machine learning code is not easy, though. You need a deep understanding of the data science workflow to build a system with the functionality a data scientist needs to iterate on and improve their models. You also need strong software engineering skills to build a system that can manage all of the complexity that machine learning entails. That is to say, you’ll need to bridge the large cultural divide separating data scientists and engineers, and combine their skills in order to make this work.

Data scientists

Data scientists are good at taking a real-world problem and figuring out how data can be leveraged to solve it. They can explore, transform and visualize data, use the results to draw conclusions about the state of the world, and build statistical models that accurately represent it. Their toolkit includes data manipulation libraries such as pandas and plyr, and machine learning libraries such as scikit-learn. They are drawn to settings with rich multi-dimensional data sets including relational data, text, audio, and video.

On the other hand, most data scientists do not have the know-how to develop quality software systems that are maintainable, reliable, fast, scalable, well-monitored and so on. Data scientists need to understand what quality software looks like. Any model code that is driving important business processes must be rigorously reviewed and tested. All of the steps in the modeling process should be automated, and any operation that requires human intervention is risky, error-prone and should be avoided.


Software engineers

Software engineers pride themselves on building reliable and powerful systems that can zip through terabytes of data and recover gracefully from any failure.

However, they generally do not have a deep understanding of the data science workflow — the processes and analyses that data scientists perform to develop models, understand them, and improve their performance. Software engineers need to understand the different steps of the machine learning modeling cycle and build a system that can perform them.

Bridging the gap

In order to do machine learning well, engineers need to build a machine learning platform that data scientists can use. This platform needs to be able to:

  • assemble the data required to train and evaluate the algorithm, including features and model objectives
  • train the machine learning model using the training data assembled
  • generate predictions on the evaluation dataset using the trained model
  • analyze the predictions to determine if the model is performing well enough to deploy to production
  • deploy the model to generate production predictions

Once the system is built, the engineers can expose an API that lets data scientists tell the system what kind of model to build and see the results. The data scientists can then use the platform to build, iterate on, and deploy models to their hearts’ content.
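The five steps listed above can be sketched as one small pipeline. This is only an illustration under stated assumptions, not a real platform: every name here (assemble, train, predict, analyze, run, RunReport) is hypothetical, a constant mean predictor stands in for an actual model, and "deploying" just means writing the model into a dictionary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple

Example = Tuple[List[float], float]   # (feature values, objective value)
Model = Callable[[List[float]], float]


@dataclass
class RunReport:
    mean_abs_error: float
    deployed: bool


def assemble(rows: List[dict], features: List[str], objective: str) -> List[Example]:
    """Step 1: pull the requested features and the objective out of raw rows."""
    return [([float(r[f]) for f in features], float(r[objective])) for r in rows]


def train(examples: List[Example]) -> Model:
    """Step 2: fit a model. A constant mean predictor stands in for a real one."""
    average = mean(y for _, y in examples)
    return lambda _features: average


def predict(model: Model, examples: List[Example]) -> List[float]:
    """Step 3: generate predictions on the evaluation dataset."""
    return [model(x) for x, _ in examples]


def analyze(predictions: List[float], examples: List[Example]) -> float:
    """Step 4: score the predictions (mean absolute error in this sketch)."""
    return mean(abs(p - y) for p, (_, y) in zip(predictions, examples))


def run(rows: List[dict], features: List[str], objective: str,
        max_error: float, registry: Dict[str, Model]) -> RunReport:
    """Run all five steps; deploy only if the evaluation error is acceptable."""
    examples = assemble(rows, features, objective)
    split = len(examples) * 4 // 5            # first 80% trains, the rest evaluates
    train_set, eval_set = examples[:split], examples[split:]
    model = train(train_set)
    error = analyze(predict(model, eval_set), eval_set)
    deployed = error <= max_error
    if deployed:
        registry["production"] = model        # Step 5: stand-in for a real deploy
    return RunReport(error, deployed)
```

The point of the shape is that a data scientist only supplies the rows, the feature names, the objective and the deployment bar; the same code path runs every time, so there is no hand-carried model file.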

A few important considerations

Build a fast system. All software engineers are familiar with the sheer joy and boundless motivation of working within a tight iteration loop. Modify the code; execute the code; observe the output; repeat. The same holds true for data scientists. Slow data infrastructure really kills the party. One common source of latency is fetching and computing the same features over and over again. You’ll want to cache feature computations so that you don’t need to recompute them every time you build a model.
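A minimal sketch of what feature caching could look like, assuming features are computed by plain Python functions and that results can be pickled to local disk. FeatureCache and its version argument are hypothetical names for illustration; a real platform would more likely sit on a shared store.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Callable


class FeatureCache:
    """Disk cache keyed by feature name, code version and parameters."""

    def __init__(self, directory) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str, version: int, params: dict) -> Path:
        # Any change to the name, version or parameters yields a new key,
        # so stale values are never returned after the feature logic changes
        # (as long as the author bumps the version).
        key = json.dumps(
            {"name": name, "version": version, "params": params}, sort_keys=True
        )
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{name}-{digest[:16]}.pkl"

    def get_or_compute(self, name: str, version: int, params: dict,
                       compute: Callable[..., Any]) -> Any:
        path = self._path(name, version, params)
        if path.exists():
            with path.open("rb") as f:
                return pickle.load(f)          # cache hit: skip the computation
        value = compute(**params)              # cache miss: compute once
        with path.open("wb") as f:
            pickle.dump(value, f)
        return value
```

The explicit version number is a deliberate design choice in this sketch: the cache cannot see that the feature code changed, so the author has to say so.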

It is useful to store immutable snapshots of the data used to train your models. If you instead train your models on all of the data available on the current day, you won’t be able to tell whether iterations of your model are improving because of the changes made to the model or because of changes that have accrued in the dataset. Also, if anything goes wrong with one of your models, you’ll have access to the exact dataset that it was trained on, which will help you when debugging.
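One way to get snapshots that cannot silently change is to name each file after a hash of its contents. The sketch below assumes the training data fits in memory as a list of JSON-serializable rows; write_snapshot and read_snapshot are hypothetical helpers, and immutability here is enforced by verification on read rather than by the file system.

```python
import hashlib
import json
from pathlib import Path


def write_snapshot(rows: list, directory) -> Path:
    """Write rows to a file named after the SHA-256 of its contents."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(directory) / f"snapshot-{digest}.json"
    if not path.exists():          # identical data maps to the same file
        path.write_bytes(payload)
    return path


def read_snapshot(path) -> list:
    """Load a snapshot, refusing to return data that no longer matches its name."""
    path = Path(path)
    payload = path.read_bytes()
    expected = path.stem.split("-", 1)[1]
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"snapshot {path.name} has been modified")
    return json.loads(payload)
```

If each trained model records the snapshot name it was built from, two model iterations can be compared on exactly the same data, and a misbehaving model can be traced back to the rows it saw.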

There are decisions that data scientists will need to make across all of the different steps listed above. They will need to decide, among other things, which features to use in the model, which model objective to use, how the data should be filtered and split, which algorithm to use, and what its hyper-parameters should be. You’ll want to create an interface, such as a configuration file format, where the data scientist can specify these decisions upfront in a way that is compact, declarative, machine-readable and human-readable. These configuration files additionally serve as documentation of the models and will help data scientists reason about various versions of models and how they differ from one another.
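As a rough sketch of such an interface, the decisions listed above could live in a JSON document that loads into a frozen dataclass. The field names (row_filter, eval_fraction and so on) and the helpers load_config and diff_configs are assumptions made for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ModelConfig:
    """Every modeling decision in one declarative, comparable record."""
    name: str
    features: List[str]
    objective: str
    row_filter: str            # how the data should be filtered
    eval_fraction: float       # how the data should be split
    algorithm: str
    hyperparameters: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.features:
            raise ValueError("at least one feature is required")
        if not 0.0 < self.eval_fraction < 1.0:
            raise ValueError("eval_fraction must be strictly between 0 and 1")


def load_config(text: str) -> ModelConfig:
    """Parse a JSON config; unknown or missing keys raise TypeError."""
    return ModelConfig(**json.loads(text))


def diff_configs(old: ModelConfig, new: ModelConfig) -> dict:
    """Return {field: (old value, new value)} for every field that changed."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

Because the config is plain data, answering "how does version 102 differ from version 101?" becomes a call to diff_configs rather than an archaeology project.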

There will be tremendous pressure to improve the accuracy metrics of your machine learning models, as those metrics will be most visible and the easiest to optimize. Invariably, someone will suggest that you incorporate the output of some other machine learning model, or an opaque third-party data source, into your model to try and improve its accuracy. You should be very careful about doing this. As with any software dependency, you’ll need to consider how stable the data dependencies are. If the output distribution of the external machine learning model shifts or the third-party data source shuts down, it will break your model. It is tempting to just optimize for predictive accuracy, but you need to manage the complexity of your system as well, and data dependencies are a huge source of complexity.
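If you do take on a data dependency, one modest safeguard is to record what its values looked like at training time and check incoming values against that record. The sketch below is one possible check, not a complete monitoring solution: Baseline, fit_baseline and check_dependency are hypothetical names, and the thresholds (5% missing, a mean shift of three baseline standard deviations) are arbitrary assumptions you would need to tune.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Baseline:
    mean: float
    stdev: float


def fit_baseline(values: Sequence[float]) -> Baseline:
    """Summarize the dependency's values as seen at training time."""
    return Baseline(mean(values), pstdev(values))


def check_dependency(baseline: Baseline,
                     recent: Sequence[Optional[float]],
                     max_missing: float = 0.05,
                     max_shift: float = 3.0) -> List[str]:
    """Return a list of human-readable problems; empty means no alarm."""
    if not recent:
        return ["no recent values received"]   # e.g. the source shut down
    problems = []
    present = [v for v in recent if v is not None]
    missing_rate = 1 - len(present) / len(recent)
    if missing_rate > max_missing:
        problems.append(f"missing rate {missing_rate:.0%} exceeds {max_missing:.0%}")
    if present and baseline.stdev > 0:
        # Shift of the recent mean, measured in baseline standard deviations.
        shift = abs(mean(present) - baseline.mean) / baseline.stdev
        if shift > max_shift:
            problems.append(f"mean shifted by {shift:.1f} baseline standard deviations")
    return problems
```

A check like this does not remove the dependency’s complexity; it only makes a break loud instead of silent.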

The performance metric of your model should be expressed in units of your local currency. No one is going to be able to understand the impact or significance of an area under the ROC curve (AUC) metric. Everyone understands the impact and significance of money.
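For a binary classifier, one simple way to get to currency is to attach a payoff to each of the four outcomes and add them up. The sketch below assumes such payoffs can be estimated for your business; Payoffs and value_of_predictions are hypothetical names, and the amounts in the usage example are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Payoffs:
    """Money gained (positive) or lost (negative) per outcome."""
    true_positive: float
    false_positive: float
    false_negative: float
    true_negative: float = 0.0


def value_of_predictions(scores: Sequence[float], labels: Sequence[int],
                         threshold: float, payoffs: Payoffs) -> float:
    """Total monetary value of acting on every score at or above the threshold."""
    total = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label:
            total += payoffs.true_positive
        elif flagged:
            total += payoffs.false_positive
        elif label:
            total += payoffs.false_negative
        else:
            total += payoffs.true_negative
    return total


# Invented example: catching a fraud saves 100, a false alarm costs 5,
# and a missed fraud costs 100.
payoffs = Payoffs(true_positive=100.0, false_positive=-5.0, false_negative=-100.0)
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(value_of_predictions(scores, labels, 0.5, payoffs))  # -5.0
print(value_of_predictions(scores, labels, 0.2, payoffs))  # 195.0
```

In this toy case the ranking of the scores is identical at both thresholds, yet the lower threshold is worth 200 more, which is the kind of difference a ranking metric alone would never surface.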

If I could emphasize just one point to take away from this blog post, it would be that machine learning models should not be built ad hoc in scattered Jupyter notebooks on a data scientist’s computer and then handed off to an engineer to be deployed. Nothing is stopping you, but here be dragons.

Thank you Alexey Komissarouk, Zain Shah, Xinlu Huang, Ben Rifkind and many others for feedback on this blog post.