Machine Learning: Using MLOps tools and practices in production

13 March 2020, by Horst Buchwald

New York, 13.3.2020

Anyone who wants to bring machine learning up to the level of maturity that production software has reached with DevOps faces a unique set of challenges. Fortunately, a new space of tools and practices is now emerging under the name MLOps, analogous to DevOps but tailored to the practices and workflows of machine learning.
Machine learning models make predictions for new data based on the data they were trained with. The way this data has been managed so far is one of the main reasons why, by Gartner's estimate, 80 percent of data science projects never make it into production.
It is critical that the data is clean, accurate, and safe to use, without introducing privacy problems, inaccuracies, or errors. Real-world data can also change constantly, so inputs and predictions must be monitored for shifts that could be problematic for the model. These are complex challenges.
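As a minimal illustration of such data checks, the following Python sketch validates a batch of records before it is used for training or prediction; the column names, file name, and thresholds are hypothetical.

import pandas as pd

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of data-quality problems found in a batch of records."""
    problems = []
    # Hypothetical schema for a house-price dataset.
    required = {"bedrooms", "square_meters", "location", "price"}
    missing = required - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    # Reject batches with too many missing values.
    null_share = df[sorted(required)].isna().mean().max()
    if null_share > 0.05:
        problems.append(f"too many nulls: {null_share:.1%}")
    # Simple range checks catch corrupt or mis-scaled inputs.
    if (df["square_meters"] <= 0).any():
        problems.append("non-positive square_meters values")
    if (df["price"] < 0).any():
        problems.append("negative prices")
    return problems

batch = pd.read_csv("incoming_batch.csv")  # hypothetical input file
issues = validate_batch(batch)
if issues:
    raise ValueError(f"Batch rejected: {issues}")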
Background: DevOps practices focus on the "build and release" process and on continuous integration. Traditional development builds are packages of executable artifacts compiled from source code, and the non-code supporting data in these builds is usually limited to relatively small, static configuration files. Essentially, traditional DevOps is designed to build programs that consist of sets of explicitly defined rules producing specific results in response to specific inputs.
In contrast, machine learning models make predictions by indirectly capturing patterns from data rather than by formulating all the rules explicitly. A characteristic machine learning problem is to make new predictions based on known data, such as predicting the price of a house from known house prices and details such as the number of bedrooms, the floor area in square meters, and the location. Machine learning programs run a pipeline that extracts patterns from the data and creates a weighted machine learning model artifact. As a result, builds become far more complex and the entire data science workflow becomes more experimental. An essential part of the MLOps challenge is therefore to support multi-step machine learning model builds that involve large amounts of data and varying parameters.
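A minimal sketch of such a pipeline for the house-price example, assuming scikit-learn and a hypothetical CSV of past sales, could look as follows; the output is the weighted model artifact that later serves predictions.

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: past sales with features and the known price.
data = pd.read_csv("house_sales.csv")
X = data[["bedrooms", "square_meters", "location"]]
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline extracts patterns from the data and produces a weighted model.
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("location", OneHotEncoder(handle_unknown="ignore"), ["location"])],
        remainder="passthrough")),
    ("model", RandomForestRegressor(n_estimators=200, random_state=42)),
])
pipeline.fit(X_train, y_train)
print("validation R^2:", pipeline.score(X_test, y_test))

# The trained pipeline is the build artifact that MLOps has to version and deploy.
joblib.dump(pipeline, "house_price_model.joblib")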
To run projects safely in live environments, we need to be able to monitor problem situations and see how to fix things when they go wrong. There are fairly standard DevOps practices for recording code builds so that old versions can be restored. With MLOps, however, there is no standard way to record, and return to, the data that was used to train a particular version of a model.
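One simple way to close that gap, sketched here with hypothetical file names, is to fingerprint the training data with a content hash and store that fingerprint in a manifest next to the model artifact, so a deployed model version can always be traced back to the exact data it was trained on.

import hashlib
import json
from datetime import datetime, timezone

def file_sha256(path: str) -> str:
    """Content hash of the training data file, used as a reproducible fingerprint."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Record which data produced which model (hypothetical file names).
manifest = {
    "model_artifact": "house_price_model.joblib",
    "training_data": "house_sales.csv",
    "training_data_sha256": file_sha256("house_sales.csv"),
    "trained_at": datetime.now(timezone.utc).isoformat(),
}
with open("model_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)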
There are also specific challenges that MLOps faces in the live environment. There are largely consistent DevOps approaches to monitoring for error codes or rising latency, but monitoring for bad predictions is a different matter. You may have no direct way of knowing whether a prediction is good, and may instead need to monitor indirect signals such as customer behaviour (conversions, the rate of customers leaving the site, any feedback). It can also be difficult to know in advance how well your training data represents your live data: there may be good agreement at a general level, but specific types of exceptions. This risk can be mitigated by careful monitoring and by careful management of the rollout of new versions.
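A very simple drift check along these lines, sketched with SciPy's two-sample Kolmogorov-Smirnov test and hypothetical feature files, flags live inputs whose distribution has moved away from the training data.

import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha=0.01) -> bool:
    """Return True if the live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

# Hypothetical example: square-meter values seen in training vs. in production.
train_sqm = np.load("train_square_meters.npy")
live_sqm = np.load("recent_live_square_meters.npy")
if check_feature_drift(train_sqm, live_sqm):
    print("Warning: input drift detected for square_meters; retraining may be needed.")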
The MLOps tool landscape
The effort required to solve MLOps challenges can be reduced by adopting a platform and applying it to the specific case. Many organisations face the choice of using a standard machine learning platform or trying to build an internal platform themselves by assembling open-source components.
Some machine learning platforms are part of a cloud vendor's offering, such as AWS SageMaker or AzureML. Depending on an organisation's cloud strategy, this may or may not be attractive. Other platforms are not cloud-specific and instead offer a self-installed or individually hosted solution (e.g. Databricks MLflow).
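As an illustration of what such a platform provides, the following sketch uses MLflow's tracking API to record a training run; the run name, parameters, and the synthetic stand-in data are assumptions made for the example, not part of the article.

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for real training data (for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(30, 200, size=(500, 1))           # square meters
y = X[:, 0] * 3000 + rng.normal(0, 20000, 500)    # noisy price
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="house-price-baseline"):
    params = {"n_estimators": 200, "max_depth": 12}
    mlflow.log_params(params)

    model = RandomForestRegressor(**params, random_state=42)
    model.fit(X_train, y_train)

    mlflow.log_metric("validation_r2", model.score(X_test, y_test))
    # The model artifact is stored with the run, so this exact version can be redeployed later.
    mlflow.sklearn.log_model(model, artifact_path="model")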