So, your organization has decided to use machine learning (ML) and to invest in resources that can deliver business value to your customers. Data scientists and data engineers are now solving problems that seemed next to impossible a few years back, and proofs of concept (POCs) are showing exceptional results.
And now it’s the moment of truth: Business executives are asking when you can deploy the trained models into production!
It shouldn’t be difficult, because most of the complex work is done, right? Unfortunately, it’s not as easy as it seems. According to the deeplearning.ai 2019 report, only 22% of companies have successfully deployed an ML model into production. The rest are stuck in the preproduction phase or have simply failed at the deployment and model management stage.
But don’t be discouraged. In this blog post, I discuss some of the challenges of ML lifecycle management in an enterprise environment—and what methodologies and tools can help productize ML solutions.
ML has advanced a lot, and we are privileged to have most of the resources that we need at our disposal. We have access to compute resources (on premises and in the cloud), to the necessary quantity and quality of datasets, and to state-of-the-art ML research. ML workflows are also being streamlined: data engineers and data scientists transform and prepare the data that ML models consume for training, and models ultimately go to production for serving, where they’re monitored and retrained if necessary.
In the real world, only a small segment of an ML system is composed of the ML model code. The rest of the process consists of data collection, data consolidation, system configurations, model and data verification, debugging and testing, resource management and infrastructure serving, feature and metadata management, and monitoring.
Let’s say that you have a team of data scientists and data engineers working on dynamic pricing for airline flight bookings. The business objective is to set ticket prices based on travel dates, seat availability, and competitor pricing, with the goal of increasing sales.
You and your team work mostly independently in your own work environments, such as Jupyter notebooks, and use the dataset that’s available for training and validating the model. Team members might share notebooks with one another by email, or they might use code versioning (GitHub, Bitbucket, etc.). They also have regular catch-up meetings to make sure that everyone is in sync and that the project is progressing as expected.
You’re all using allocated compute and storage resources (AI infrastructure) for training by executing the cells in your notebooks. After some time, your trained model is producing good enough results on your holdout test dataset, and you believe that it will work in the production environment and will predict better pricing for airline tickets. You also have data analysis and visualization reports in your notebooks that back the results and validate your model’s performance.
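To make this scenario concrete, here’s a minimal sketch of that notebook-style workflow in Python with scikit-learn. The synthetic features (days_until_departure, seats_remaining, competitor_price) and the model choice are illustrative assumptions, not an actual airline pricing system.

```python
# Hypothetical notebook workflow: train a pricing model on synthetic
# data and evaluate it on a holdout test set.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "days_until_departure": rng.integers(1, 180, n),
    "seats_remaining": rng.integers(0, 200, n),
    "competitor_price": rng.uniform(80, 600, n),
})
# Synthetic target: prices rise as departure nears and seats run out.
df["ticket_price"] = (
    0.9 * df["competitor_price"]
    + 2_000 / (df["days_until_departure"] + 5)
    + 1_500 / (df["seats_remaining"] + 10)
    + rng.normal(0, 15, n)
)

X = df.drop(columns=["ticket_price"])
y = df["ticket_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Holdout MAE: {mae:.2f}")
```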
Finally, it’s time to deploy your best trained model and integrate it into the existing airline ticket–booking system. But there are a few unanswered questions that you need to take care of first: How do you package the model and its dependencies for the booking system? Can you reproduce the training run later if you need to debug or retrain? How are the model, the code, and the training data versioned? And who monitors the model’s predictions, and retrains it, once it’s live?
If you don’t consider and mitigate these challenges, the consequences near the end of your project can be disastrous. You might have to rebuild everything from scratch, or the project might fail to reach the production stage.
Successful and mature AI processes automate these phases, so that training new models with new data or with new implementations is a smooth, repeatable process. Automation helps abstract away the complexity and lets you focus on the actual problem at hand. Wait a second, isn’t that very similar to what DevOps practices are known for? And can’t we use a similar concept for ML? That’s right, I’m referring to machine learning operations, or MLOps for short.
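As a rough illustration of that automation, here’s a minimal Python sketch of one automated phase: retraining on fresh data and promoting the new model only if it outperforms the current one. The load_fresh_data, train, and deploy helpers are hypothetical placeholders for your own pipeline steps.

```python
# Hypothetical retraining step in an automated pipeline: the new model
# is promoted only if it beats the currently deployed one.
from sklearn.metrics import mean_absolute_error

def retrain_and_maybe_promote(load_fresh_data, train, current_model, deploy):
    X_train, X_test, y_train, y_test = load_fresh_data()
    candidate = train(X_train, y_train)

    candidate_mae = mean_absolute_error(y_test, candidate.predict(X_test))
    current_mae = mean_absolute_error(y_test, current_model.predict(X_test))

    # Promote only on a measurable improvement; otherwise keep serving
    # the existing model and leave the run for human review.
    if candidate_mae < current_mae:
        deploy(candidate)
        return candidate
    return current_model
```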
But can we really use DevOps methodologies for ML and simply call it MLOps? It does seem to be an obvious option, because an ML system is a software system (Software 2.0) at its core. But it’s a different beast altogether, and it demands a new mindset for handling AI development and workflow management. The core difference between ML (Software 2.0) and a traditional software stack (Software 1.0) is that ML is not just code and configurations. Data is also an integral component of the ML lifecycle and defines the behavior of a trained model.
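One practical consequence: if data defines a model’s behavior, then the exact training data must be versioned alongside the code. Here’s a minimal sketch of that idea, fingerprinting the training dataset and recording it next to the model artifact. The file layout is a hypothetical example; tools such as DVC automate the same concept.

```python
# Hypothetical data-lineage record: hash the exact training data and
# store the hash with the model version and the git commit.
import hashlib
import json

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hash of the raw training data file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_lineage(model_version: str, data_path: str, git_commit: str) -> None:
    """Write a small lineage record next to the model artifact."""
    lineage = {
        "model_version": model_version,
        "data_sha256": dataset_fingerprint(data_path),
        "git_commit": git_commit,
    }
    with open(f"{model_version}_lineage.json", "w") as f:
        json.dump(lineage, f, indent=2)
```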
MLOps is a collaborative methodology and practice that combines data engineering, ML, and DevOps. It aims to operationalize the training and tracking of models at scale, the deployment and maintenance of models in production, and the entire data pipeline that encompasses the ML system. MLOps also measures model performance against business objectives, and it enables continuous delivery of business value.
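For example, one building block of operationalized training is tracking every run with its parameters, metrics, and resulting model. Here’s a minimal sketch using the open-source MLflow tracking API; MLflow is just one readily available option, and part 2 of this blog uses a different toolchain (Kubeflow and Jenkins).

```python
# Hypothetical tracked training run: parameters, metrics, and the model
# artifact are all logged, so any run can be compared and reproduced.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def tracked_training_run(X_train, y_train, X_test, y_test, lr=0.1):
    with mlflow.start_run():
        mlflow.log_param("learning_rate", lr)
        model = GradientBoostingRegressor(learning_rate=lr, random_state=42)
        model.fit(X_train, y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test))
        mlflow.log_metric("holdout_mae", mae)
        # Store the trained model as a versioned artifact.
        mlflow.sklearn.log_model(model, "model")
    return model
```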
Here are some general practices that you can use to achieve MLOps in your ML projects:

- Version everything: code, datasets, and trained models, so that every experiment and result is reproducible.
- Automate the pipeline from data preparation and training through packaging and deployment, rather than running steps by hand in notebooks.
- Validate data and models before promoting them, instead of discovering problems in production.
- Monitor models in production, and retrain them when the data or the performance drifts.
- Make collaboration among data engineers, data scientists, and operations teams part of the process, not an afterthought.
When your organization uses good MLOps practices, you can ultimately produce better results while staying cost-effective. You can set up a platform and architecture that make the whole process as easy as pushing code to version control. The rest (packaging, preprocessing, training, model versioning, deployment, autoscaling, etc.) is taken care of for your team.
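As a rough illustration, here’s a minimal sketch of the serving step that such a platform would package and deploy automatically on every push: a small Flask endpoint that loads the latest model artifact and returns price predictions. The model path and feature names are assumptions carried over from the earlier example.

```python
# Hypothetical serving endpoint; in a mature MLOps setup, CI/CD builds
# this into a container, deploys it, and scales it with no manual steps.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # artifact produced by the pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as:
    # {"days_until_departure": 30, "seats_remaining": 12,
    #  "competitor_price": 240.0}
    features = pd.DataFrame([request.get_json()])
    price = float(model.predict(features)[0])
    return jsonify({"predicted_price": price})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```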
To learn more about MLOps and how NetApp® AI makes it easier, check out our featured video.
In part 2 of this blog, I discuss a use case and how to deploy a collection of tools (GitHub, Kubeflow, Jenkins, and NetApp AI data management) to incorporate the MLOps methodology into your projects.
To learn more about NetApp AI solutions, visit www.NetApp.com/ai.
Working as an AI Solutions Architect and Data Scientist at NetApp, Muneer Ahmad Dedmari specializes in the development of machine learning and deep learning solutions and in AI pipeline optimization. After working on various ML/DL projects across the industry, he decided to dedicate himself to solutions for hybrid multicloud scenarios, in order to simplify the lives of data scientists. He holds a master’s degree in computer science, with a specialization in AI and computer vision, from the Technical University of Munich, Germany.