Enabling lane detection at scale with NetApp, Run:AI, and Microsoft Azure
Today's automotive leaders are investing heavily in data-driven software applications to advance the most important innovations in autonomous and connected vehicles, mobility, and manufacturing. These new applications require an orchestration solution and a shared file system for their massive datasets to run distributed training of deep learning models on GPUs. Training AI models for the automotive industry involves very large numbers of images. Each 2D color image is represented as a 3D matrix (height, width, and color channel), and the images are analyzed at the pixel and color (RGB) level to detect various objects, such as pedestrians, other cars, and traffic lights.
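To make the 3D matrix idea concrete, here is a minimal sketch of how one color image can be held as height x width x RGB values. It uses plain Python lists so that it runs without extra libraries; the helper names (`make_image`, `shape`, `normalize`) are hypothetical and are not taken from the solution described in this post, which would normally rely on a tensor library.

```python
from typing import List, Sequence, Tuple

Pixel = List[int]            # [R, G, B], each value in 0..255
Image = List[List[Pixel]]    # rows x columns x channels: a 3D structure


def make_image(height: int, width: int, color: Sequence[int]) -> Image:
    """Build a solid-color image as a height x width x 3 nested list."""
    return [[list(color) for _ in range(width)] for _ in range(height)]


def shape(image: Image) -> Tuple[int, int, int]:
    """Return (height, width, channels) of a non-empty image."""
    return len(image), len(image[0]), len(image[0][0])


def normalize(image: Image) -> List[List[List[float]]]:
    """Scale every channel value from 0..255 to 0.0..1.0."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


frame = make_image(height=2, width=3, color=(255, 0, 0))  # tiny all-red frame
print(shape(frame))             # (2, 3, 3)
print(normalize(frame)[0][0])   # [1.0, 0.0, 0.0]
```

A real camera frame has the same layout, only with far more rows and columns, which is why the datasets grow so quickly.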
GPUs need to be maintained at high utilization to reduce training times, permit fast experimentation, and minimize the cost of usage. In addition, a high-performance, easy-to-use file system that prevents GPUs from waiting for data—“GPU starvation”—is imperative in accelerating model training in the cloud and optimizing cost.
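One common way to reduce GPU starvation on the software side is to load the next batches in the background while the current batch is being processed. The sketch below illustrates that general idea with a bounded queue and a worker thread. It is an assumption-laden illustration, not the data pipeline used in this solution; `prefetch` and `buffer_size` are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel that marks the end of the batch stream


def prefetch(batches: Iterable[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield batches while a background thread reads ahead of the consumer."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)      # always signal completion to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()        # blocks only if storage cannot keep up
        if item is _DONE:
            return
        yield item


consumed = list(prefetch(range(5), buffer_size=2))
print(consumed)  # [0, 1, 2, 3, 4]
```

Prefetching only hides latency when the file system can deliver data at least as fast as the GPUs consume it, which is where storage throughput comes in.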
Run:AI, Microsoft, and NetApp have teamed up to address a lane-detection use case by building a distributed deep learning training solution that runs at scale in the Azure cloud. This solution enables data scientists to fully embrace the Azure cloud scaling capabilities and cost benefits for automotive use cases.
Here are the tools we used, and how we used them:
By working with Run:AI, Azure, and NetApp technology, we enabled distributed computation in the cloud and created a high-performing distributed training system. The system worked with tens of GPUs that communicated simultaneously in a mesh-like architecture. To optimize cost, we were able to keep them almost fully occupied, at about 95% to 100% utilization.
We were able to saturate the GPUs and keep GPU cycles as short as possible, which matters because GPUs are among the highest-cost components in the architecture. Azure NetApp Files provides various performance tiers that guarantee sustained throughput at submillisecond latency. We started our distributed training job on a small GPU cluster. Later, we added GPUs to the cluster on demand without interrupting the training, by using the dynamic service level change capabilities of Run:AI software to provide optimal GPU utilization.
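As a rough illustration of how a performance tier translates into throughput, the sketch below assumes that throughput scales linearly with provisioned volume quota, using per-TiB figures of 16, 64, and 128 MiB/s for the Standard, Premium, and Ultra service levels. These numbers and the helper names are assumptions for illustration only and are not taken from this post; check the current Azure NetApp Files documentation before sizing anything.

```python
from typing import Optional

# Assumed throughput in MiB/s per TiB of provisioned quota, per service level.
# Verify these figures against the current Azure NetApp Files documentation.
THROUGHPUT_PER_TIB = {"Standard": 16, "Premium": 64, "Ultra": 128}
LEVELS_LOW_TO_HIGH = ("Standard", "Premium", "Ultra")


def volume_throughput(level: str, quota_tib: float) -> float:
    """Estimated throughput limit (MiB/s) for a volume of the given quota."""
    return THROUGHPUT_PER_TIB[level] * quota_tib


def lowest_sufficient_level(target_mibps: float, quota_tib: float) -> Optional[str]:
    """Lowest service level whose estimated throughput meets the target."""
    for level in LEVELS_LOW_TO_HIGH:
        if volume_throughput(level, quota_tib) >= target_mibps:
            return level
    return None  # no level is enough at this quota; a larger volume is needed


print(volume_throughput("Premium", 4))      # 256
print(lowest_sufficient_level(200, 4))      # Premium
print(lowest_sufficient_level(500, 4))      # Ultra
print(lowest_sufficient_level(1000, 4))     # None
```

Under these assumptions, a training job that outgrows its current tier can either move to a higher service level or grow the volume, whichever meets the target throughput.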
Different data science and data engineering teams were able to use the same dataset for different projects. One team was able to work on lane detection, while another team worked on a different object detection task using the same dataset. Researchers and engineers were able to allocate volumes on demand.
Using Run:AI’s platform, we had full visibility of the AI infrastructure, including all pooled GPUs, at the job, project, cluster, and node levels.
In this use case, lane detection for autonomous vehicles, we were able to use NetApp, Run:AI and Azure to create a single, unified experience for accelerating model training on the cloud, thus reducing costs while improving training times and simplifying processes for data scientists and engineers. Details are available in this technical report and apply to model training across industries and verticals.
Verron has worked in IT for more than 25 years. He joined the ANF team as a Cloud Solution Architect on August 1, 2020. Within the CSA team, Verron focuses on AI/ML, and he continuously looks for new ways of doing things and for opportunities to collaborate with other BUs on joint solutions in the Azure cloud. He is based in Amsterdam and covers EMEA, with a focus on Benelux, Austria, and Switzerland. In his previous role at NetApp, he spent 2 years helping to grow the Hybrid Cloud Services BU in the Benelux. Before that, Verron worked in technical and sales roles at various companies, including EMC, VMware, NetApp (yes, a repeat offender), and Tintri.
Working as an AI Solutions Architect – Data Scientist at NetApp, Muneer Ahmad Dedmari specializes in the development of machine learning and deep learning solutions and in AI pipeline optimization. After working on various ML/DL projects across industries, he decided to dedicate himself to solutions for hybrid multicloud scenarios in order to simplify the lives of data scientists. He holds a master’s degree in computer science, with a specialization in AI and computer vision, from the Technical University of Munich, Germany.