Deep learning with Apache Spark and NetApp AI—Horovod distributed training

Rick Huang

February 06, 2023

In Hybrid cloud solutions with Apache Spark and NetApp AI, I explained how NetApp® solutions for hybrid cloud can provide data protection and consistent, reliable, cost-efficient performance across your distributed infrastructure for AI workloads.

One common hybrid cloud deep learning (DL) application is natural language processing (NLP), and in two subsequent blog posts, I provided financial sentiment analysis and testing results to validate NetApp solutions with Spark NLP on-premises. That testing produced sentence-level sentiment results for the quarterly earnings calls of the Nasdaq top 10 companies.

In this blog post, I discuss another common Spark use case: distributed DL with Horovod. Horovod improves speed, scale, and resource utilization. With the NetApp storage portfolio, your workflows move data seamlessly for model training, validation, and inferencing in Spark clusters on premises and/or to and from the cloud.

Horovod delivers simplicity, flexibility, and speed at scale

Apache Spark is a programming framework for writing Hadoop applications that work directly with the Hadoop Distributed File System (HDFS) and other file systems, such as NFS and object storage. It’s a fast analytics engine with machine learning (ML) libraries that are designed for large-scale distributed data processing, and it functions seamlessly with the NetApp AI and modern data analytics portfolio. Its in-memory operations are more efficient than MapReduce for data pipelines, streaming, interactive analysis, and ML/DL algorithms. Apache Spark also mitigates the I/O operational challenges that you might experience with Hadoop.

With Horovod, you can choose from popular DL frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. After a training script has been written for scale by using the Horovod API, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. In addition to being easy to use and to scale, Horovod is fast, achieving 90% scaling efficiency for both Inception V3 and ResNet-101 and reaching 68% scaling efficiency for VGG-16 benchmarks.
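To make the scaling-efficiency figures above concrete, here is a minimal sketch that computes efficiency as the measured throughput on N GPUs divided by N times the single-GPU throughput. That definition is the commonly used one and is an assumption here, and the throughput values are hypothetical placeholders, not benchmark results.

```python
def scaling_efficiency(single_gpu_throughput: float,
                       multi_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if num_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("need at least one GPU and a positive baseline")
    ideal_throughput = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal_throughput


# Hypothetical numbers for illustration only (images per second).
baseline = 200.0      # throughput on one GPU
measured = 23040.0    # throughput on 128 GPUs
efficiency = scaling_efficiency(baseline, measured, 128)
print(f"Scaling efficiency: {efficiency:.0%}")
```

An efficiency of 100% would mean that doubling the number of GPUs doubles the training throughput; communication overhead between workers is what pulls real results below that line.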

Is NetApp storage right for your Horovod workloads?

Before you decide to deploy Apache Spark Horovod workloads with NetApp storage to overcome your large-scale distributed data processing and DL challenges, you might need to answer questions such as:

Why should I use NetApp for Apache Spark Horovod workloads?
What are the benefits of using NetApp with Apache Spark and Horovod?
How can the NetApp DataOps Toolkit facilitate traditional or containerized workspace management, data manipulation, ML/DL data, code and model versioning, and inference server orchestration?
What NetApp storage controllers should I use for my in-memory engine, and what controllers should I use for my data lake?

The challenges for Apache Spark Horovod workloads and how NetApp solutions address some of them

In Deep Learning with Apache Spark and NetApp AI – Financial Sentiment Analysis, I addressed your DL challenges based on our findings from many proof-of-concept (POC) studies with large-scale customers in the financial services and automotive industries. For Horovod distributed training, the specific considerations include:

Unpredictable performance. Traditional Hadoop deployments typically use commodity hardware. To improve performance, you must tune the network, operating system, Hadoop cluster, and ecosystem components such as Spark, TensorFlow, Keras, and Horovod. Even if you tune each layer, it can be difficult to achieve the overall desired performance levels because Hadoop is running on commodity hardware that’s not designed for high performance.
Scaling of compute and storage. Because your compute servers are busy running DL workloads and serving data, you can’t scale your servers and storage independently. Going forward, it’s not feasible to continue adding servers and storage to keep up with your increasing data quantity, analytics, and large-scale distributed model training demands.
Sharable data and GPUs. Your data is locked up in local HDFS clusters. But you want to share it between multiple clusters and applications in a hybrid cloud and be future-ready for the use of GPUs to accelerate DL model training. You want to use the latest GPUs and virtual GPUs (vGPUs) as defined by your system administrator and as provisioned to your cluster of virtual machines.
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that’s optimized and certified for the industry’s leading virtualization platform, VMware vSphere, and for VMware Cloud Foundation. NetApp supports NVIDIA AI Enterprise on VMware, and NetApp is the first enterprise storage partner to validate storage solutions with NVIDIA AI Enterprise for VMware environments. The new platform is offered in addition to the proven NetApp ONTAP® AI infrastructure that’s built on NVIDIA DGX BasePOD for bare-metal deployments. With this new platform from NetApp and NVIDIA, you get best-in-class data storage and data management with NetApp all-flash cloud-integrated arrays and AI software.
Lack of support for more than one language. To run jobs, often you need support for multiple languages in addition to MapReduce Java programs. Options like SQL, Python, Scala, and scripts give you more flexibility for getting answers, and they support data science workflow development. They also give you more options to organize and retrieve data and faster ways to deploy DL models into production.
NetApp works with independent software vendors (ISVs) like Domino Data Lab and Run:ai and open-source frameworks like Kubeflow and MLflow, which can be used to orchestrate your Horovod workflows.
Complicated frameworks and tools. Enterprise AI teams face multiple challenges. Even with expert data science knowledge, the tools and frameworks for different deployment ecosystems and applications might not translate simply from one to another. With NetApp AI solutions, you get a data science platform that integrates seamlessly with corresponding big data platforms built on Spark and that delivers ease of data movement, reusable models, and code out of the box. You also get a set of tools that supports best practices for prototyping, validating, versioning, sharing, reusing, retraining, and quickly deploying models to production.

To confirm that NetApp technologies can resolve these potential issues, we defined the deliverables for solutions that use an Apache Spark Horovod workload with NetApp storage. We used Horovod on Spark, TensorFlow, and Keras to validate Spark performance with NetApp on-premises, cloud-native, and hybrid cloud solutions. Those solutions used NetApp AFF controllers, Azure NetApp Files, and NetApp StorageGRID® object storage.

Performance testing: Apache Spark Horovod workload with NetApp storage

Training workflow setup

The Horovod on Spark package provides a convenient wrapper around Horovod that makes running distributed training workloads in Spark clusters simple. This package enables a tight model design loop in which data processing, model training, and model evaluation are all performed in Spark, where training and inferencing data resides. You can use two APIs to run Horovod on Spark: a high-level Estimator API and a lower-level Run API. Although both use the same underlying mechanism to launch Horovod on Spark executors, the Estimator API abstracts the data processing, model training loop, model checkpointing, metrics collection, and distributed training. In our validation testing, we used Horovod Spark Estimators, TensorFlow, and Keras for an end-to-end data preparation and distributed training workflow based on the Kaggle Rossmann Store Sales competition.
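As a rough illustration of the Estimator API described above, the sketch below wires a Keras model into a Horovod Spark Estimator. It is not the script used in the validation testing: the column names, process count, batch size, epoch count, and output column are illustrative assumptions, and the caller is assumed to supply the Spark DataFrame, the compiled-ready Keras model, the optimizer, and a working directory that every executor can reach.

```python
from typing import Any, Dict, List


def estimator_settings(feature_cols: List[str], label_col: str,
                       num_proc: int, batch_size: int,
                       epochs: int) -> Dict[str, Any]:
    """Collect the plain-Python settings passed to the Keras estimator."""
    return {
        "num_proc": num_proc,          # number of Horovod training processes
        "feature_cols": feature_cols,  # DataFrame columns used as model inputs
        "label_cols": [label_col],     # DataFrame column(s) to predict
        "batch_size": batch_size,
        "epochs": epochs,
        "verbose": 2,
    }


def train_with_estimator(train_df, model, optimizer, work_dir: str,
                         settings: Dict[str, Any]):
    """Fit a Keras model on a Spark DataFrame with a Horovod Spark Estimator."""
    # Imported lazily so the settings helper works without Horovod installed.
    import horovod.spark.keras as hvd
    from horovod.spark.common.store import Store

    # The store holds intermediate training data and model checkpoints.
    store = Store.create(work_dir)
    estimator = hvd.KerasEstimator(store=store, model=model,
                                   optimizer=optimizer, loss="mae",
                                   **settings)
    # fit() returns a Spark Transformer that appends prediction columns.
    return estimator.fit(train_df).setOutputCols(["Sales_output"])


# Illustrative (assumed) values; a real run would pass many more features.
settings = estimator_settings(["Store", "DayOfWeek", "Promo"], "Sales",
                              num_proc=4, batch_size=100, epochs=10)
```

With the lower-level Run API, you would instead write the training function yourself and launch it with `horovod.spark.run`, which leaves data preparation, checkpointing, and metrics collection to your own code.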

We also used the script keras_spark_horovod_rossmann_estimator.py, which you can find in the Python scripts for each major use case in NetApp technical report TR-4570. The script contains three parts. The first part performs various data preprocessing over an initial set of CSV files that are provided by Kaggle and that were gathered by the community. The input data is separated into a training set, with a validation subset, and a testing dataset. The second part defines a Keras deep neural network (DNN) model with a logarithmic sigmoid activation function and an Adam optimizer, and it performs distributed training of the model by using Horovod on Spark. The third part performs prediction on the testing dataset by using the best model that minimizes the validation set’s overall mean absolute error, and it creates an output CSV file.
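The model selection step in the third part can be sketched in plain Python: compute the mean absolute error (MAE) on the validation subset for each candidate model and keep the one with the lowest error. The function names and the sales figures below are hypothetical and exist only to show the calculation; the actual script relies on Keras and Horovod for this step.

```python
from typing import List, Sequence


def mean_absolute_error(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average of the absolute differences between actual and predicted."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_epoch(actual: Sequence[float],
               predictions_per_epoch: List[Sequence[float]]) -> int:
    """Return the index of the epoch with the lowest validation MAE."""
    errors = [mean_absolute_error(actual, preds)
              for preds in predictions_per_epoch]
    return min(range(len(errors)), key=errors.__getitem__)


# Hypothetical validation targets and per-epoch predictions.
validation_sales = [100.0, 200.0, 300.0]
predictions_per_epoch = [
    [130.0, 170.0, 330.0],  # errors of 30, 30, 30
    [110.0, 195.0, 290.0],  # errors of 10, 5, 10
    [120.0, 210.0, 280.0],  # errors of 20, 10, 20
]
best = best_epoch(validation_sales, predictions_per_epoch)
print(f"Best epoch index: {best}")
```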

Run-time results: NFS versus HDFS and local storage

The following figure shows various run-time results, comparing the use of a NetApp AFF high-availability (HA) pair, HDFS, and local storage. NFS direct access with an AFF HA pair clearly outperforms the other two options, by up to 7.55 times in terms of run-time speedup, for several reasons:

NFS allows direct access to files that are stored on remote servers, rather than having to go through a centralized NameNode or DataNode as in HDFS. That direct access eliminates the need for data replication and reduces network overhead, leading to faster data access times.
NFS is a standard file access protocol that runs over TCP/IP and is optimized for high-performance data access. HDFS, on the other hand, uses its own custom protocols, which can introduce more latency.
NFS allows the use of multiple client nodes to access the same data simultaneously, whereas HDFS requires data to be accessed through a single NameNode. Simultaneous access enhances parallelism and improves performance when working with large datasets.

In terms of workflow run time, the speedup can vary depending on the specific use case and dataset. In general, however, NFS direct access can be several times faster than the use of HDFS and local storage. In this case, NFS direct access is 7.55 times faster than local storage and 1.43 times faster than HDFS.
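The speedup figures are ratios of workflow run times. The sketch below shows the arithmetic, using placeholder run times that were chosen only to reproduce the quoted ratios; they are not the measured values from the testing.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / candidate_seconds


# Placeholder run times in seconds (assumed, not measured).
run_times = {"local": 755.0, "hdfs": 143.0, "nfs": 100.0}

nfs_vs_local = speedup(run_times["local"], run_times["nfs"])
nfs_vs_hdfs = speedup(run_times["hdfs"], run_times["nfs"])
hdfs_vs_local = speedup(run_times["local"], run_times["hdfs"])

print(f"NFS vs. local storage:  {nfs_vs_local:.2f}x")
print(f"NFS vs. HDFS:           {nfs_vs_hdfs:.2f}x")
print(f"HDFS vs. local storage: {hdfs_vs_local:.2f}x")
```

Dividing the two quoted ratios (7.55 / 1.43) also implies that HDFS was roughly 5.3 times faster than local storage in this comparison.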

Note that NFS direct access is designed for shared data access and is also built to handle the big data processing tasks that HDFS was designed for. The goal of NFS direct access is to give you a fast way to access remote data that is more reliable, more consistent, and less complex than the use of HDFS. You also get performance and throughput benefits in both read and write operations. On top of that, you get battle-tested, industry-proven NetApp technologies like Snapshot™, SnapMirror®, FlexCache®, and FlexClone®.

NetApp BlueXP for unified control

Our testing solution also included the NetApp BlueXP™ unified control plane. BlueXP brings industry-leading NetApp storage and data services to your hybrid multicloud experience. BlueXP enables you to simplify and to automate the deployment and management of NetApp storage solutions. It also includes the NetApp Astra™ product family, which provides application data protection, mobility, and storage for cloud-native, containerized applications.

With BlueXP, you get a single point of control to manage your data storage across multiple remote and branch office locations. It also automates the process of tiering data based on your usage and performance requirements. Your IT administrators can manage your organization’s storage resources across the entire enterprise from a single web-based interface.

BlueXP is a software-defined solution that you can run on premises or in the cloud. It can be integrated with other NetApp products such as OnCommand® Insight and SnapCenter® technology, and it can be an integral part of your NetApp-powered data fabric.

NetApp Astra for cloud-native applications

The NetApp Astra family enables your teams to focus on delivering cloud-native Apache Spark applications with full confidence in their data infrastructure. Astra helps you prevent application downtime and data loss, reduce time-consuming operations, and scale to fit your needs. Astra includes:

Astra Control, which gives you application-aware data protection and mobility for Kubernetes and is available both as a fully managed service and as self-managed software. It also makes it easy to manage your on-premises, data-hungry Kubernetes applications. Astra Control works seamlessly in your on-premises Kubernetes cluster. It provides containerized workflow and data/model backup, disaster recovery, and synchronization with the cloud through the BlueXP copy and sync service. This service extends and enhances NetApp Cloud Sync technology, so you can move your data from any source to any target.
If you’re a NetApp ONTAP customer, you can also deploy Astra Control as on-premises software. This version of Astra Control, called Astra Control Center, extends the rich enterprise-class data management capabilities of ONTAP to Kubernetes applications.
Astra Trident, which provides data connectivity to persistent datastores for Kubernetes applications.
Storage services, including NetApp Cloud Volumes Service for Google Cloud, Google Persistent Disk, Azure NetApp Files, Azure Disk Storage, NetApp Cloud Volumes ONTAP, and Amazon FSx for NetApp ONTAP. Protocols are flexible depending on your data lake and your prototyping and production environment specifications.

Data lake and access

Our testing storage solution used a shared data lake that serves large data compute server farms simultaneously. We based this solution on ONTAP, which provides easy management and superior performance, with NFS direct access for fast and secure access to NFS data and with HDFS for access to distributed low-cost storage.
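One practical consequence of a shared data lake is that the same Spark code can read a dataset from either backend by changing only the path. The sketch below assumes that the NFS export is mounted at the same path on every Spark node, which is a generic approach and not necessarily the NFS direct access configuration used in the testing. The mount point, NameNode address, and file path are hypothetical.

```python
def dataset_uri(backend: str, relative_path: str) -> str:
    """Build the URI for a dataset on the chosen storage backend."""
    roots = {
        # Hypothetical NFS mount point, identical on every Spark node.
        "nfs": "file:///mnt/netapp_datalake",
        # Hypothetical HDFS NameNode address.
        "hdfs": "hdfs://namenode.example.com:8020",
    }
    if backend not in roots:
        raise ValueError(f"unknown backend: {backend}")
    return f"{roots[backend]}/{relative_path.lstrip('/')}"


def load_training_csv(spark, backend: str, relative_path: str):
    """Read a CSV file into a DataFrame, regardless of storage backend."""
    # `spark` is an existing SparkSession supplied by the caller.
    return spark.read.csv(dataset_uri(backend, relative_path),
                          header=True, inferSchema=True)


nfs_path = dataset_uri("nfs", "/rossmann/train.csv")
hdfs_path = dataset_uri("hdfs", "rossmann/train.csv")
print(nfs_path)
print(hdfs_path)
```

Keeping the backend choice in one helper makes it simple to rerun the same workflow against each file system when you compare run times.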

Bonus sentiment analysis

In addition to Horovod distributed TensorFlow and Keras training, we also executed variations of financial sentiment analysis workloads on the Nasdaq top 10 companies’ quarterly earnings call transcripts. We compared run-time performance on different file systems, and we performed click-through rate (CTR) prediction by using DeepFM (deep factorization-machine) models. We generated sentiment analysis result comparison tables by using PySpark Python scripts, and we recorded the run times by storing the data and models in different underlying storage file systems.

More details

For the complete portfolio of NetApp Apache Spark and Hadoop storage positioning, software versions, and hardware used, see NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results.

The benefits of Apache Spark Horovod workloads with NetApp storage

With an Apache Spark distributed Horovod workload and NetApp storage, we built a successful solution with the following benefits:

Analysis of the data from its current location, which prevents the time- and performance-consuming task of moving training data to a Hadoop infrastructure such as HDFS
Reduction in the number of replicas from three to one
Ability for users to decouple compute and storage to scale them independently
Reduced backup time by using the dynamic multithread capability
Enterprise data protection with the rich data management capabilities of NetApp ONTAP and the DataOps Toolkit
More than 7 times faster Spark workflow run time on premises by using distributed Horovod training and model/data parallelism in Keras and TensorFlow

Our testing showed that you can run Apache Spark distributed Horovod DL workloads with NetApp storage to overcome your large-scale distributed data processing, model training, serving, and retraining challenges. In our performance validation tests based on industry-standard benchmarking tools and customer demand, the NetApp Spark solutions demonstrated superior performance over the use of native Hadoop systems.

Learn more about how NetApp can help you

For more technical information about Apache Spark Horovod workloads with NetApp storage solutions, check out NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results. And to learn more about overcoming the challenges of Apache Spark hybrid cloud workloads, data protection and multicloud connectivity, and accelerating IoT analytics workloads, read Hybrid cloud solutions with Apache Spark and NetApp AI.

Rick Huang

Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. With prior experience in the ad-tech industry as a data engineer and then as a technical solutions consultant, he has expertise that includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program developments. Since joining NetApp in 2019, Rick has published several technical reports and has presented on AI and deep learning at multiple GTCs and, for various customers, at the NetApp Data Visionary Center.
