In Hybrid cloud solutions with Apache Spark and NetApp AI, I explained how NetApp® solutions for hybrid cloud can provide data protection and consistent, reliable, cost-efficient performance across your distributed infrastructure for AI workloads.
One common hybrid cloud deep learning (DL) application is natural language processing (NLP), and in two subsequent blog posts, I provided financial sentiment analysis and testing results to validate NetApp solutions with Spark NLP on-premises. That testing produced sentence-level sentiment analysis for the Nasdaq top 10 companies’ quarterly earnings calls.
In this blog post, I discuss another common Spark use case: distributed DL with Horovod. Horovod improves speed, scale, and resource utilization. With the NetApp storage portfolio, your workflows move data seamlessly for model training, validation, and inferencing in Spark clusters on premises and/or to and from the cloud.
Apache Spark is a programming framework for writing Hadoop applications that work directly with the Hadoop Distributed File System (HDFS) and other file systems, such as NFS and object storage. It’s a fast analytics engine with machine learning (ML) libraries that are designed for large-scale distributed data processing, and it functions seamlessly with the NetApp AI and modern data analytics portfolio. Its in-memory operations are more efficient than MapReduce for data pipelines, streaming, interactive analysis, and ML/DL algorithms. Apache Spark also mitigates the I/O operational challenges that you might experience with Hadoop.
With Horovod, you can choose from popular DL frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. After a training script has been written for scale by using the Horovod API, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. In addition to being easy to use and to scale, Horovod is fast, achieving 90% scaling efficiency for both Inception V3 and ResNet-101 and reaching 68% scaling efficiency for VGG-16 benchmarks.
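As a rough illustration of what these percentages mean, scaling efficiency is commonly computed as the measured multi-GPU throughput divided by the ideal linear throughput (the number of GPUs multiplied by the single-GPU throughput). The following minimal Python sketch assumes that definition. The function name and the throughput figures are hypothetical and are not taken from any benchmark run.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Return measured throughput as a fraction of ideal linear scaling."""
    if num_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("need at least one GPU and a positive baseline throughput")
    # Ideal scaling: every added GPU contributes a full single-GPU throughput.
    ideal_throughput = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal_throughput


# Hypothetical numbers: 100 images/sec on 1 GPU, 7,200 images/sec on 80 GPUs.
efficiency = scaling_efficiency(100.0, 7200.0, 80)
print(f"{efficiency:.0%}")  # prints "90%"
```

An efficiency of 90% therefore means that 80 GPUs deliver the work of 72 ideal GPUs, with the rest lost to communication and synchronization overhead.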
Before you decide to deploy Apache Spark Horovod workloads with NetApp storage to overcome your large-scale distributed data processing and DL challenges, you might need to answer questions such as:
In Deep Learning with Apache Spark and NetApp AI – Financial Sentiment Analysis, I addressed your DL challenges based on our findings from many proof-of-concept (POC) studies with large-scale customers in the financial services and automotive industries. For Horovod distributed training, the specific considerations include:
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software that’s optimized and certified for the industry’s leading virtualization platform, VMware vSphere, and for VMware Cloud Foundation. NetApp supports NVIDIA AI Enterprise on VMware, and NetApp is the first enterprise storage partner to validate storage solutions with NVIDIA AI Enterprise for VMware environments. The new platform is offered in addition to the proven NetApp ONTAP® AI infrastructure that’s built on NVIDIA DGX BasePOD for bare-metal deployments. With this new platform from NetApp and NVIDIA, you get best-in-class data storage and data management with NetApp all-flash cloud-integrated arrays and AI software.
NetApp works with independent software vendors (ISVs) like Domino Data Lab and Run:ai and open-source frameworks like Kubeflow and MLflow, which can be used to orchestrate your Horovod workflows.
To confirm that NetApp technologies can resolve these potential issues, we defined the deliverables for solutions that use an Apache Spark Horovod workload with NetApp storage. We used Horovod on Spark, TensorFlow, and Keras to validate Spark performance with NetApp on-premises, cloud-native, and hybrid cloud solutions. Those solutions used NetApp AFF controllers, Azure NetApp Files, and NetApp StorageGRID® object storage.
Training workflow setup
The Horovod on Spark package provides a convenient wrapper around Horovod that makes running distributed training workloads in Spark clusters simple. This package enables a tight model design loop in which data processing, model training, and model evaluation are all performed in Spark, where training and inferencing data resides. You can use two APIs to run Horovod on Spark: a high-level Estimator API and a lower-level Run API. Although both use the same underlying mechanism to launch Horovod on Spark executors, the Estimator API abstracts the data processing, model training loop, model checkpointing, metrics collection, and distributed training. In our validation testing, we used Horovod Spark Estimators, TensorFlow, and Keras for an end-to-end data preparation and distributed training workflow based on the Kaggle Rossmann Store Sales competition.
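The Estimator API flow can be sketched roughly as follows. This is a minimal sketch and not the script used in our testing: the function name build_estimator, the column names, the hyperparameter values, and the work directory are hypothetical, and the sketch assumes that Horovod was installed with Spark and Keras support. The Horovod imports sit inside the function so that the snippet can be loaded on a machine that doesn't have Horovod.

```python
def build_estimator(model, optimizer, work_dir, num_proc,
                    feature_cols, label_cols):
    """Wrap a Keras model in a Horovod Spark Estimator (illustrative only)."""
    # Local imports: these require Horovod with Spark and Keras support.
    import horovod.spark.keras as hvd
    from horovod.spark.common.store import Store

    # The store holds intermediate training data and model checkpoints.
    # Store.create chooses a store implementation from the path prefix.
    store = Store.create(work_dir)

    return hvd.KerasEstimator(
        num_proc=num_proc,        # number of parallel training processes
        store=store,
        model=model,
        optimizer=optimizer,
        loss="mae",               # assumed loss; hyperparameters are placeholders
        feature_cols=feature_cols,
        label_cols=label_cols,
        batch_size=128,
        epochs=10,
        verbose=2,
    )


# Typical use inside a Spark session, where train_df and test_df are
# Spark DataFrames and the path and column names are hypothetical:
#   estimator = build_estimator(model, optimizer, "/mnt/nfs/horovod_work", 4,
#                               ["features"], ["Sales"])
#   keras_model = estimator.fit(train_df).setOutputCols(["Sales_pred"])
#   pred_df = keras_model.transform(test_df)
```

Calling fit on the estimator returns a Spark Transformer, so prediction runs as an ordinary DataFrame transform in the same cluster that prepared the data.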
We also used the script keras_spark_horovod_rossmann_estimator.py, which you can find in the Python scripts for each major use case in NetApp technical report TR-4570. The script contains three parts. The first part performs various data preprocessing steps over an initial set of CSV files that are provided by Kaggle and that were gathered by the community. The input data is separated into a training set, with a validation subset, and a testing dataset. The second part defines a Keras deep neural network (DNN) model with a logarithmic sigmoid activation function and an Adam optimizer, and it performs distributed training of the model by using Horovod on Spark. The third part performs prediction on the testing dataset by using the best model that minimizes the validation set’s overall mean absolute error, and it creates an output CSV file.
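The model-selection idea in the third part (keep the model with the lowest validation mean absolute error) can be illustrated in plain Python. This is a minimal sketch with made-up sales values and per-epoch predictions. The helper names are hypothetical, and the actual script relies on Keras and Horovod checkpointing rather than on these functions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_epoch(y_true, predictions_by_epoch):
    """Return (epoch, mae) for the epoch with the lowest validation MAE."""
    scores = {epoch: mean_absolute_error(y_true, preds)
              for epoch, preds in predictions_by_epoch.items()}
    epoch = min(scores, key=scores.get)
    return epoch, scores[epoch]


# Made-up validation targets and per-epoch predictions.
validation_sales = [100.0, 200.0, 300.0]
predictions_by_epoch = {
    1: [130.0, 170.0, 330.0],  # every prediction is off by 30
    2: [110.0, 190.0, 310.0],  # every prediction is off by 10
    3: [120.0, 180.0, 320.0],  # every prediction is off by 20
}
print(best_epoch(validation_sales, predictions_by_epoch))  # prints (2, 10.0)
```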
Run-time results: NFS versus HDFS and local storage
The following figure shows various run-time results, comparing the use of a NetApp AFF high-availability (HA) pair, HDFS, and local storage. NFS direct access with an AFF HA pair clearly outperforms the other two options by up to 7.55 times in terms of run-time speedup, for several reasons:
In terms of workflow run time, the speedup can vary depending on the specific use case and dataset. In general, however, NFS direct access can be several times faster than the use of HDFS and local storage. In this case, NFS direct access is 7.55 times faster than local storage and 1.43 times faster than HDFS.
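The two reported ratios also imply how HDFS compares with local storage in this test. The short worked example below assumes that "N times faster" means the ratio of run times, and it normalizes the NFS direct access run time to 1. Only the values 7.55 and 1.43 come from the results above; the variable and function names are illustrative.

```python
def speedup(baseline_runtime, improved_runtime):
    """Speedup is the baseline run time divided by the improved run time."""
    return baseline_runtime / improved_runtime


nfs_runtime = 1.0                    # normalized run time (arbitrary unit)
local_runtime = 7.55 * nfs_runtime   # local storage takes 7.55 times as long
hdfs_runtime = 1.43 * nfs_runtime    # HDFS takes 1.43 times as long

hdfs_vs_local = speedup(local_runtime, hdfs_runtime)
print(round(hdfs_vs_local, 2))  # prints 5.28
```

Under that assumption, HDFS is roughly 5.3 times faster than local storage for this workload, and NFS direct access adds a further 1.43 times on top of that.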
It’s important to note that the NFS direct access protocol is designed for shared data access and is also built to handle the big data processing tasks that HDFS is designed for. The goal of NFS direct access is to give you a fast way to access remote data that is more reliable, more consistent, and less complex than HDFS. You also get performance and throughput benefits in both read and write operations. On top of that, you get battle-tested, industry-proven NetApp technologies like Snapshot™, SnapMirror®, FlexCache®, and FlexClone® capabilities.
NetApp BlueXP for unified control
Our testing solution also included the NetApp BlueXP™ unified control plane. BlueXP brings industry-leading NetApp storage and data services to your hybrid multicloud experience. BlueXP enables you to simplify and to automate the deployment and management of NetApp storage solutions. It also includes the NetApp Astra™ product family, which provides application data protection, mobility, and storage for cloud-native, containerized applications.
With BlueXP, you get a single point of control to manage your data storage across multiple remote and branch office locations. It also automates the process of tiering data based on your usage and performance requirements. Your IT administrators can manage your organization’s storage resources across the entire enterprise from a single web-based interface.
BlueXP is a software-defined solution that you can run on premises or in the cloud. It can be integrated with other NetApp products such as OnCommand® Insight and SnapCenter® technology, and it can be an integral part of your NetApp-powered data fabric.
NetApp Astra for cloud-native applications
The NetApp Astra family enables your teams to focus on delivering cloud-native Apache Spark applications with full confidence in their data infrastructure. Astra helps you prevent application downtime and data loss, reduce time-consuming operations, and scale to fit your needs. Astra includes:
If you’re a NetApp ONTAP customer, you can also deploy Astra Control as on-premises software. This version of Astra Control, called Astra Control Center, extends the rich enterprise-class data management capabilities of ONTAP to Kubernetes applications.
Data lake and access
Our testing storage solution used a shared data lake that serves large data compute server farms simultaneously. We based this solution on ONTAP, which provides easy management and superior performance, with NFS direct access for fast and secure access to NFS data and with HDFS for access to distributed low-cost storage.
Bonus sentiment analysis
In addition to Horovod distributed TensorFlow and Keras training, we also executed variations of financial sentiment analysis workloads on the Nasdaq top 10 companies’ quarterly earnings call transcripts. We compared run-time performance on different file systems, and we performed click-through rate (CTR) prediction by using DeepFM (deep factorization-machine) models. We generated sentiment analysis result comparison tables by using PySpark Python scripts, and we recorded the run times by storing the data and models in different underlying storage file systems.
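A run-time comparison of this kind can be sketched with a small timing harness. This is a minimal sketch under stated assumptions, not the scripts that we used: the helper names are hypothetical, each workload is assumed to be a Python callable that takes a storage path, and the paths in the usage comment are placeholders.

```python
import time


def time_workload(workload, storage_paths):
    """Run workload(path) once per storage location; return {label: seconds}."""
    timings = {}
    for label, path in storage_paths.items():
        start = time.perf_counter()
        workload(path)
        timings[label] = time.perf_counter() - start
    return timings


def relative_to_fastest(timings):
    """Express each run time as a multiple of the fastest run time."""
    fastest = min(timings.values())
    if fastest <= 0:
        raise ValueError("run times must be positive to compute ratios")
    return {label: seconds / fastest for label, seconds in timings.items()}


# Example with placeholder paths and a hypothetical run_sentiment_job function:
#   paths = {"nfs": "/mnt/nfs/data", "hdfs": "hdfs:///data", "local": "/tmp/data"}
#   print(relative_to_fastest(time_workload(run_sentiment_job, paths)))
```

A single run per location is noisy, so a real comparison would repeat each run and account for caching effects between runs.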
More details
For the complete portfolio of NetApp Apache Spark and Hadoop storage positioning, software versions, and hardware used, see NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results.
With an Apache Spark distributed Horovod workload and NetApp storage, we built a successful solution with the following benefits:
Our testing showed that you can run Apache Spark distributed Horovod DL workloads with NetApp storage to overcome your large-scale distributed data processing, model training, serving, and retraining challenges. In our performance validation tests based on industry-standard benchmarking tools and customer demand, the NetApp Spark solutions demonstrated superior performance over the use of native Hadoop systems.
For more technical information about Apache Spark Horovod workloads with NetApp storage solutions, check out NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results. And to learn more about overcoming the challenges of Apache Spark hybrid cloud workloads, data protection and multicloud connectivity, and accelerating IoT analytics workloads, read Hybrid cloud solutions with Apache Spark and NetApp AI.
Rick Huang is a Technical Marketing Engineer at NetApp AI Solutions. With prior experience in the ad-tech industry as a data engineer and then as a technical solutions consultant, he has expertise that includes healthcare IT, conversational AI, Apache Spark workflows, and NetApp AI Partner Network joint program developments. Rick has published several technical reports since joining NetApp in 2019 and has presented at multiple GTCs and at the NetApp Data Visionary Center for various customers on AI and deep learning.