
NetApp ONTAP reaches 171GiBps with NVIDIA GPUDirect Storage

David Arnette

A year ago, I wrote a blog post about NetApp® ONTAP® support for NVIDIA GPUDirect Storage™ (GDS) technology at a time when GDS and the surrounding ecosystem were still very new. As with many new things, there was a lot of room for improvement, and our initial release of NFS over RDMA was no exception. I’m happy to say that, as usual with NetApp innovation in collaboration with NVIDIA, the latest version of ONTAP has met and exceeded the “room-for-improvement” challenge.

The latest release increases the performance of NFS over RDMA and GDS dramatically, and you can now get more than 171GiBps from an ONTAP storage cluster to a single NVIDIA DGX A100 compute node. So you can achieve the highest levels of performance for machine learning and deep learning (ML/DL) workloads, using data center–standard protocols and technologies to deliver the simplest deployment and operational experience. If your existing ONTAP systems have the appropriate network adapters, you can add this level of performance with a free upgrade simply by updating to ONTAP 9.12.1 or later.

A reminder about how NVIDIA GDS works

As a quick refresher, GDS enables RDMA-based data transfers between a storage system and host GPU memory. GDS allows your data to move directly from the host network interface into GPU memory, bypassing several steps in the transfer process, including CPU interrupts and main-memory bounce buffers. The net result is a significant reduction in transfer latency, along with lower CPU and system memory utilization because both are removed from the process altogether. The following figure shows how GDS streamlines the data transfer process.

Figure: Data path comparison of NFS over TCP, NFS over RDMA, and NFS over RDMA with GPUDirect Storage.

How ONTAP and NFS improvements help GDS optimize resources for AI/ML

ONTAP supports GDS use of NFS over RDMA, but it’s not your parents’ NFS. NFS has been around for decades, and although it’s ubiquitous in the enterprise, it’s commonly relegated to secondary or tertiary storage for workloads that don’t require the highest levels of performance. That reputation often stems from experiences with older implementations of Ethernet and NFS. But significant enhancements to both technologies now enable NFS to deliver the performance that AI/ML workloads need, without the dedicated and proprietary hardware or software clients that traditional high-performance computing (HPC) environments require. The use of RDMA with NFSv4.x eliminates TCP-related performance issues with NFS and enables high throughput and low latency by offloading data transfer processing from the storage system and host CPUs. As mentioned, GDS uses this capability to optimize performance and utilization of valuable GPU resources, but other CPU-based workloads can also take advantage of these benefits with no changes to application-level data access.

This level of performance is critical for efficient AI operations at scale, but performance isn’t always the biggest challenge that you face in trying to scale up your production ML/DL and analytics workloads. These workloads typically require complex data pipelines to continuously feed model training jobs with new and refined data. And managing the flow of data across core, edge, and cloud locations is vital to the integration of AI into standard business practices. NetApp ONTAP includes industry-leading data management tools that enable you to build automated hybrid cloud data pipelines that streamline the flow of data. Your data engineers and data scientists can spend more time working on their craft and spend less time waiting for data to be available where and when they need it.

Testing to prove optimized performance

Test setup

To demonstrate the performance enhancements available in ONTAP 9.12.1, the NetApp performance engineering team, led by Sr. Performance Engineer Rodrigo Nascimento, built a unified storage cluster using four NetApp AFF A800 storage systems. Each system consists of two HA storage controllers in a 4RU chassis that also holds up to 48 NVMe SSDs, although only 24 were used in each system for this test.

The controllers were connected with a pair of standard 1RU NetApp cluster interconnect switches, for a total storage system footprint of only 18RU. The A800 controllers support up to four PCIe I/O cards each, and for this testing we populated each controller with two dual-ported 100 Gb/s Ethernet cards (NetApp P/N X1148, NVIDIA ConnectX-5 SmartNIC). Two ports from each controller were connected to an NVIDIA SN3700V Ethernet switch running NVIDIA Cumulus Linux 4.4.

For clients in this environment, we used two DGX A100 systems connected to the SN3700V switch with two 200 GbE ports each, and two DGX-1 systems connected to the SN3700V switch with four 100 GbE ports each. The drawing below shows the topology used for this evaluation.

Figure: Network topology connecting the DGX A100, DGX-1, and AFF A800 systems.

From a logical perspective, the network was provisioned with two separate VLANs, each including one port from each storage controller, and two logical interfaces were configured on each physical port. Each VLAN also contained one port from each DGX A100 system and two ports from each DGX-1 system. Host ports in each VLAN were configured with addresses in one of two IP subnets, and logical storage ports were configured in each subnet so that every server could use all available bandwidth while avoiding routing issues. To contain the data under test, we created eight ONTAP FlexGroup volumes with eight constituents each, one FlexGroup per storage node, and created folders in each to isolate the data used by separate concurrent test instances. For details on how to configure ONTAP for RDMA and the necessary client mount options, see the ONTAP documentation.
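
As an illustration, mounting an ONTAP export over NFS with RDMA from a Linux client typically looks something like the sketch below. The export path, LIF address, and mount point are hypothetical placeholders, and the exact NFS version and mount options to use are the ones recommended in the ONTAP documentation for your release:

  # Mount an ONTAP NFS export over RDMA (illustrative values only)
  sudo mount -t nfs -o vers=4.0,proto=rdma,port=20049,rsize=1048576,wsize=1048576 \
      192.168.10.21:/fg_node1 /mnt/fg_node1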

Test results

The test that we used for this validation was gdsio, an NVIDIA tool for validating GDS performance and functionality. It’s installed by default with the GDS packages at /usr/local/cuda/gds/tools/. We obtained the following results by using sequential read I/O streams generated by multiple gdsio worker threads running concurrently on the clients. To obtain average performance data and to help ensure consistency, we ran multiple tests of 1 hour each.
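
Before running gdsio, it can be worth confirming that the GDS data path is actually available on each client. The gdscheck utility ships in the same tools directory (depending on the CUDA/GDS release it may be named gdscheck or gdscheck.py); a quick check looks like this:

  # Print the cuFile/GDS configuration, including which file systems and drivers are supported
  /usr/local/cuda/gds/tools/gdscheck -p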

The specific gdsio command parameters that we used were as follows (a representative invocation is shown after the list):

  • -s 256m—The size of the file that each gdsio thread operates on.
  • -i 1024k—The transfer block size.
  • -x 0—The transfer mode = GPUDirect.
  • -I 0—The transfer type = read.
  • -T 3600—The run time = 1 hour.
  • -D <mount point>—Mount points were divided across the available storage ports on each host.
  • -d <GPU device>—The GPU for each job; jobs were run concurrently on every GPU in each host.
  • -n <numa node>—The optimal nonuniform memory access (NUMA) node for each GPU as determined by nvidia-smi topo -m.
  • -w <number of threads>—The number of threads per mount point was adjusted on a per-host basis as the system was scaled.
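
Putting those parameters together, a single gdsio worker on one GPU looked roughly like the sketch below. The mount point, GPU index, NUMA node, and thread count are illustrative placeholders; in the actual runs, one instance was launched per GPU and mount-point pairing, with the thread count tuned per host:

  # Illustrative gdsio sequential read test: GPUDirect transfers (-x 0), reads (-I 0), 1-hour run (-T 3600)
  # The NUMA node (-n) is chosen from the GPU/NIC affinity reported by nvidia-smi topo -m
  /usr/local/cuda/gds/tools/gdsio -D /mnt/fg_node1/job0 -d 0 -n 0 -w 8 \
      -s 256m -i 1024k -x 0 -I 0 -T 3600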

By using the configuration and test parameters described previously, the ONTAP storage cluster delivered 171GiBps as reported by the gdsio tool. The following graph shows how the performance of this ONTAP storage cluster scaled in a very linear fashion as more storage nodes were added.

Figure: ONTAP GDS performance scaling as storage nodes are added.

Deploy ONTAP for high performance and scalability for AI

As you can see, the ONTAP system clearly delivers performance in the range required for the most I/O-intensive applications, such as large image rendering, video rendering, and data analytics applications that can benefit from the GDS acceleration library. And because this cluster used only eight of the 24 nodes supported in an ONTAP cluster, additional nodes can easily be added to scale bandwidth even further. As I mentioned before, though, performance like this is often a requirement for GPU-accelerated workloads, but in most cases it isn’t the biggest challenge customers face when turning AI experimentation into production operations. ONTAP not only delivers performance as good as or better than any other available solution, it’s also the only competitive storage platform that includes native data management tools that streamline workflows for data science teams and integrate multi-site/multi-platform data pipelines. With new performance and data management features expected in upcoming ONTAP releases, NetApp will raise the bar again, not just for performance but for total workflow solutions that help companies realize the benefits of AI faster and with better results.

Find out how NetApp ONTAP helps companies like yours realize the benefits of AI faster and with better results now and into the future. Check out netapp.com/ai.

David Arnette

David Arnette is a Sr. Technical Marketing Engineer focused on NetApp infrastructure solutions for artificial intelligence and machine learning. He has published numerous reference architectures for enterprise applications on the FlexPod converged infrastructure and has over 18 years of experience designing and implementing data center storage and virtualization solutions.
