In the past, compute paired with direct-attached storage was used to feed data to AI workflows. But scaling traditional storage can disrupt ongoing operations, and downtime or slow AI performance sets off a chain reaction: data scientists and data engineers lose productivity, and operational expenses spin out of control.
Advances in individual and clustered GPU computing architectures have made NVIDIA DGX systems the preferred platform for workloads such as high-performance computing (HPC), deep learning (DL), video processing, and analytics. Maximizing performance in these environments requires supporting infrastructure, including storage and networking, that can keep the NVIDIA GPUs in DGX systems fed with data, providing dataset access at ultralow latency and high bandwidth.
NetApp® EF-Series AI tightly integrates DGX A100 systems, NetApp EF600 all-flash arrays, and the BeeGFS parallel file system with state-of-the-art InfiniBand networking. The solution simplifies artificial intelligence deployments by eliminating design complexity and guesswork. You can start small and scale seamlessly from science experiments and proofs of concept to production and beyond.
EF600-powered BeeGFS building blocks have been verified with up to eight DGX A100 systems. By adding more of these building blocks, the architecture can scale to multiple racks, supporting many DGX A100 systems and petabytes of storage capacity. This approach offers the flexibility to alter compute-to-storage ratios independently, based on the size of the data lake, the DL models in use, and the required performance metrics.
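To make that compute-to-storage flexibility concrete, here is a minimal sizing sketch in Python. The constants are assumptions drawn from the figures cited in this post (up to eight DGX A100 systems per building block, and roughly 42 GBps of sequential read bandwidth per EF600 enclosure); the helper name and logic are illustrative, not a NetApp sizing tool.

```python
# Hypothetical back-of-the-envelope sizing helper. Per-building-block
# figures are assumptions taken from the numbers cited in this post;
# substitute your own measured values.

import math

DGX_PER_BLOCK_MAX = 8    # verified DGX A100 systems per building block
BLOCK_READ_GBPS = 42.0   # sequential read bandwidth per EF600 enclosure

def building_blocks_needed(num_dgx: int, required_read_gbps: float) -> int:
    """Return the number of EF600 BeeGFS building blocks that satisfies
    both the verified compute ratio and the aggregate bandwidth target."""
    by_compute = math.ceil(num_dgx / DGX_PER_BLOCK_MAX)
    by_bandwidth = math.ceil(required_read_gbps / BLOCK_READ_GBPS)
    return max(by_compute, by_bandwidth)

# Example: 16 DGX A100 systems needing ~120 GBps of aggregate read
# bandwidth resolve to max(2, 3) = 3 building blocks.
print(building_blocks_needed(num_dgx=16, required_read_gbps=120.0))
```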
Investing in state-of-the-art compute demands state-of-the-art storage that can handle thousands of training images per second. You need a high-performance data services solution that keeps up with your most demanding DL training workloads.
The NetApp EF600 all-flash array gives you consistent, near-real-time access to data while supporting multiple workloads simultaneously. To enable fast, continuous feeding of data to AI applications, EF600 storage systems deliver up to 2 million cached read IOPS, response times under 100 microseconds, and 42 GBps of sequential read bandwidth in a single enclosure. With 99.9999% reliability, EF600 storage systems keep data for AI operations available whenever and wherever it's needed.
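As a rough illustration of what that bandwidth means for DL training, the sketch below converts enclosure bandwidth into training images per second. The 150 KiB average image size is an assumption (typical of resized, ImageNet-style JPEGs), not a figure from NetApp; adjust it for your own dataset.

```python
# Back-of-the-envelope: how many training images per second can one
# EF600 enclosure feed at its published sequential read bandwidth?
# The average image size is an assumption; adjust for your dataset.

ENCLOSURE_READ_GBPS = 42.0     # published EF600 sequential read figure
AVG_IMAGE_BYTES = 150 * 1024   # assumed ~150 KiB per preprocessed JPEG

images_per_second = (ENCLOSURE_READ_GBPS * 1e9) / AVG_IMAGE_BYTES
print(f"~{images_per_second:,.0f} images/s")  # roughly 270,000 images/s
```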
BeeGFS is a parallel file system that provides the flexibility needed to meet diverse and evolving AI workload demands. NetApp EF-Series storage systems supercharge BeeGFS storage and metadata services by offloading RAID and other storage tasks, including drive monitoring and wear detection.
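To illustrate why a parallel file system can feed GPUs faster than a single storage server, here is a conceptual sketch of BeeGFS-style file striping. The 512 KiB chunk size and four-target stripe width reflect commonly cited BeeGFS defaults, and the round-robin mapping is a simplification of the real on-disk layout, not BeeGFS source code.

```python
# Conceptual sketch of BeeGFS-style striping: a file is split into
# fixed-size chunks distributed round-robin across storage targets, so
# a large sequential read draws on every target's bandwidth in parallel.
# Chunk size and target count reflect commonly cited BeeGFS defaults;
# the mapping below is a simplification, not the actual BeeGFS layout.

CHUNK_SIZE = 512 * 1024   # 512 KiB default chunk size
NUM_TARGETS = 4           # default stripe width (numtargets)

def chunk_to_target(offset: int) -> int:
    """Map a byte offset within a file to the storage target holding it."""
    chunk_index = offset // CHUNK_SIZE
    return chunk_index % NUM_TARGETS

# A 4 MiB read touches 8 chunks spread evenly across all 4 targets.
offsets = range(0, 4 * 1024 * 1024, CHUNK_SIZE)
print([chunk_to_target(off) for off in offsets])  # [0, 1, 2, 3, 0, 1, 2, 3]
```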
Mike McNamara is a senior product and solution marketing leader at NetApp with over 25 years of experience in data management and cloud storage marketing. Before joining NetApp more than ten years ago, Mike worked at Adaptec, Dell EMC, and HPE. He was a key team leader driving the launch of a first-party cloud storage offering and the industry's first cloud-connected AI/ML solution (NetApp), the first unified scale-out and hybrid cloud storage system and software (NetApp), the first iSCSI and SAS storage systems and software (Adaptec), and the first Fibre Channel storage system (EMC CLARiiON).
In addition to his past role as marketing chairperson for the Fibre Channel Industry Association, he is a member of the Ethernet Technology Summit Conference Advisory Board, a member of the Ethernet Alliance, a regular contributor to industry journals, and a frequent event speaker. Mike also published a book through FriesenPress titled "Scale-Out Storage - The Next Frontier in Enterprise Data Management" and was listed as a top 50 B2B product marketer to watch by Kapost.