As the landscape of generative AI (GenAI) continues to evolve, maintaining data privacy and security remains a paramount concern for organizations like yours. At NetApp, we are committed to continued innovation that helps our customers use GenAI models with their enterprise data, not only to enhance the capabilities of AI but also to safeguard private and sensitive information.
In July of this year, we launched one such innovation, NetApp® BlueXP™ workload factory, to help you accelerate development of GenAI applications that are grounded in knowledge from enterprise datasets. By using BlueXP workload factory, within minutes, you can connect unstructured datasets in your NetApp ONTAP® storage to augment large language models (LLMs) that are available through Amazon Bedrock. BlueXP workload factory deploys an end-to-end retrieval-augmented generation (RAG) pipeline that ingests and transforms unstructured datasets and creates and stores vector embeddings. It also provides a rich API to build applications such as virtual assistants that can derive knowledge from the connected datasets.
And we keep innovating. Today, we are thrilled to announce the integration of NetApp BlueXP workload factory with BlueXP classification service. This new integration is designed to filter out personally identifiable information (PII) from enterprise datasets before they are ingested into the RAG pipeline. This capability sets a new standard for the safeguarding of privacy in GenAI applications.
GenAI is revolutionizing how organizations harness their data, creating powerful insights and driving innovation. However, the inclusion of PII in datasets poses significant privacy risks and challenges in keeping your customer data safe and meeting regulatory standards. Leakage of private information into the models and applications increases the risk of data theft and ransomware attacks, possible penalties from regulators, and loss of business. By introducing robust data guardrails to enforce stringent privacy controls for RAG pipelines, our latest update to BlueXP workload factory helps you overcome these challenges.
Data guardrails for knowledge bases
When you create a knowledge base in BlueXP workload factory, you can now enforce data guardrails that identify and exclude PII from the data sources that are connected to the knowledge base. This capability prevents personal information from being ingested into the knowledge base, maintaining the integrity and privacy of the data.
To use the data guardrails feature, you must deploy the BlueXP classification service. Data guardrails rely on BlueXP classification to detect, filter, and redact private information during the data ingestion process, making it an integral part of the workflow. BlueXP classification is a free service from NetApp that identifies and classifies PII within your datasets that reside on NetApp storage. Such data includes credit card numbers, email addresses, IP addresses, passwords, and national identifiers.
The integration with workload factory supports detecting and redacting PII in the file formats that knowledge bases support. For a comprehensive list of the private and sensitive data that BlueXP classification can detect, review the BlueXP classification documentation.
How it works
The PII guardrail feature is integrated into the knowledge base creation workflow in BlueXP workload factory. Before you can configure guardrails, however, you must deploy the BlueXP classification service (version 1.36 or later) in the same AWS account and Virtual Private Cloud (VPC) as the Amazon FSx for NetApp ONTAP file system that provides the source datasets for your knowledge bases. BlueXP classification runs on an m6i.4xlarge instance that’s deployed in your VPC. BlueXP workload factory automatically discovers the BlueXP classification service and sets up the communication channel between the NetApp AI engine that’s deployed in your VPC and the classification instance.
After BlueXP classification has been deployed, during the creation of knowledge bases in BlueXP workload factory, you can configure data guardrail enforcement.
As the following figure shows, when BlueXP workload factory ingests text documents into the knowledge base, each document chunk is first passed to the classification engine to detect personal information. If the classification engine finds any such information, it is redacted and replaced with “<PII REMOVED>.” The chunk is then passed to the embedding model, which converts the chunk to vectors that are stored in the knowledge base vector store. Because the data guardrails filter out PII before the data is embedded into the vector database, only nonsensitive information is ingested into the RAG pipeline and made available to the embedding and language models.
Applications such as virtual assistants that are built on a knowledge base configured with data guardrails cannot expose to end users any PII that BlueXP classification detected. The following figure shows that a virtual assistant cannot retrieve PII because it has already been redacted.
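The order of operations described above (detect, redact, then embed and store) can be illustrated with a minimal Python sketch. It is not the BlueXP implementation: the regular expressions are a hypothetical stand-in for the BlueXP classification engine, `toy_embed` is a placeholder for a real embedding model, and the in-memory list stands in for the vector store. Only the “<PII REMOVED>” placeholder comes from the product behavior described here.

```python
import re
from typing import Callable, List, Tuple

PLACEHOLDER = "<PII REMOVED>"

# Hypothetical detectors used only for illustration. The real service
# detects many more categories and does not necessarily use regexes.
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),                      # 16-digit card number
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),                     # IPv4 address
]


def redact(chunk: str) -> str:
    """Replace every match of the stand-in PII patterns with the placeholder."""
    for pattern in PII_PATTERNS:
        chunk = pattern.sub(PLACEHOLDER, chunk)
    return chunk


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: counts of the letters a-z (assumption, not a real model)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def ingest(
    chunks: List[str],
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[Tuple[str, List[float]]]:
    """Redact each chunk first, then embed and store only the redacted text."""
    store: List[Tuple[str, List[float]]] = []
    for chunk in chunks:
        clean = redact(chunk)               # guardrail runs before embedding
        store.append((clean, embed(clean)))  # the raw chunk is never stored
    return store


if __name__ == "__main__":
    documents = [
        "Contact jane.doe@example.com from 10.0.0.5.",
        "Card 4111 1111 1111 1111 on file.",
        "No personal data here.",
    ]
    for text, _vector in ingest(documents):
        print(text)
```

Because redaction happens before the embedding call, neither the stored text nor the stored vector is derived from the original PII, which is the property the guardrail relies on.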
With the integration of BlueXP workload factory and BlueXP classification, your organization can use its data to enhance user interactions without compromising data privacy and without increasing the risk of exposing personal data to language models. When you combine this capability with Amazon Bedrock Guardrails applied to model inference or to agents, you can add extra safeguards, such as content filters, denied topics, and word filters.
At NetApp, we understand the importance of data security and safeguarding data privacy, especially when enterprise data is connected to GenAI models and applications. By integrating BlueXP classification with BlueXP workload factory, we provide a powerful RAG platform that simplifies making GenAI relevant to your enterprise context and that keeps sensitive information protected. We are excited to see how this new capability will empower organizations like yours to build more secure and effective AI applications.
To find out more about BlueXP classification capabilities, review the BlueXP classification documentation. To get started on creating secure RAG pipelines with your enterprise datasets, sign up for BlueXP workload factory and review the BlueXP workload factory documentation.
Puneet is a Senior Director of Product Management at NetApp, where he leads product management for the FSx for NetApp ONTAP service offering with AWS, with a specific focus on AI and generative AI solutions. Before joining NetApp, Puneet held multiple product leadership roles at Amazon Web Services (AWS) and Dell Technologies in areas including hybrid cloud infrastructure, cloud storage, scale-out and distributed systems, high-performance computing, and enterprise solutions. In those roles he led product vision and strategy, roadmap planning and execution, partnerships, and go-to-market strategy.