Data deduplication is a process that eliminates redundant copies of data, significantly reducing storage capacity requirements.
Deduplication can run inline, as data is being written to the storage system; as a background process that eliminates duplicates after the data is written to disk; or both.
At NetApp, deduplication is a zero data-loss technology that is run both as an inline process and as a background process to maximize savings. It is run opportunistically as an inline process so that it doesn’t interfere with client operations, and it is run comprehensively in the background to maximize savings. Deduplication is turned on by default, and the system automatically runs it on all volumes and aggregates without any manual intervention.
The performance overhead is minimal for deduplication operations, because it runs in a dedicated efficiency domain that is separate from the client read/write domain. It runs behind the scenes, regardless of what application is run or how the data is being accessed (NAS or SAN).
Deduplication savings are maintained as data moves around: when the data is replicated to a DR site, when it is backed up to a vault, or when it moves between on-premises, hybrid cloud, and public cloud environments.
Deduplication operates at the 4KB block level within an entire FlexVol® volume and among all the volumes in the aggregate, eliminating duplicate data blocks and storing only unique data blocks.
The core enabling technology of deduplication is fingerprints — unique digital signatures for all 4KB data blocks.
When data is written to the system, the inline deduplication engine scans the incoming blocks, creates a fingerprint, and stores the fingerprint in a hash store (in-memory data structure).
After the fingerprint is computed, a lookup is performed in the hash store. On a fingerprint match, the data block corresponding to the duplicate fingerprint (the donor block) is retrieved from cache memory and compared byte by byte with the incoming block; if the blocks are identical, the incoming block is shared with the donor rather than written again.
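The inline path described above can be sketched in a few lines of Python. This is a simplified model, not NetApp's implementation: the hash store is a plain dictionary, SHA-256 stands in for the fingerprint function, and "disk" is a Python list.

```python
import hashlib

BLOCK_SIZE = 4096  # deduplication operates at 4KB granularity

hash_store = {}    # fingerprint -> index of stored block (models the in-memory hash store)
storage = []       # unique blocks actually written "to disk"

def write_block(block: bytes) -> int:
    """Inline path: fingerprint the incoming 4KB block, look it up in the
    hash store, and share an existing (donor) block on a verified match."""
    assert len(block) == BLOCK_SIZE
    fingerprint = hashlib.sha256(block).digest()
    donor_index = hash_store.get(fingerprint)
    # Byte-by-byte comparison against the donor rules out false positives.
    if donor_index is not None and storage[donor_index] == block:
        return donor_index               # duplicate: reference the donor block
    storage.append(block)                # unique: write a new block
    hash_store[fingerprint] = len(storage) - 1
    return len(storage) - 1
```

Writing the same 4KB pattern twice stores it only once; a different pattern consumes a new block.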
The background deduplication engine works in the same way. It scans all the data blocks in the aggregate and eliminates duplicates by comparing fingerprints of the blocks and by doing a byte-by-byte comparison to eliminate any false positives. This procedure also ensures that there is no data loss during the deduplication operation.
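The background pass can be modeled the same way: group every block already on disk by fingerprint, verify each group byte by byte, and keep one copy per group. This is a hypothetical sketch of the technique, not the actual engine; function and variable names are illustrative.

```python
import hashlib
from collections import defaultdict

def background_dedupe(blocks: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Background pass: fingerprint every existing block, then keep one copy
    per group of verified duplicates. Returns (unique_blocks, block_map),
    where block_map[i] is the index of the block now backing original block i."""
    by_fingerprint: dict[bytes, list[int]] = defaultdict(list)
    for i, block in enumerate(blocks):
        by_fingerprint[hashlib.sha256(block).digest()].append(i)

    unique: list[bytes] = []
    block_map = [0] * len(blocks)
    for indices in by_fingerprint.values():
        donor = blocks[indices[0]]
        unique.append(donor)
        for i in indices:
            # Byte-by-byte check guards against hash collisions (false positives),
            # so no data is lost even if two distinct blocks share a fingerprint.
            if blocks[i] == donor:
                block_map[i] = len(unique) - 1
            else:
                unique.append(blocks[i])
                block_map[i] = len(unique) - 1
    return unique, block_map
```

After the pass, duplicate blocks point at a single shared copy while distinct blocks remain untouched.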
NetApp® deduplication offers several significant advantages:
Deduplication is useful regardless of workload type. Maximum benefit is seen in virtual environments where multiple virtual machines are used for test/dev and application deployments.
Virtual desktop infrastructure (VDI) is another very good candidate for deduplication, because the duplicate data among desktops is very high.
Some relational databases, such as Oracle Database and Microsoft SQL Server, do not benefit greatly from deduplication, because they often embed a unique key in each database record, which prevents the deduplication engine from identifying otherwise-identical blocks as duplicates.
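The effect of embedded keys is easy to demonstrate. In this hypothetical example, two 4KB blocks carry identical payloads but differ in an 8-byte record key; even a single differing byte changes the entire fingerprint, so neither block can be deduplicated against the other.

```python
import hashlib

# Two hypothetical database blocks: identical payloads, but each carries
# a unique record key in its first 8 bytes.
payload = b"x" * 4088
block_a = b"key00001" + payload
block_b = b"key00002" + payload

fp_a = hashlib.sha256(block_a).digest()
fp_b = hashlib.sha256(block_b).digest()
# The fingerprints differ, so block-level deduplication sees no duplicate,
# even though 4088 of the 4096 bytes are identical.
```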
Deduplication is automatically enabled on all new volumes and aggregates on AFF systems. On other systems, deduplication can be enabled on a per-volume or per-aggregate basis.
Once enabled, the system automatically runs both inline and background operations to maximize savings.
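On a non-AFF system, enabling efficiency on an existing volume might look like the following ONTAP CLI sketch. The vserver and volume names are placeholders, and the exact options available vary by ONTAP release, so verify against the command reference for your version.

```shell
# Enable storage efficiency on an existing volume (names are placeholders)
volume efficiency on -vserver vs1 -volume vol1

# Turn on inline deduplication for that volume
volume efficiency modify -vserver vs1 -volume vol1 -inline-dedupe true

# Confirm the current efficiency configuration
volume efficiency show -vserver vs1 -volume vol1
```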