Deduplication vs. compression: two concepts in data reduction

According to the market research company IDC, the amount of data in the world doubles roughly every 24 months. At this rate, the digital universe is set to reach a volume of 44 zettabytes by 2020, which corresponds to 44 trillion gigabytes of data produced or copied in a single year. This development primarily affects storage techniques, backup procedures, and recovery systems, which must be capable of carrying and handling enormous loads of data. Technical implementations therefore focus on reducing the amount of physically stored information and, with it, the costs of data management. These methods are essentially based on two concepts: data compression and deduplication. While loss-free compression exploits redundancies within a file to compress data, deduplication algorithms compare data across files to avoid storing duplicates. The main application for deduplication is therefore data backup.

Deduplication

Deduplication is a data reduction process that is essentially based on preventing data redundancies in the storage system. It relies on a deduplication engine that uses special algorithms to identify and eliminate redundant files and data blocks. The engine can be integrated into backup software or into the storage hardware, or implemented as an intermediary appliance.

The aim of deduplication in storage systems is to write only as much information to non-volatile storage media as is necessary to reconstruct a file without loss. The more duplicates that are removed, the smaller the data volume that needs to be stored or transferred. Duplicates can be identified at file level, as is the case with Git or Dropbox, for example. A more efficient approach, however, is to use deduplication algorithms that work at the sub-file level. For this, files are first broken down into data blocks (chunks), and each chunk is assigned a unique checksum, or hash value. A tracking database, which contains all of these checksums, acts as the central supervisory entity.
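As a simplified illustration of sub-file deduplication, the following Python sketch splits two files into fixed-size chunks and fingerprints each chunk. The 4 KB block size and the use of SHA-256 as the checksum are assumptions made for the example, not a requirement of any particular system.

import hashlib

def chunks(data, block_size=4096):
    # Split a byte string into fixed-size chunks (the last one may be shorter).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def fingerprints(data, block_size=4096):
    # One checksum per chunk: identical chunks always yield identical hash values.
    return [hashlib.sha256(c).hexdigest() for c in chunks(data, block_size)]

# Two files that share most of their content produce largely identical fingerprints,
# so only the differing chunks would have to be stored twice.
file_a = bytes([1]) * 4096 + bytes([2]) * 4096 + bytes([3]) * 4096
file_b = bytes([1]) * 4096 + bytes([2]) * 4096 + bytes([9]) * 4096
shared = set(fingerprints(file_a)) & set(fingerprints(file_b))
print(len(shared), "of", len(fingerprints(file_a)), "chunk hashes are shared")   # 2 of 3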

The block-based deduplication methods can be broken down into two variations:

  • Deduplication with a fixed block size: for deduplication with a fixed block size, the algorithm divides files into sections of exactly the same length. This is generally based on the cluster size of the file system or RAID system (typically 4 KB), but can sometimes also be configured manually. In that case, the individually adjusted block size is adopted as the standard size for all data blocks.
  • Deduplication with variable block size: with deduplication with variable block sizes, on the other hand, no fixed standard size is defined. Instead, the algorithm divides the data into blocks whose lengths vary depending on the type of data being processed.

The type of block division has a massive influence on the efficiency of the data deduplication. This is especially noticeable when deduplicated files are subsequently modified.

If a data block in a scheme with fixed block boundaries is extended with additional information, the contents of all subsequent blocks shift relative to those fixed boundaries. Although the change only affects one data block, the deduplication algorithm classifies all subsequent segments of the file as new because their block boundaries have moved, unless the number of inserted bytes is an exact multiple of the fixed block size. Since these reclassified data blocks have to be saved again, a backup based on deduplication with a fixed block size increases both the computing effort and the required bandwidth.

If, on the other hand, an algorithm uses variable block boundaries, modifications to an individual data block have no effect on the following segments. Instead, only the modified data block is extended with the new bytes and stored again. This relieves the burden on the network, as less data is transferred during a backup. However, this flexibility is more computing-intensive, as the algorithm must first work out how to split the data into chunks.
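The effect of variable block boundaries can be demonstrated with a toy content-defined chunking sketch. The simple byte-sum fingerprint below is merely a stand-in for the rolling hashes (for example, Rabin fingerprints) used in real systems; the point is only that the cut positions depend on the content itself, so an insertion at the start of a file does not invalidate the later chunks.

import hashlib, os

def content_defined_chunks(data, window=16, mask=0x1F):
    # Cut a chunk whenever a checksum over the last `window` bytes hits a pattern.
    # Because the cut decision depends only on local content, inserting bytes near
    # the start of a file shifts the data but not the later chunk boundaries.
    chunks, start = [], 0
    for i in range(window, len(data)):
        fingerprint = sum(data[i - window:i])   # toy stand-in for a real rolling hash
        if fingerprint & mask == 0:             # boundary condition met
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def chunk_hashes(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

original = os.urandom(4096)
modified = b"XX" + original                     # two bytes inserted at the front
shared = chunk_hashes(content_defined_chunks(original)) & \
         chunk_hashes(content_defined_chunks(modified))
print(len(shared), "chunk hashes survive the insertion unchanged")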

Identifying redundant chunks is based on the assumption that data blocks with identical hash values contain identical information. To filter out redundant chunks, the deduplication algorithm only needs to compare newly calculated hash values with the tracking database. If it finds identical checksums, the redundant chunks are replaced by pointers to the storage location of the identical data block. Such a pointer takes up far less space than the data block itself. The more chunks in a file that can be replaced by placeholders, the less storage space is required. However, the efficiency of the data reduction cannot be predicted in advance, since it depends strongly on the source file and its data structure. In addition, deduplication is only suitable for unencrypted data: encryption systems deliberately avoid recognizable redundancies, which makes this kind of pattern recognition impossible.
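The role of the tracking database and the pointers can be sketched in a few lines. The data structures below (a dictionary as the tracking database, a list as the chunk store, and a "recipe" of checksums standing in for the pointers) are deliberately simplified assumptions for the example.

import hashlib

def dedupe_chunks(chunks, tracking_db, chunk_store):
    # tracking_db maps checksum -> storage location; chunk_store holds the actual data.
    # The returned recipe contains only pointers (checksums), never duplicate chunks.
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in tracking_db:          # first occurrence: store the chunk once
            tracking_db[digest] = len(chunk_store)
            chunk_store.append(chunk)
        recipe.append(digest)                  # pointer instead of the chunk itself
    return recipe

def restore(recipe, tracking_db, chunk_store):
    # The original byte stream can be rebuilt losslessly from the pointers alone.
    return b"".join(chunk_store[tracking_db[digest]] for digest in recipe)

tracking_db, chunk_store = {}, []
chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]    # first and third chunk are identical
recipe = dedupe_chunks(chunks, tracking_db, chunk_store)
print(len(chunk_store), "chunks stored for", len(recipe), "chunk references")   # 2 vs 3
assert restore(recipe, tracking_db, chunk_store) == b"".join(chunks)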

Deduplication can be carried out either in the storage destination or at the data source.

Source deduplication

If redundant data is removed before being transferred to the storage destination, it’s known as source deduplication. In this case, the deduplication engine is integrated into the backup software. Redundant information is eliminated directly in the data source’s file system. To do this, the backup software scans newly-created data blocks at regular intervals and compares them with the ones already saved on the backup server. If a redundant data block is found, it’s skipped during the next backup procedure. If a file is modified, the backup software simply transfers the new parts of the file.

The major advantage of this deduplication method is that only new information is transferred to the backup server during a data backup. This lessens both the bandwidth usage and the required capacity of the storage destination. Unlike deduplication at the storage destination, however, the source-based method places a far higher computing load on the source system, which means it is not suitable for every scenario.
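A greatly simplified sketch of this exchange between backup client and backup server might look as follows. The idea of the client asking which chunk checksums the server already knows, as well as all function names and the 4 KB block size, are illustrative assumptions rather than the protocol of any specific backup product.

import hashlib

def chunk_hashes(data, block_size=4096):
    # Map checksum -> chunk for every fixed-size block of the source data.
    return {hashlib.sha256(data[i:i + block_size]).hexdigest(): data[i:i + block_size]
            for i in range(0, len(data), block_size)}

def source_dedup_backup(data, server_index, server_store):
    # Transfer only those chunks whose checksums the backup server does not know yet.
    local = chunk_hashes(data)
    missing = [h for h in local if h not in server_index]
    for h in missing:
        server_store[h] = local[h]        # only new chunks cross the network
        server_index.add(h)
    return len(missing), len(local)

# Hypothetical backup server state: an index of known checksums plus the chunk store.
server_index, server_store = set(), {}

first_backup = b"".join(bytes([i]) * 4096 for i in range(4))   # four distinct blocks
second_backup = first_backup + b"\xff" * 4096                  # one new block appended

print(source_dedup_backup(first_backup, server_index, server_store))   # (4, 4)
print(source_dedup_backup(second_backup, server_index, server_store))  # (1, 5)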

Target deduplication

Target deduplication is when the data reduction process takes place outside of the data source. The deduplication engine is either integrated into the storage array or placed in front of it as an independent appliance. In both cases, target deduplication reduces the required storage capacity, but unlike source deduplication it does not reduce the amount of data transferred from the source system to the storage destination via LAN or WAN during a backup. Depending on where the deduplication takes place, there are two variants to choose from: post-processing deduplication and inline deduplication.

  • Post-processing deduplication: if the deduplication engine is integrated in the storage array, an unmodified version of the backup data is first saved on the storage device and only then deduplicated. This is known as post-process deduplication. Its advantage is that data can be transferred to the storage destination relatively quickly. However, the data is not immediately available after the transfer, as the backup process must be completed before redundancies can be eliminated. Data on the hard drive is handled three times (written raw, read for deduplication, and written again) before it is available for replication or retrieval, which increases the storage system's workload. In addition, the backup system must provide two separate storage zones: one for the raw backup data and another for the deduplicated data. This requires far more physical storage than inline deduplication. In return, however, post-process deduplication allows a more efficient data reduction based on variable block sizes.
  • Inline deduplication: if the deduplication engine sits in front of the storage medium, redundant data can be removed directly during the transfer, before it is ever written to the storage medium. This reduces the write throughput and restricts deduplication to data blocks of fixed size, but it also significantly reduces the physical storage required on the target system compared to post-process deduplication. In addition, data that has been deduplicated inline is immediately available, for example for recovery or replication. The sketch after this list illustrates the different order of operations in the two approaches.
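The difference in the order of operations can be outlined roughly as follows. This is a minimal sketch under simplifying assumptions: the chunking is fixed-size, the "staging area" and "chunk store" are plain Python structures, and none of the names refer to a real appliance's interface.

import hashlib

def dedupe_into(chunk_store, data, block_size=4096):
    # Chunk the data and keep only blocks whose checksum is not yet known.
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        chunk_store.setdefault(hashlib.sha256(block).hexdigest(), block)

def inline_backup(data, chunk_store):
    # Inline: redundancies are removed before anything reaches the disk,
    # so only the deduplicated chunk store is ever written.
    dedupe_into(chunk_store, data)

def post_process_backup(data, staging_area):
    # Post-process, step 1: the raw backup is written unmodified
    # (fast, but it needs a second storage zone of its own).
    staging_area.append(data)

def post_process_job(staging_area, chunk_store):
    # Post-process, step 2: a later job re-reads the staged data,
    # deduplicates it and frees the staging area.
    while staging_area:
        dedupe_into(chunk_store, staging_area.pop())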

Data compression

In data compression, files are converted into an alternative format, which is more efficient than the original. The aim of this process is to reduce the required memory space as well as the transfer time. A coding gain like this can be achieved with two different approaches:

  • Redundancy compression: with loss-free compression based on redundancy reduction, the original data can be reconstructed exactly after compression; input and output data are therefore identical. This kind of compression is only possible if a file contains redundant information.
  • Irrelevance compression: with lossy compression, information classified as irrelevant is deleted to compress a file, which is always accompanied by a loss of data. The original data can only be recovered approximately after an irrelevance reduction. Which data counts as irrelevant is a matter of definition; in MP3 audio compression, for example, the frequency components removed are those that humans are assumed to hear barely or not at all.

While compression on the storage system level is essentially loss-free, data losses in other areas, such as image, video, and audio transfers, are deliberately accepted to reduce file size.

Both the coding as well as the decoding require computational expense. This is primarily dependent on the compression method used. While some techniques are designed for the most compact representation of output data, others focus on reducing the required computation time. The choice of compression method is therefore always dependent on the requirements of the application.

In the case of a live video or sound transfer, for example, the fastest possible compression method and recovery of data is recommended. In this case, a comparatively low coding gain and possible quality losses are justifiable. It's a different case with files made available to a large number of users via a file server. Here, more time for the compression can be tolerated, so that a powerful compression algorithm can be used. This allows a high coding gain without losses to quality.

Loss-free data compression

While a deduplication algorithm searches for identical data sections across different files and replaces redundant segments with linking placeholders, loss-free data compression works within a single file: sections that are repeated several times are replaced by a far shorter representation.

This is shown in the following example:

Source text: ABCDEABCEEABCEE

Coding:  ABC = 1; EE = 2

Compressed text: 1DE1212

With this coding, the source text of 15 characters can be shortened to 7 characters, a coding gain of more than 50 percent. The prerequisite for this kind of compression, however, is that both the encoding and the decoding system know the coding.
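The substitution can be reproduced in a few lines of Python. The dictionary below is simply the coding from the example; note that this toy scheme only round-trips because the code symbols "1" and "2" do not occur in the source text.

coding = {"ABC": "1", "EE": "2"}              # shared by encoder and decoder

def compress(text):
    for pattern, symbol in coding.items():    # the longer pattern is replaced first
        text = text.replace(pattern, symbol)
    return text

def decompress(text):
    for pattern, symbol in coding.items():
        text = text.replace(symbol, pattern)
    return text

source = "ABCDEABCEEABCEE"
packed = compress(source)
print(packed, len(source), "->", len(packed))     # 1DE1212 15 -> 7
assert decompress(packed) == source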

Most loss-free methods exploit the repetition of individual characters or combinations of characters within a single file. Well-known repetition-based compression methods include word coding and the Lempel-Ziv method (LZ77).

  • Word coding: compression methods based on word coding are best suited for text files. The basic principle here is replacing words with shorter representations. For this purpose, a list is first created and a value is assigned to every word in the file. An entire text can then be converted into numbers.

Source text: to be or not to be

Coding: to = 1; be = 2; or = 3; not = 4

Compressed text: 123412

Since this compression method requires the word list to be saved in addition to the payload data, word coding is most effective for longer texts with many repetitions.
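A minimal sketch of such a word coding, which reproduces the example above, could look like this; assigning numbers in order of first appearance is an assumption of this sketch.

def word_code(text):
    # Assign a number to every distinct word and encode the text as a number sequence.
    # The word list must be stored alongside the coded text so it can be decoded later.
    word_list = {}
    coded = []
    for word in text.split():
        number = word_list.setdefault(word, len(word_list) + 1)
        coded.append(str(number))
    return word_list, "".join(coded)

print(word_code("to be or not to be"))
# ({'to': 1, 'be': 2, 'or': 3, 'not': 4}, '123412')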

  • Lempel-Ziv method: the Lempel-Ziv method is a dictionary-based method that compresses character strings by exploiting redundancies. For this purpose, the compression algorithm uses the content it has already read as its dictionary, so that no separate dictionary has to be stored. A reference to an identical character string in the data stream that has already been read is expressed as a so-called triple, which encodes the position and length of the part to be copied as well as the first non-matching character. The matching is performed using a search buffer (the data already read) and a preview buffer (the data still to be encoded). If there is no match for a character, the triple is (0, 0, x), where x is the corresponding character. The following example walks through an LZ77 encoding of the word BANANA:

The characters or strings in the preview buffer are compared to the search buffer characters that have already been read. If matches are found, the corresponding characters are replaced by a referencing triple in the preview buffer.

In the fourth step, the characters ANA are replaced by the triple (2, 3, $), which means:

2 = Go back 2 positions in the search buffer, i.e. in the data that has already been read, and start copying from there (the character A).

3 = Copy 3 characters. Since only the two characters A and N lie between the start of the copy and the current position, the copy overlaps itself and the character just produced is reused (A, N, and again A).

$ = No more characters follow after the replacement; the encoding is complete.
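A compact encoder for this triple scheme can be sketched as follows. It is a simplified LZ77 without bounded buffer sizes, uses the "$" end marker from the example, and reproduces the four triples described above for BANANA.

def lz77_encode(data, end_marker="$"):
    # Minimal LZ77 encoder emitting (offset, length, next_char) triples:
    # offset    = how far to step back in the data already read (the search buffer),
    # length    = how many characters to copy (overlapping copies are allowed),
    # next_char = the first character that did not match, or the end marker.
    data = data + end_marker                  # append an explicit end-of-input marker
    triples, pos = [], 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        for offset in range(1, pos + 1):      # try every step-back distance
            length = 0
            # compare the preview buffer with the data `offset` positions back;
            # the index may run past `pos`, which models the overlapping copy
            while (pos + length < len(data) - 1
                   and data[pos + length - offset] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        next_char = data[pos + best_length]
        triples.append((best_offset, best_length, next_char))
        pos += best_length + 1
    return triples

print(lz77_encode("BANANA"))
# [(0, 0, 'B'), (0, 0, 'A'), (0, 0, 'N'), (2, 3, '$')]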

As most data can be compressed effectively using LZ77, successors of this algorithm – such as LZSS – are now very widely used, for example in the zip file format, PNG images, and PDFs.

Alongside repetition-based methods, frequency-based methods such as run-length coding or Huffman coding are also used for loss-free compression.

  • Run-length coding: run-length coding targets redundancies that result from a sequence of the same characters. This is illustrated by the schematic representation of a single image line of a black and white graphic.

Initial data: WWWWWBBBBWWBBWWWWWWBBBBBBWWW

A clear reduction in the data to be saved results from recording each run of repeated characters as a value pair made up of the number of repetitions and the character itself.

Compressed data: 5W4B2W2B6W6B3W

The run-length coding is used in image compression, but loses efficiency as the number of color states to be encoded grows. While black and white outlines can be represented with two color states, an 8-bit greyscale image already requires 256 shades of gray. The likelihood that more than two successive pixels share the same color value therefore decreases significantly in complex images with subtle color gradients. Since the coding in the example is based on two-character value pairs, a coding gain only occurs when at least three identical characters follow one another in a sequence; a run of two merely breaks even. A modified form of run-length coding is part of the well-known JPEG image compression method.
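A minimal sketch that reproduces the value pairs from the example:

from itertools import groupby

def run_length_encode(pixels):
    # Collapse each run of identical characters into a pair of count and character.
    return "".join(f"{len(list(run))}{value}" for value, run in groupby(pixels))

print(run_length_encode("WWWWWBBBBWWBBWWWWWWBBBBBBWWW"))
# 5W4B2W2B6W6B3W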

  • Huffman coding: during data compression, Huffman coding first determines the frequency of all characters occurring in a file. In the next step, the characters are translated into a bit sequence, in which the length of representation is determined by the frequency. Graphically, Huffman coding is represented as a tree, with the characters encoded forming the leaves. This is demonstrated by the example sentence, ‘This is an example of a huffman tree’.
Character:  (space)  a  e  f  h  i  m  n  s  t  l  o  p  r  u  x
Frequency:        7  4  4  3  2  2  2  2  2  2  1  1  1  1  1  1

The respective binary coding of a character results from a path leading from the root to the leaf. Here, 0 marks a branching off to the left and 1 indicates a branching off to the right.

The following code can be used for the characters of the example sentence:

Character:  (space)  a     e     f      h      i      m      n
Code:       111      010   000   1101   1010   1000   0111   0010

Character:  s     t     l      o      p      r      u      x
Code:       1011  0110  11001  00110  10011  11000  00111  10010

While frequently occurring characters such as the space, a, and e are encoded with only 3 bits in this example, rarer characters such as r, u, and x require 5 bits. By comparison, the ASCII character set uses 7 bits per character, and in ASCII-compatible encodings such as UTF-8 or ISO 8859-1 (Latin-1) each character occupies a full byte of 8 bits, since modern computers work with word sizes of 8, 16, 32, or 64 bits. This demonstrates that texts can be stored far more compactly with Huffman coding.

The following binary sequence shows the example sentence ‘This is an example of a Huffman tree’ according to Huffman coding:

0110 1010 1000 1011 111 1000 1011 111 010 0010 111 000 10010 010 0111 10011 11001 000 111 00110 1101 111 010 111 1010 00111 1101 1101 0111 010 0010 111 0110 11000 000 000

In comparison, below is the bit sequence for the same content in ASCII coding, with the added eighth bit visible as the leading zero of each byte:

01110100 01101000 01101001 01110011 00100000 01101001 01110011 00100000 01100001 01101110 00100000 01100101 01111000 01100001 01101101 01110000 01101100 01100101 00100000 01101111 01100110 00100000 01100001 00100000 01101000 01110101 01100110 01100110 01101101 01100001 01101110 00100000 01110100 01110010 01100101 01100101

With Huffman coding, data can be reduced to as little as a third of its original size, depending on the distribution of character frequencies. Users should note, however, that the coding tree must also be stored so that the data can be decoded later.

Huffman coding is not limited to compressing text files, however; the method is also used in JPEG image compression and in the zip file format, whose DEFLATE algorithm combines LZSS with Huffman coding.
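The construction of such a code can be sketched with a few lines of Python. Note that the exact bit patterns depend on how ties between equally frequent subtrees are broken, so they may differ from the table above, while the code lengths and the overall saving stay the same.

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a Huffman code table: a dict mapping each character to its bit string.
    # Heap items are (frequency, tie_breaker, tree), where a tree is either a
    # character (leaf) or a pair of subtrees (internal node).
    heap = [(freq, i, char) for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # merge the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: branch left (0) / right (1)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                 # leaf: assign the accumulated bit string
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
print(len(encoded), "bits instead of", 8 * len(text), "bits in 8-bit ASCII coding")
# 135 bits instead of 288 bits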

Lossy compression

While loss-free data compression is only possible when redundancies exist within a file, lossy data compression, which is based on irrelevance reduction, is always possible, provided a loss of data is accepted. As a basic principle, lossy data compression always involves a decision about which share of the original data is dispensable. The theoretical limit of the maximum achievable compression can be explored with rate-distortion theory, which describes the relationship between compression and signal quality.

Lossy compression methods are generally adapted to human perception. During the compression of an audio file (e.g. to MP3, AAC, or WMA), for example, frequency components that lie outside the human auditory range are removed. A common example of irrelevance reduction for images is the JPEG method, which combines loss-free and lossy compression steps. The lossy steps include the color conversion from RGB into the YCbCr color model, the discrete cosine transform (DCT), and quantization.
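As a rough illustration of the lossy steps, the sketch below applies a type-II DCT to an 8x8 block and quantizes the coefficients. The uniform quantization step of 16 is an arbitrary assumption for the example; actual JPEG encoders use standardized quantization tables and further steps (level shift, zigzag ordering, entropy coding) that are omitted here.

import math

def dct2_8x8(block):
    # Type-II two-dimensional DCT of an 8x8 block, the transform used in JPEG.
    N = 8
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [[alpha(u) * alpha(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

# A flat, mid-gray 8x8 block: after the transform, practically all energy sits in the
# top-left (DC) coefficient; quantization rounds the remaining coefficients to zero,
# which a subsequent run-length/Huffman stage can store very compactly.
block = [[128] * 8 for _ in range(8)]
step = 16                                   # uniform quantization step (illustrative)
quantized = [[round(c / step) for c in row] for row in dct2_8x8(block)]
print(quantized[0][0], sum(abs(q) for row in quantized for q in row))   # 64 64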

Deduplication and data compression: a comparison

Deduplication is generally used by companies to implement backup procedures or to optimize storage in standard file systems. This is because deduplication systems work extremely efficiently when identical files, or files with only minor changes, have to be stored repeatedly.

Data compression methods, on the other hand, generally involve a higher computing effort and therefore require more powerful platforms. Storage systems work most effectively when both data reduction methods are combined: deduplication first removes redundancies from the files to be saved, and the remaining data is then compressed.