Convergent Encryption
% Remco Bloemen % 20140320, last updated 20140507
Value of deduplication
When storing large amounts of data it is efficient to store duplicate data only once, a process called deduplication. This is particularly interesting in large corporate networks where many machines contain identical documents. A randomly selected sample of 585 desktops at Microsoft has revealed of the total 685 GB of documents only 368 GB was unique. 1 This implies that up to half of the space can be saved by deduplication.
Even more space can be saved by finding duplicate content inside different, but similar files. Such files arise often when storing different versions of the same document. An experiment was done by Coblue: A LibreOffice writer document containing several megabytes of images had some changes made. Images where added, removed, rearranged and the text was edited. When storing the different versions together, the total storage space required was little more than the sum of the sizes of the large images. This result was surprising, because Open Document Format files are compressed and a single byte change could change the whole file (much like with encrypted files).
Deduplication has more advantages than saving space. When synchronizing machines, deduplication can minimize the amount of data that must be transfered. This reduces bandwidth consumption and speeds up the synchronization process. Faster synchronization in turn leads to less divergence and more resilience.
Content addressable storage
A simple but insecure method of deduplication is using a contentaddressable storage (CAS). A CAS has two operations:
put
: content address → dataget
: data → content address
To store a file or other large piece of binary data the put
method is
called which results in an short address. Ideally, calling put
with
the same data will always result in the same address and calling put
with different data will always result in a different address. Once
put
is called at least once, the data can then be retrieved with
get
. Note that get
requires the address, of the data, so the
interface nicely forces that put
is called first.
Often the content addresses are cryptographic hashes of the data. And the CAS is implemented as an associative array of addresses to data. When the determining the address length the birthday attack should be taken into account. As a guideline, 128 bits seem enough to prevent an attacker from colliding with a given preexisting file (possibly corrupting that file) and 256 bit seem enough to prevent an attacker from manufacturing two files that collide with each other (possibly confusing the system). This is assuming the brute force is the best attack on the hash function.
Resilience is an important quality of hashbased contentaddressable storage. If data is lost at one location, but a (H, X') pair is recovered it can be verified to be correct. This will work no matter whether the pair came from. A backup system, a shady figure in an ally or a fortuneteller, the source doesn’t have to be trustworthy.
Privacy and deduplication
The hashed associative array approach does not provide any privacy. Anyone with access to the raw associative array can list the contents of all the stored files. It is possible to encrypt all the files before storing them, but if the same file is encrypted with a different key, the encrypted files will be different and the deduplication will fail. This could be solved by encrypting all the files with the same key, but then everyone with some access would immediately have full access.
We need a solution with different encryption keys for every file but that still deduplicates files. This may seem paradoxical, until you bend the rules slightly to require different encryption keys for every different file. You can then have equal keys for equal files. Equal files, equal keys thus equal encrypted data and problem solved!
In convergent encryption this is done by taking a hash of the unencrypted file as the encryption key. This is safe because you need the key to decrypt the file, unless you have already have the unencrypted file, but then there is nothing left to secure.
Note
This is not to be confused with deterministic encryption, which means the same output for the same input. Allmost all modern cryptographic algorithms such as RSA, ECDSA and AES are deterministic. They are made nondeterministic by introducing a random salt or an initialization vector.
It was famously used by bitcasa to claim ‘infinite storage’, but it is also used by: Dropbox, Wuala, TahouLAFS, Microsoft Farsite, Permabit, Freenet, GNUnet, flud, MojoNation
Convergent encryption
First we need some ingredients. Take the cryptographic primitives
$$ \begin{aligned} \mathrm{H_A} &: &\{0,1\}^{*} &→ \{0,1\}^{l_A} \\ \mathrm{H_B} &: &\{0,1\}^{*} &→ \{0,1\}^{l_B} \\ \mathrm{E} &: &\{0,1\}^{l_K} × \{0,1\}^{*} &→ \{0,1\}^{*} \\ \mathrm{D} &: &\{0,1\}^{l_K} × \{0,1\}^{*} &→ \{0,1\}^{*} \end{aligned} $$
where $\mathrm{H_A}$ and $\mathrm{H_B}$ are cryptographic hash functions of length $l_A$ and $l_B$ respectively and $\mathrm{E}$ and $\mathrm{D}$ are symmetric encryption and decryption functions with key length $l_K$. This implies that
$$ \begin{aligned} \mathrm{H_A}(X_1) &= \mathrm{H_A}(X_2) &&⇔ &X_1 &= X_2 \\ \mathrm{H_B}(X_2) &= \mathrm{H_B}(X_2) &&⇔ &X_1 &= X_2 \\ \mathrm{D}(K_1, \mathrm{E}(K_2, X)) &= X &&⇔ & K_1 &= K_2 \end{aligned} $$
where the implications in the leftward direction are exact but in the rightward implications are only with cryptographic confidence. This will be important later in the security analysis. For permanent storage, we will use an associative array with interface
$$ \begin{aligned} \mathrm{Store} &:& \{0,1\}^{l_B} × \{0,1\}^{*} &→ ∅ \\ \mathrm{Retrieve} &:& \{0,1\}^{l_B} &→ \{0,1\}^{*} \end{aligned} $$
The convergent encryption store will have the interface
$$ \begin{aligned} \mathrm{Put} &:& \{0,1\}^{*} &→ \{0,1\}^{l_B} × \{0,1\}^{l_A} \\ \mathrm{Get} &:& \{0,1\}^{l_B} × \{0,1\}^{l_A} &→ \{0,1\}^{*} \end{aligned} $$
The $\mathrm{Put}$ operation takes a piece of information and stores it encrypted in the associative array. It is implemented as
$$ \begin{aligned} \mathrm{Put}(X) &≕ (H, K)\\ K &= \mathrm{H_A}(X) \\ X' &= \mathrm{E}(K, X) \\ H &= \mathrm{H_B}(X') \\ &\phantom{=} \mathrm{Store}(H, X') \end{aligned} $$
The $\mathrm{Get}$ operation is simply the inverse operation
$$ \begin{aligned} \mathrm{Get}(H, K) &≕ X \\ X' &= \mathrm{Retrieve}(H) \\ X &= \mathrm{D}(K, X') \end{aligned} $$
The $\mathrm{Get}$ operation can optionally verify the integrity of the data by checking $K = \mathrm{H_A}(X)$ and/or $H = \mathrm{H_A}(X')$. The later has an advantage we will show below.
The implementation given above requires three passes through the data. The passes have data dependencies so they can not be parallelized. It is possible to develop a twopass system. The hash H of the output stream X' can be calculated while the output stream is being produced. This is similar to authenticated encryption but differs in the use of a hash instead of a message authentication code. But it is not possible to construct a onepass system that deduplicates and has at least the security of convergent encryption. In a onepass system, the first couple of bytes can not rely on all the remaining bytes. But this is necessary to have a key with the entropy of the entire file.
Weaknesses
Assuming the attacker has access to the associative array and knows the functions used, what can he do?
The associative array only stores ciphertext $X'$ and the hashes thereof, $H = H_B(X')$. It never stores the plaintext $X$ or even the hash of the plaintext K.
It is not possible for an attacker to manipulate the stored information. If the attacker where to change a pair $(H, X')$ into $(H, X'')$ it can be detected because the relation $H = H_B(X')$ will no longer hold. This check only requires knowledge of $H_B$, the verifier does not even have to have the key to the data $K$. It is therefore advisable to implement this check at any place where such manipulation might occur.
It is possible for an attacker to remove information. If the attacker somehow knows the H it can remove the pair and make the information unavailable. Or it can simply remove the entire associative array. Or the system containing the associative array could catch fire. Fending off these attacks is about having backup systems in place and preventing unauthorized access. The inherent resilience of hashbased contentaddressable storage will solve the rest.
Confirmation attack
A more fundamental problem with convergent encryption is the confirmation attack. Here an attacker can check if a given key $H$ is in the associative array. If the attacker can do this, he can also check if a given plaintext $X$ is in the associative array by checking the presence of
$$ \begin{aligned} H = \mathrm{H_B}(\mathrm{E}(\mathrm{H_A}(X), X))) \end{aligned} $$
If no preventative measures are taken, this could allow an attacker to confirm if the user is in possession of a certain file, for example a banned book or a pirated movie.
Offline bruteforce attack
In an offline attack the attacker can try key combinations at his leisure without the risk of discovery or interference. It could for example mount a bruteforce attack by trying out all possible keys until the right one is found.
Determining when the right key is found is hard in conventional encryption systems. Every possible key will result in some sort of plaintext, correct or not. Determining what the correct plaintext looks like involves a heuristic. In general it is impossible to find such a heuristic, and this is what provides the onetime pad with its perfect security.
In convergent encryption it is easy to recognize the correct key. The correct key K will satisfy the equation
$$ \begin{aligned} K = \mathrm{H_A}(\mathrm{D}(K, X')) \end{aligned} $$
While theoretically interesting, offline bruteforce attacks on conventional symmetric cyphers are already possible in practice. Plaintexts often contain easily recognizable structures such as file headers. This can then be used as an effective heuristic to check if the correct key is found. Such an attack will work on any cipher where keys are significantly shorter than the messages, which in practice means anything but the onetime pad.
Learn the remaining attack
Perhaps the most important of the possible attacks is the learn the remaining attack. Suppose an attacker knows most of the file, for example if the file is a PDF form where the user needs to fill in sensitive information, say a PIN code. The attacker can now create all possible versions of the file X and check if it matches a ciphertext X' by encrypting and comparing or using the identity
$$ \begin{aligned} X = \mathrm{D}(\mathrm{H_A}(X), X')) \end{aligned} $$
In fact, the attacker does not even need to have the value $X'$, it is sufficient to check if a given key $H$ is in the associative array using the equation given above.
This attack is possible when $X$ is known to be a member of a small set. The set of possible $X$'s can then be exhaustively tried at the small cost of three hashing and one encryption operation per try. Or in information theoretical terms: when the relative entropy of the plaintext relative to the attacker is low.
Domain separation
Suppose we make $\mathrm{H_A}$ a keyed hash function with a key $K_A$. This will thwart attacks relying on knowledge of $\mathrm{H_A}$, unless the attacker knows the key. This includes all major the attacks mentioned above. The downside is that it also restricts deduplication to data encrypted with the same $K_A$.
The different $K_A$ effectively create different domains. Within such a domain, deduplication happens. But anyone within the domain can also confirm that certain files are stored, or even use the learn the remaining attack to find which of a small set of possible files is stored. From outside the domain, all the encryption keys K depend on the domain key $K_A$. The domains are thus at least as secure as when they where encrypted using just the domain key.
Where the domain boundaries should be, or when to use different $K_A$'s will be an important design decision. Larger domains will result in more deduplication and a more efficient system, but it will also leak some knowledge about what is stored in the system to a larger group.
Cryptographic tuning
For the system to function properly it relies on:
$H_B$ needs preimage resistance to prevent an attacker from manipulating data undetectably. Collision resistance would be required if the implementation can not handle hash collisions.
$H_A$ needs to be a good randomness extractor. Its role is to extract sufficient entropy from the plaintext to create a key. A hash that maps the plaintext distribution uniformly to the set of encryption keys will satisfy this.
See also

https://www.usenix.org/legacy/event/lisa10/tech/full_papers/Anderson.pdf

https://research.microsoft.com/pubs/74262/Sigmetrics2000.pdf

https://tahoelafs.org/pipermail/tahoedev/2008March/000449.html

http://thesimplecomputer.info/behindthecurtainofencryptedcloudstoragepage1/

http://paranoia.dubfire.net/2011/04/howdropboxsacrificesuserprivacyfor.html

http://patft.uspto.gov/netacgi/nphParser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=5778395.PN.&OS=PN/5778395&RS=PN/5778395

http://www.patents.com/us7437555.html

http://patft.uspto.gov/netacgi/nphParser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,412,462.PN.&OS=PN/7,412,462&RS=PN/7,412,462