# ETOOBUSY đźš€ minimal blogging for the impatient

# Cryptopals 6 - Break repeated-key XOR

**TL;DR**

After 5 challenges of *waxing on, waxing off* wereâ€™re finally there. In
the words of the sixth challenge itself:

It is officially on, now.This challenge isnâ€™t conceptually hard, but it involves actual error-prone coding. The other challenges in this set are there to bring you up to speed. This one is there to

qualifyyou. If you can do this one, youâ€™re probably just fine up to Set 6.

Wow, I have to say that I felt the pressure.

Before moving on, I can *confirm* what the page says in the note at the
bottom:

[â€¦] there arenâ€™t any blatant errors in this text. In particular: the â€śwokka wokka!!!â€ť edit distance really is 37.

This challenge is moreâ€¦ *challenging*, but also very well guided.
Well, at least I find it so. The goal is to leave us the tactic field,
while telling us the strategy and giving us some hints. Much
appreciated!

There are a few preparatory things that we have to address, letâ€™s tackle them immediately and then move on to the challenge itself. Strap on!

# Hamming distance

How many bits do we have to flip to turn one bit string into another bit string (both being of the same length, of course)? That number is the Hamming distance.

There are several ways to calculate it. I have a telecommunications
background, so my inclination is to calculate the *bitwise difference*
between the two sequences and then count the number of `1`

that it
contains, i.e. its *Hamming weight*. Letâ€™s start from it:

```
sub hamming_weight ($s) {
my $distance = 0;
for my $octet (split m{}mxs, $s) {
my $n = ord $octet;
while ($n) {
++$distance if $n & 0x01;
$n >>= 1;
}
}
return $distance;
}
```

Iâ€™m not sure this is the most straighforward way of calculating it, but itâ€™s easy and works. For each input byte I first calculate its corresponding integer value, then check-and-shift until Iâ€™ve run out of bits to count. An improvement might be to pre-calculate this weight in a state variable and just sum up the value for each input octet, but this is left as an exercise for the reader.

With this in our toolset, calculating the Hamming distance is easy.
The key insight is that the bitwise difference is *the same* as using
the xor operator:

```
sub hamming_distance ($s1, $s2) { hamming_weight($s1 ^. $s2) }
```

This applies because the *truth table* for the XOR operator yields `1`

if, and only if, the input bits are *different*. How handy!

# Key size guessing

To guess the key size, the key intuition is that English text should be
composed of letters (mostly), so the Hamming distance between two of
them should *not* be too much.

How does ths help, though?

Suppose that the key $K$ is $k$ chars long and we take two consecutive chunks $C_1$ and $C_2$ from the ciphertext that are exactly $k$ chars long. As weâ€™re dealing with repeated XOR by the same key, they have been produced like this:

\[C_1 = P_1 \oplus K \\ C_2 = P_2 \oplus K\]where $P_1$ and $P_2$ are consecutive chunks of $k$ chars from the
*plaintext*, and $\oplus$ is the XOR operator.

Letâ€™s XOR the two chunks. We obtain:

\[\begin{align} C_1 \oplus C_2 &= (P_1 \oplus K) \oplus (P_2 \oplus K) \\ &= P_1 \oplus K \oplus P_2 \oplus K \\ &= P_1 \oplus P_2 \oplus K \oplus K \\ &= P_1 \oplus P_2 \oplus (K \oplus K) \\ &= P_1 \oplus P_2 \oplus 0 \\ &= P_1 \oplus P_2 \end{align}\]The steps above leverage the fact that the XOR operator is associative (step 2, so it does not matter the order in which we apply it between operands) and commutative (step 3, so we can switch the middle $K$ with the middle $C_2$). The last XOR is put in evidence in the third step, but it will identically 0 because each bit will be XORed with itself; last, XORing with 0 means doing nothing, hence the result.

So, at the end of the day, XORing two consecutive chunks of ciphertext
is *the same* as doing it on the corresponding plaintexts, and *here* is
where our initial insight comes handy.

Of course longer keys will generally provide longer Hamming distances
between consecutive chunks, so the challenge text itself warns us about
*normalizing* this weight by the key length.

A little generalization of this process is provided in the following Perl function:

```
sub average_weight_for_size ($ciphertext, $klen) {
die "not enough data\n" if length($ciphertext) < 2 * $klen;
my $sum = 0;
my $denominator = 0;
my $previous = substr $ciphertext, 0, $klen, '';
while (length $ciphertext > $klen) {
my $current = substr $ciphertext, 0, $klen, '';
$sum += hamming_distance($previous, $current);
$denominator += $klen;
$previous = $current;
}
return $sum / $denominator;
}
```

Instead of just considering the first two $k$-long chunks from the input, we consider it using also the second pair, the third one, and so on. I have to admit that I didnâ€™t evaluate how much this is good with respect to just taking the first two, so by all means feel free to think itâ€™s silly and take a different route.

With this in our toolbox, we can assign a â€śscoreâ€ť to each candidate key lengths, by putting those with a lower average weight first as more probably than others:

```
sub guess_key_sizes ($ciphertext, $min_klen = 2, $max_klen = 40) {
return sort { $a->[1] <=> $b->[1] }
map { [$_, average_weight_for_size($ciphertext, $_)] }
$min_klen .. $max_klen;
}
```

For short keys, this might mean that a *multiple* of the real key length
might win instead. At the end of the day, what is the difference between
key `ICE`

and key `ICEICE`

for the algorithm weâ€™re going to break?

# Attack to the ciphertext!

Now that we have our best guess(es) for the key length, what should we do?

The key insight at this stage is that, for each consecutive chunk of that size:

- the first character/byte is encrypted with the same one-byte key (i.e. the first character/byte of the key)
- the second character/byte is encrypted with the same one-byte key (i.e. the second character/byte of the key)
- and so on, up to the last character/byte.

This means that we actually have to solve $k$ parallel problems like the one we solved in challenge 3.

It is good that we didnâ€™t code a sophisticated scoring function but just
used the frequency of each letter alone. For example, we might have kept
into consideration the frequency of appearance of two consecutive
letters, or more, in English text - but this would have worked for
*English text*. In this case, each *slice* encrypted with a single
character/byte is *not* English text, just letters taken from it.

So, again, good that we focused on the frequency of each letter alone.

Letâ€™s see how to attack the ciphertext assuming that we have the right key length:

```
sub attack_repeated_xor_bylen ($ciphertext, $klen) {
my @cipherbytes = split m{}mxs, $ciphertext;
my @plainbytes = ('') x @cipherbytes;
for my $offset (0 .. $klen - 1) {
my @same_key_bytes;
for (my $i = $offset; $i < @cipherbytes; $i += $klen) {
push @same_key_bytes, $cipherbytes[$i];
}
my $cipherslice = join '', @same_key_bytes;
my ($slice, $score) = single_char_decrypt($cipherslice);
@same_key_bytes = split m{}mxs, $slice;
for (my $i = $offset; $i < @cipherbytes; $i += $klen) {
$plainbytes[$i] = shift @same_key_bytes;
}
}
return join '', @plainbytes;
}
```

As we said, we got one single key character at a time, i.e. one single
*offset* from $0$ to $k - 1$.

We use this offset to pick the corresponding characters from the ciphertext, i.e. the first character is that at the offset, then the one after $k$ characters, then again the character after $k$ characters, and so on.

This was one of those rare occasions where I resume the C-style

`for`

loop in Perl. Call me nostalgic.

Putting together these characters provides us with the `$cipherslice`

that we assume has been encrypted with a 1-character key. This is where
our previously coded `single_char_decrypt`

comes to the rescue,
providing us the (plain)`$slice`

back.

At this point, we have to fit this slice back in the right place into
the `@plainbytes`

, using the same offset and spacing rules as we did for
the ciphertext.

To code our attack based on a specific key length, then, we have to
rearrange our ciphertext by *slices* and attack each one individually.
After having gone through all the offsets, `@plainbytes`

contains the
whole decrypted thing, so we can produce the `return`

value by joining
them.

# Extra: evaluate key length effectiveness

The last bit we consider is the possibility that we got the key length wrong. To do this, we can put an outer loop iterating over key lengths and trying to decode with the next probable key length, and so on.

There can be many approaches to this regard, I decided to go with a simple metric, assessing whether each word belongs or not to a list of words I downloaded from the Internet:

```
sub smells_like_text ($data) {
my %is_text_word = map { $_ => 1 } wordlist()->@*;
my ($n_words, $n_text_words) = (0, 0);
for my $word (split m{[\s,.;'"?!]+}mxs, lc $data) {
++$n_words;
++$n_text_words if $is_text_word{$word};
}
return $n_text_words / $n_words;
}
sub wordlist {
my $filename = dirname(__FILE__) . '/wlist.txt';
return [ split m{\n}mxs, slurp($filename) ];
}
```

If you need a list of wordsâ€¦ you might find A Public Domain List of Words useful đź™„

Dwarf Fortress rocks!

# Putting the pieces together

With our `smells_like_text`

function, we can now code an outer loop to
look for the *right* key, taking into consideration the posterior
evaluation of the text quality:

```
sub attack_repeated_xor ($ciphertext, $threshold = 0.7) {
for my $candidate (guess_key_sizes($ciphertext)) {
my ($size, $score) = $candidate->@*;
my $decrypted = attack_repeated_xor_bylen($ciphertext, $size);
my $decryption_score = smells_like_text($decrypted);
if ($decryption_score >= $threshold) {
return ($decrypted, $decryption_score) if wantarray;
return $decrypted;
}
}
return;
}
```

The 70% threshold was tuned with trial-and-error, so itâ€™s *totally* not
scientific. I was a bit surprised that the solution was at $76.7%$,
because I was expecting more; it depends on the adherence of the word
list to the text, anyway, and itâ€™s not surprising once weâ€¦ *see the
solution*.

# Final remarks

Itâ€™s been a long ride, but worth it. I wholeheartedly suggest you to go through the same, maybe taking hints/inspiration from my code above if you feel stuck; donâ€™t be shy to try things though, you might be surprised by your code.

Stay safe *and secure*!

*Comments? Octodon, Twitter, GitHub, Reddit, or drop me a line!*