TL;DR

Challenge 6 in Cryptopals.

After 5 challenges of waxing on, waxing off were’re finally there. In the words of the sixth challenge itself:

It is officially on, now.

This challenge isn’t conceptually hard, but it involves actual error-prone coding. The other challenges in this set are there to bring you up to speed. This one is there to qualify you. If you can do this one, you’re probably just fine up to Set 6.

Wow, I have to say that I felt the pressure.

Before moving on, I can confirm what the page says in the note at the bottom:

[…] there aren’t any blatant errors in this text. In particular: the “wokka wokka!!!” edit distance really is 37.

This challenge is more… challenging, but also very well guided. Well, at least I find it so. The goal is to leave us the tactic field, while telling us the strategy and giving us some hints. Much appreciated!

There are a few preparatory things that we have to address, let’s tackle them immediately and then move on to the challenge itself. Strap on!

Hamming distance

How many bits do we have to flip to turn one bit string into another bit string (both being of the same length, of course)? That number is the Hamming distance.

There are several ways to calculate it. I have a telecommunications background, so my inclination is to calculate the bitwise difference between the two sequences and then count the number of 1 that it contains, i.e. its Hamming weight. Let’s start from it:

sub hamming_weight ($s) {
   my $distance = 0;
   for my $octet (split m{}mxs, $s) {
      my $n = ord $octet;
      while ($n) {
         ++$distance if $n & 0x01;
         $n >>= 1;
      }
   }
   return $distance;
}

I’m not sure this is the most straighforward way of calculating it, but it’s easy and works. For each input byte I first calculate its corresponding integer value, then check-and-shift until I’ve run out of bits to count. An improvement might be to pre-calculate this weight in a state variable and just sum up the value for each input octet, but this is left as an exercise for the reader.

With this in our toolset, calculating the Hamming distance is easy. The key insight is that the bitwise difference is the same as using the xor operator:

sub hamming_distance ($s1, $s2) { hamming_weight($s1 ^. $s2) }

This applies because the truth table for the XOR operator yields 1 if, and only if, the input bits are different. How handy!

Key size guessing

To guess the key size, the key intuition is that English text should be composed of letters (mostly), so the Hamming distance between two of them should not be too much.

How does ths help, though?

Suppose that the key $K$ is $k$ chars long and we take two consecutive chunks $C_1$ and $C_2$ from the ciphertext that are exactly $k$ chars long. As we’re dealing with repeated XOR by the same key, they have been produced like this:

\[C_1 = P_1 \oplus K \\ C_2 = P_2 \oplus K\]

where $P_1$ and $P_2$ are consecutive chunks of $k$ chars from the plaintext, and $\oplus$ is the XOR operator.

Let’s XOR the two chunks. We obtain:

\[\begin{align} C_1 \oplus C_2 &= (P_1 \oplus K) \oplus (P_2 \oplus K) \\ &= P_1 \oplus K \oplus P_2 \oplus K \\ &= P_1 \oplus P_2 \oplus K \oplus K \\ &= P_1 \oplus P_2 \oplus (K \oplus K) \\ &= P_1 \oplus P_2 \oplus 0 \\ &= P_1 \oplus P_2 \end{align}\]

The steps above leverage the fact that the XOR operator is associative (step 2, so it does not matter the order in which we apply it between operands) and commutative (step 3, so we can switch the middle $K$ with the middle $C_2$). The last XOR is put in evidence in the third step, but it will identically 0 because each bit will be XORed with itself; last, XORing with 0 means doing nothing, hence the result.

So, at the end of the day, XORing two consecutive chunks of ciphertext is the same as doing it on the corresponding plaintexts, and here is where our initial insight comes handy.

Of course longer keys will generally provide longer Hamming distances between consecutive chunks, so the challenge text itself warns us about normalizing this weight by the key length.

A little generalization of this process is provided in the following Perl function:

sub average_weight_for_size ($ciphertext, $klen) {
   die "not enough data\n" if length($ciphertext) < 2 * $klen;
   my $sum = 0;
   my $denominator = 0;
   my $previous = substr $ciphertext, 0, $klen, '';
   while (length $ciphertext > $klen) {
      my $current = substr $ciphertext, 0, $klen, '';
      $sum += hamming_distance($previous, $current);
      $denominator += $klen;
      $previous = $current;
   }
   return $sum / $denominator;
}

Instead of just considering the first two $k$-long chunks from the input, we consider it using also the second pair, the third one, and so on. I have to admit that I didn’t evaluate how much this is good with respect to just taking the first two, so by all means feel free to think it’s silly and take a different route.

With this in our toolbox, we can assign a “score” to each candidate key lengths, by putting those with a lower average weight first as more probably than others:

sub guess_key_sizes ($ciphertext, $min_klen = 2, $max_klen = 40) {
   return sort { $a->[1] <=> $b->[1] }
      map { [$_, average_weight_for_size($ciphertext, $_)] }
      $min_klen .. $max_klen;
}

For short keys, this might mean that a multiple of the real key length might win instead. At the end of the day, what is the difference between key ICE and key ICEICE for the algorithm we’re going to break?

Attack to the ciphertext!

Now that we have our best guess(es) for the key length, what should we do?

The key insight at this stage is that, for each consecutive chunk of that size:

the first character/byte is encrypted with the same one-byte key (i.e. the first character/byte of the key)
the second character/byte is encrypted with the same one-byte key (i.e. the second character/byte of the key)
and so on, up to the last character/byte.

This means that we actually have to solve $k$ parallel problems like the one we solved in challenge 3.

It is good that we didn’t code a sophisticated scoring function but just used the frequency of each letter alone. For example, we might have kept into consideration the frequency of appearance of two consecutive letters, or more, in English text - but this would have worked for English text. In this case, each slice encrypted with a single character/byte is not English text, just letters taken from it.

So, again, good that we focused on the frequency of each letter alone.

Let’s see how to attack the ciphertext assuming that we have the right key length:

sub attack_repeated_xor_bylen ($ciphertext, $klen) {
   my @cipherbytes = split m{}mxs, $ciphertext;
   my @plainbytes = ('') x @cipherbytes;
   for my $offset (0 .. $klen - 1) {
      my @same_key_bytes;
      for (my $i = $offset; $i < @cipherbytes; $i += $klen) {
         push @same_key_bytes, $cipherbytes[$i];
      }
      my $cipherslice = join '', @same_key_bytes;
      my ($slice, $score) = single_char_decrypt($cipherslice);
      @same_key_bytes = split m{}mxs, $slice;
      for (my $i = $offset; $i < @cipherbytes; $i += $klen) {
         $plainbytes[$i] = shift @same_key_bytes;
      }
   }
   return join '', @plainbytes;
}

As we said, we got one single key character at a time, i.e. one single offset from $0$ to $k - 1$.

We use this offset to pick the corresponding characters from the ciphertext, i.e. the first character is that at the offset, then the one after $k$ characters, then again the character after $k$ characters, and so on.

This was one of those rare occasions where I resume the C-style for loop in Perl. Call me nostalgic.

Putting together these characters provides us with the $cipherslice that we assume has been encrypted with a 1-character key. This is where our previously coded single_char_decrypt comes to the rescue, providing us the (plain)$slice back.

At this point, we have to fit this slice back in the right place into the @plainbytes, using the same offset and spacing rules as we did for the ciphertext.

To code our attack based on a specific key length, then, we have to rearrange our ciphertext by slices and attack each one individually. After having gone through all the offsets, @plainbytes contains the whole decrypted thing, so we can produce the return value by joining them.

Extra: evaluate key length effectiveness

The last bit we consider is the possibility that we got the key length wrong. To do this, we can put an outer loop iterating over key lengths and trying to decode with the next probable key length, and so on.

There can be many approaches to this regard, I decided to go with a simple metric, assessing whether each word belongs or not to a list of words I downloaded from the Internet:

sub smells_like_text ($data) {
   my %is_text_word = map { $_ => 1 } wordlist()->@*;
   my ($n_words, $n_text_words) = (0, 0);
   for my $word (split m{[\s,.;'"?!]+}mxs, lc $data) {
      ++$n_words;
      ++$n_text_words if $is_text_word{$word};
   }
   return $n_text_words / $n_words;
}

sub wordlist {
   my $filename = dirname(__FILE__) . '/wlist.txt';
   return [ split m{\n}mxs, slurp($filename) ];
}

If you need a list of words… you might find A Public Domain List of Words useful 🙄

Dwarf Fortress rocks!

Putting the pieces together

With our smells_like_text function, we can now code an outer loop to look for the right key, taking into consideration the posterior evaluation of the text quality:

sub attack_repeated_xor ($ciphertext, $threshold = 0.7) {
   for my $candidate (guess_key_sizes($ciphertext)) {
      my ($size, $score) = $candidate->@*;
      my $decrypted = attack_repeated_xor_bylen($ciphertext, $size);
      my $decryption_score = smells_like_text($decrypted);
      if ($decryption_score >= $threshold) {
         return ($decrypted, $decryption_score) if wantarray;
         return $decrypted;
      }
   }
   return;
}

The 70% threshold was tuned with trial-and-error, so it’s totally not scientific. I was a bit surprised that the solution was at $76.7%$, because I was expecting more; it depends on the adherence of the word list to the text, anyway, and it’s not surprising once we… see the solution.

Final remarks

It’s been a long ride, but worth it. I wholeheartedly suggest you to go through the same, maybe taking hints/inspiration from my code above if you feel stuck; don’t be shy to try things though, you might be surprised by your code.

Stay safe and secure!

ETOOBUSY 🚀 minimal blogging for the impatient

Cryptopals 6 - Break repeated-key XOR