Cryptopals 3 - Single-byte XOR cipher

TL;DR

Challenge 3 in Cryptopals.

After getting our feet wet in the first two challenges, it’s time to do the first gear switching and decrypt some stuff.

As expected, we start soft with something that is just slightly more than a Caesar cipher. In that, once you know how it works, it’s pretty simple to crack: just try every possible rotation out of the 25 possible ones and stop when the text you get makes any sense.

In this case we’re considering a one-character (which is the same as one byte in this series, sorry Unicode!) key that we use for XOR-ing with each character in the ciphertext, but it’s otherwise the same: we apply exactly the same mapping to each character. So here, too, we might just try all the 256 alternatives and stop when it makes sense.

But we live in Artificial Intelligence Machine Learning times, so why not let the computer do the job? One of the most simple ways to do this is to compute a score out of each candidate plaintext and get the one with the best score:

#!/usr/bin/env perl
use v5.24;
use warnings;
use experimental 'signatures';
no warnings 'experimental::signatures';
use List::Util 'sum';
use CryptoPals qw< decode_base16 >;

my $input = '1b37373331363f78151b7f2b783431333d78397828372d363c7837'
          . '3e783a393b3736';

my ($decrypted, $score) = single_char_decrypt(decode_base16($input));
say $decrypted;

sub single_char_decrypt ($ciphertext, $scorer = undef) {

   # Adjust the scoring function depending on inputs
   $scorer //= \&score_by_frequency;
   if (ref $scorer eq 'HASH') {
      my $weight_for = $scorer;
      $scorer = sub { score_by_frequency($_[0], $weight_for) };
   }

   # Iterate through all possible one-char alternatives. We are assuming
   # that one char is one byte here, which basically means English.  The
   # key for decryption is a sequence of that one character that is the
   # same length as the ciphertext, hence $clen here
   my $clen = length($ciphertext);
   my ($best, $best_score);  # keep best scoring solution along the way
   for my $ord (0 .. 255) {
      my $repeated_key = chr($ord) x $clen;
      my $cleartext    = $ciphertext ^ $repeated_key;
      my $score        = $scorer->($cleartext);
      ($best, $best_score) = ($cleartext, $score)
        if !$best_score || $best_score < $score;
   } ## end for my $ord (0 .. 255)

   # depending on context we also give the score back
   return ($best, $best_score // 0) if wantarray;
   return $best;

} ## end sub single_char_decrypt

We’ve factored decode_base16 into an external module for ease of reuse.

The single_char_decrypt implements the scaffolding for our analysis: it gets the $ciphertext and tries out all the possible one-char (one-byte) alternatives, collecting the best along the way. Scoring is left to an input $scorer function, which gets some default but leaves space for trying out other alternatives.

One of the classical approaches, and most effective too (high gain with low expense) is to score a candidate plaintext according to the expected frequency of letters in a language. That is, if the most frequent letter in English is E, we would expect that the average text should have many of them compared to other letters, right?

In our case, then, we’ll calculate the score by just adding the score for each letter, assigning the score based on the occurences of the letter from some reference corpus. In our case we’re considering what is available in page English Letter Frequency (based on a sample of 40,000 words) - there are other sources with slightly different numbers, but not too much.

# Simple scoring function, each letter has an associated weight. By default
# we get the English letters frequency, but the hash with the weights for
# each letter can be passed in. Letters will be considered in uppercase.
sub score_by_frequency ($data, $weight_for = undef) {

   # Ignore data that contains non-printable non-space chars
   return 0 unless $data =~ m{\A [[:print:][:space:]]+ \z}mxs;

   state $english_weight_for = {

      # https://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
      E   => 21912, # ETAOIN SRHDLU joke...
      T   => 16587,
      A   => 14810,
      O   => 14003,
      I   => 13318,
      N   => 12666,
      S   => 11450,
      R   => 10977,
      H   => 10795,
      D   => 7874,
      L   => 7253,
      U   => 5246,
      C   => 4943,
      M   => 4761,
      F   => 4200,
      Y   => 3853,
      W   => 3819,
      G   => 3693,
      P   => 3316,
      B   => 2715,
      V   => 2019,
      K   => 1257,
      X   => 315,
      Q   => 205,
      J   => 188,
      Z   => 128,

      # set space the same as the most frequent letter, many words are
      # followed by a space
      ' ' => 21912,

   };
   $weight_for //= $english_weight_for;

   return sum map { $weight_for->{uc $_} // 0 } split m{}mxs, $data;
} ## end sub score_as_english_text ($data)

As indicated, we’re expecting fully printable texts here, so we toss away whatever has non-printable characters. Interestingly, POSIX says that class :print: does not contain spaces, even though they seem pretty printable to me. Whatever.

This calculation operation reminded me of a scalar product, although I’m not sure I can express why. Not that it makes a real difference though.

One last thing, texts in English do not only contain characters, but spaces and punctuation too. I decided to ignore punctuation, but spaces appear frequently so I decided to give them the same frequency of the most frequent letter (E in the English case). The rationale is that many words are followed by a space, so why not.

The example above works… but I’ll not print the result and spoiler it!

Stay safe and secure!

ETOOBUSY 🚀 minimal blogging for the impatient