ETOOBUSY 🚀 minimal blogging for the impatient
Grouping in NVdB
TL;DR
This might be controversial: grouping words after expansions.
Let’s see where we are now in this series:
- I like Passphrases and I’d like them to be more widespread
- To this end, I think that using words that people find relatable is crucial. No, I don’t have any study to point to, just the fact that the contrary didn’t work well because most people tried to fit something memorable in their passwords, and I can only guess we remember better things we know.
- For Italian, I looked at Nuovo vocabolario di base della lingua italiana to take inspiration for a good starting list
- Later I went down the Expanding words in NVdB rabbit hole to expand the list of words (and the possible entropy) without sacrificing the breadth of how much these words are known around.
At this point I managed to move from $7176$ up to a whopping $44160$ distinct words. This is $15.4$ bits of entropy per word, yay!
Well, not so fast.
First of all, how good would be a password like
gatta.gatto.gatti.gatte
? I’m not particularly fond of it:
- it’s lame and something that a human might definitely land on of left on their own “randomness”
- it’s even slightly difficult to remember. What order did I put them? Wait, no, maybe it was male first? Or plural?
Additionally, many of those words might be very short (one to three
letters) or long (seven or more letters). I wouldn’t like a password
like a.e.i.o
(which would be possible), and most people would refuse
to use nasceranno.svegliandosi.giardinaggio.identificazione
(52
characters). Heck, I would refuse it too!
Hence, my proposal is to group words that are similar, remove a group after it’s been used once, and choose randomly inside a group according to some algorithm that limits the number of short and long words that can be admitted in a password.
Here’s where the sets come handy. The data structure $set_for
keeps
one key for each word, pointing to another hash that is used as a set
(words are keys in the set, all pointing to a conventional 1
that is
unused). Here’s how it is initialized:
my $set_for = {map { lc($_) => {lc($_) => 1} } list_of_words_...
Each expansion adds new words, making sure to keep the relation across
each single expansion. So, for example, gatto
will end up in the set
for gatta
and vice-versa. The details are in the following function,
which is used by both expansion functions we saw in previous posts:
sub add_related ($set_for, $first, @other_words) {
$first = lc $first;
my $root = $set_for->{$first} // die "WTF?!?";
DEBUG " add_related: starting from <$first><$root>";
for my $w (@other_words) {
my $word = lc $w;
DEBUG " add_related: analysing <$word>";
if (my $wroot = $set_for->{$word}) {
DEBUG " add_related: wroot<$wroot>";
next if refaddr($root) eq refaddr($wroot); # already merged
($root, $wroot) = ($wroot, $root) # swap if...
if scalar(keys $root->%*) < scalar(keys $wroot->%*);
for my $item (keys $wroot->%*) {
$root->{$item} = 1;
$set_for->{$item} = $root;
}
} ## end if (my $wroot = $set_for...)
else {
DEBUG " add_related: NEW SET";
$set_for->{$word} = $root;
$root->{$word} = 1;
}
} ## end for my $w (@other_words)
} ## end sub add_related
Instead of creating separate sets, we merge them into one and point both words to it.
After the two expansions, some words might appear in multiple otherwise unrelated sets, so we can remain on the safe side and merge them all. This will end up in a partition of the whole set of words, and it’s what the following function does:
sub sets_to_groups ($sets) {
my @groups;
my %seen;
for my $set (values $set_for->%*) {
next if $seen{refaddr($set)}++;
my @words = sort { $a cmp $b }
grep { !m{[^a-z]}mxs } keys $set->%*;
next unless @words;
push @groups, \@words;
} ## end for my $set (values $set_for...)
return \@groups;
}
All words are actually already in the same sets if related, so we just
have to detect the separated ones. This is where refaddr
comes into
play, as it allows us to figure out if two “sets” are (exactly) the
same.
After passing through sets_to_groups
, we end up with an array of
arrays, each containing a list of one of the groups. And… there are
$6090$ such groups.
We’ll call them… raw groups. For reasons that will be discussed at due time.
Stay safe!