ETOOBUSY 🚀 minimal blogging for the impatient
A Public Domain List of Words
TL;DR
In A Public Domain List of Adjectives we found out how you can get some words out of Dwarf Fortress. Now let’s turn all of that into a JSON file.
What word types are available?
The file in Dwarf Fortress has a regular shape, like this:
language_words
[OBJECT:LANGUAGE]
[WORD:ABBEY]
[NOUN:abbey:abbeys]
[FRONT_COMPOUND_NOUN_SING]
[REAR_COMPOUND_NOUN_SING]
[THE_NOUN_SING]
[REAR_COMPOUND_NOUN_PLUR]
...
[ADJ:ace]
[ADJ_DIST:1]
[FRONT_COMPOUND_ADJ]
[THE_COMPOUND_ADJ]
...
[VERB:act:acts:acted:acted:acting]
[STANDARD_VERB]
...
[PREFIX:after]
[FRONT_COMPOUND_PREFIX]
[THE_COMPOUND_PREFIX]
So, it seems that:
- interesting stuff is inside bracket pairs;
- word types appear first, followed by a colon (e.g. NOUN, VERB, …);
- there is other stuff matching this pattern (e.g. WORD and PREFIX).
Let’s do some Perl magic to extract all candidates:
$ perl <language_words.txt \
-E '$/=undef; $_=<>; $x{$_}=1 for m{\G.*?\[(\w+):.*?\]}gmxs; say for keys %x'
VERB
PREFIX
OBJECT
ADJ_DIST
WORD
NOUN
ADJ
A little breakdown:
- $/=undef turns on slurp mode, i.e. the whole input file (language_words.txt, provided as standard input) is read as one single string;
- $_=<> reads the whole file into Perl’s topic variable, i.e. the variable most operators work on when no explicit variable is given;
- the match m{\G...}gmxs is a global match (modifier g) to catch all stuff that is enclosed in brackets and has at least one colon character inside;
- the match being global, it returns all the matching captures in list context, so we iterate over it with for and set a flag in hash %x (with $x{$_}=1 for m{...}gmxs);
- last, we print out all the collected keys.
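For those who prefer a script over a one-liner, here is a minimal sketch of the same idea. The sub name tag_names and the inline sample are assumptions made up for this example; the regular expression is a simplified variant of the one above (no \G.*? prefix), which should find the same names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Collect the distinct names that appear right after an opening bracket
# and are followed by a colon, e.g. "NOUN" from "[NOUN:abbey:abbeys]".
# (tag_names is a hypothetical helper, not part of the original post.)
sub tag_names {
    my ($text) = @_;
    my %seen;
    # In list context, a global match returns every capture it finds.
    $seen{$_} = 1 for $text =~ m{\[(\w+):.*?\]}gs;
    my @names = sort keys %seen;
    return @names;
}

# Tiny inline sample shaped like the excerpt above (an assumption).
my $sample = <<'END';
[WORD:ABBEY]
    [NOUN:abbey:abbeys]
        [THE_NOUN_SING]
    [ADJ:ace]
        [ADJ_DIST:1]
END

print "$_\n" for tag_names($sample);
```

To run it on the real file, the sample can be replaced with a slurp of standard input, e.g. my $text = do { local $/; <STDIN> }.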
Looking at the output list, I’d bet on VERB, NOUN, and ADJ as our
targets.
Turn that into JSON
It’s time to take a closer look at the different word types:
- ADJ always includes one item, so that’s it. We will just put that as a string, inside an array;
- NOUN always includes two items, possibly empty, the first one being the singular form and the other one the plural form;
- VERB is the most complicated, with five parts: present, present (third person), past, past participle, ing form.
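The breakdown above can be sketched as a small payload parser. The field names (singular, plural, present, …) and the hash-based records are assumptions for this example only; the actual script might well shape its data differently.

```perl
use strict;
use warnings;

# Field names per word type. These key names are assumptions made
# for this sketch, they do not come from the Dwarf Fortress file.
my %fields_for = (
    NOUN => [qw< singular plural >],
    VERB => [qw< present third_person past past_participle ing_form >],
);

# Turn the part after the first colon into something structured.
sub parse_payload {
    my ($type, $payload) = @_;
    return $payload if $type eq 'ADJ';    # one item only, keep the string
    my $names = $fields_for{$type} or die "unsupported type '$type'\n";
    # Limit -1 keeps trailing empty fields, e.g. an empty plural in "abbey:".
    my @parts = split /:/, $payload, -1;
    my %record;
    @record{@$names} = @parts;
    return \%record;
}

my $noun = parse_payload(NOUN => 'abbey:abbeys');
print "$noun->{singular} / $noun->{plural}\n";

my $verb = parse_payload(VERB => 'act:acts:acted:acted:acting');
print "$verb->{present} ... $verb->{ing_form}\n";
```

The -1 limit in split matters because NOUN items may be empty: without it, Perl would silently drop a trailing empty plural.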
Time to code:
There’s also a local version, as usual.
Slurping the input data has no mysteries for us (line 8).
When matching the data with modifier g, Perl happily goes through all
matches. In this case, we have two capturing groups, one for the type and
one for its payload, which will end up filling array @pairs in this
order (i.e. type, payload, type, payload, …).
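As a quick illustration of how a global match with two capturing groups flattens into one list, here is a minimal sketch. The regular expression is an assumption for this example (it only looks for the three types we care about) and is not necessarily the one used in the script.

```perl
use strict;
use warnings;

my $text = "[WORD:ABBEY]\n[NOUN:abbey:abbeys]\n[ADJ:ace]\n[ADJ_DIST:1]\n";

# Two capturing groups: the type, then everything up to the closing
# bracket. With modifier g in list context, all captures from all
# matches come back as one flat list.
my @pairs = $text =~ m{\[(ADJ|NOUN|VERB):(.*?)\]}g;

print join(', ', @pairs), "\n";    # NOUN, abbey:abbeys, ADJ, ace
```

Note that ADJ_DIST is skipped: after matching ADJ the pattern wants a colon, finds an underscore, and moves on.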
The main loop keeps going as long as there are items in @pairs, extracting
two items at each round and putting them into $type and $payload
respectively (line 13). The rest of the loop is pretty straightforward:
depending on the $type, a different part of the collecting hash is
populated.
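A loop of this kind might look like the following minimal sketch. Using splice to take two items at a time is an assumption (two calls to shift would work just as well), and so is the shape of the collecting hash %words.

```perl
use strict;
use warnings;

# Flat list of (type, payload) items, as a global match would return it.
my @pairs = (
    NOUN => 'abbey:abbeys',
    ADJ  => 'ace',
    VERB => 'act:acts:acted:acted:acting',
);

# Hypothetical collecting hash, one array per word type.
my %words = (ADJ => [], NOUN => [], VERB => []);

while (@pairs) {
    # Remove and return the first two items of the list.
    my ($type, $payload) = splice @pairs, 0, 2;
    if ($type eq 'ADJ') {
        push @{$words{ADJ}}, $payload;    # plain string
    }
    else {
        # NOUN and VERB carry several colon-separated forms.
        push @{$words{$type}}, [split /:/, $payload, -1];
    }
}

printf "%d adjective(s), %d noun(s), %d verb(s)\n",
    map { scalar @{$words{$_}} } qw< ADJ NOUN VERB >;
```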
Last, we leverage JSON::PP to print out the collected data as JSON.
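For reference, printing a data structure with JSON::PP can be as simple as the sketch below. The sample data and the choice of the canonical and pretty options are assumptions for this example; the script may use different settings.

```perl
use strict;
use warnings;
use JSON::PP;

# Sample data, shaped like a hypothetical collecting hash.
my %words = (
    ADJ  => ['ace'],
    NOUN => [['abbey', 'abbeys']],
);

# canonical sorts the keys so the output is stable across runs,
# pretty adds indentation and newlines.
my $encoder = JSON::PP->new->canonical->pretty;
my $json    = $encoder->encode(\%words);

print $json;
```

JSON::PP has been bundled with Perl since release 5.14, so there is nothing extra to install.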
So…
… we now have our public domain word list, in JSON format. Do you want a version available as of release 0.44.12 of Dwarf Fortress? Be my guest and find it here: words as json. In case you’re wondering: yes, it has been pretty-printed using jq 😅.