Info: This post was originally intended for a Patreon audience.
Procedurally Generated Haikus
Attempt #1
My first approach to making procedurally generated haikus was to find a database that had a list of words to syllables, then work from there.
- Count all of the vowel phonemes for each word in CMUDict and store the result in a database as the word's syllable count (see the first sketch after this list).
- Use the `2in12id.txt` file from the 12Dicts project to determine the part of speech for each word.
- Write an algorithm to generate all possible sequences of syllable counts that add up to a target number (like 5 or 7), then assign each item in the sequence a hardcoded part of speech based on the length of the sequence. (For example, a potential sequence is “3A 4N”, meaning a three-syllable adjective followed by a four-syllable noun.)
- Based on this list, select a random word from the database that matches each part of speech and syllable count (see the second sketch after this list).
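For the curious, the syllable counting boils down to counting phones that carry a stress digit, since every vowel phoneme in CMUDict is marked 0, 1, or 2. A minimal sketch (assuming a local copy of the classic `cmudict-0.7b` file, and keeping only the first pronunciation of each word):

```python
import re

def load_syllable_counts(path="cmudict-0.7b"):
    """Map each word to a syllable count by counting its vowel phonemes."""
    counts = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):
                continue  # skip blank lines and comments
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word)  # drop "(1)" alternate-pronunciation markers
            if word in counts:
                continue  # keep only the first pronunciation
            # Every vowel phoneme in CMUDict ends in a stress digit (0, 1, or 2).
            counts[word] = sum(1 for p in phones if p[-1].isdigit())
    return counts

counts = load_syllable_counts()
print(counts.get("POETRY"))  # 3
```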
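The sequence generation and word selection could look something like the sketch below. The part-of-speech patterns and the `words(word, pos, syllables)` table are illustrative stand-ins, not the exact hardcoded patterns or schema I used:

```python
import random
import sqlite3

def syllable_sequences(total, max_len=4):
    """Yield every list of per-word syllable counts that sums to `total`."""
    if total == 0:
        yield []
        return
    if max_len == 0:
        return
    for first in range(1, total + 1):
        for rest in syllable_sequences(total - first, max_len - 1):
            yield [first] + rest

# Hypothetical hardcoded part-of-speech patterns, keyed by how many words a line has.
POS_PATTERNS = {1: ["n"], 2: ["a", "n"], 3: ["a", "n", "v"], 4: ["a", "n", "v", "n"]}

def random_line(conn, total_syllables):
    """Pick a random syllable split, then fill each slot with a matching word."""
    seq = random.choice(list(syllable_sequences(total_syllables)))
    words = []
    for syllables, pos in zip(seq, POS_PATTERNS[len(seq)]):
        row = conn.execute(
            "SELECT word FROM words WHERE pos = ? AND syllables = ? "
            "ORDER BY RANDOM() LIMIT 1",
            (pos, syllables),
        ).fetchone()
        words.append(row[0] if row else "?")
    return " ".join(words)

conn = sqlite3.connect("haiku.db")  # hypothetical database built in the steps above
for count in (5, 7, 5):
    print(random_line(conn, count))
```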
This approach worked okay-ish, as you can see below:
sheepish languages
strong euphemisms crossed truth
terse hence unworthy
slickly shards feared elks
maritime advertisements
quickly maids boats wilds
loth spry objectives
cockamamie eyedropper
perilously voles
palatial meatloaf
corny affiliations
broke newspaperman
The only problem here is that I had to hand-select these haikus; otherwise, most sounded like complete gibberish rather than proper English. Obviously, this approach wasn’t going to be good enough.
Part of the problem is that the part of speech wasn’t determined very well. Since words have so many different meanings and uses, and I only had one entry per word, a word that was supposed to be a noun in a haiku would often end up being used as a verb.
Attempt #2
My second approach was to find a database mapping words to parts of speech, since that now mattered more to me than syllables, and then work from there.
- Select all of the words and their parts of speech from the WordNet database (`select lemma, pos from words left join senses on words.wordid = senses.wordid left join synsets on senses.synsetid = synsets.synsetid where pos = 'n'`).
- Remove every word from the database that contains a number, space, hyphen, dot, or forward/backward slash, because the tool in the next step can’t parse that input (`delete from words where word regexp '[ 0-9\-\\/.]+'`).
- Run a list of all the words from the database through CMUDict’s g2p-seq2seq program. What’s kind of cool about this step is that a neural net figures out how to pronounce all of the words (`g2p-seq2seq --decode words.txt --model g2p-seq2seq-cmudict > output.txt`).
- Count all of the vowel phonemes for each word in the output of the last step and store the result in the database as the word’s syllable count (see the first sketch after this list).
- Write a web crawler to download a bunch of haikus from tempslibres, the only haiku database I could find, and use them to record word frequency counts in the database (see the second sketch after this list).
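Counting the vowel phonemes in the g2p-seq2seq output works the same way as before, except the decoded phones may come back without stress digits, so checking against the ARPABET vowel set is the safer test. A rough sketch (assuming each line of `output.txt` is a word followed by its phones, separated by whitespace):

```python
# The 15 ARPABET vowel phonemes used by CMUDict.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def syllables_from_g2p(path="output.txt"):
    """Map each decoded word to a syllable count by counting ARPABET vowels."""
    counts = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            word, phones = parts[0], parts[1:]
            # Strip any stress digit before checking against the vowel set.
            counts[word] = sum(1 for p in phones if p.rstrip("0123456789") in VOWELS)
    return counts
```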
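And the word frequency step, once the haikus are downloaded, is just a big tally. A sketch, with the file layout and table name as placeholder assumptions rather than what the crawler actually produces:

```python
import re
import sqlite3
from collections import Counter
from pathlib import Path

def record_frequencies(haiku_dir="haikus", db_path="haiku.db"):
    """Tally word frequencies across downloaded haiku text files and store them."""
    freq = Counter()
    for path in Path(haiku_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8").lower()
        freq.update(re.findall(r"[a-z']+", text))

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frequencies (word TEXT PRIMARY KEY, count INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO frequencies (word, count) VALUES (?, ?)",
        freq.items(),
    )
    conn.commit()
    conn.close()
```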
This approach is… better-ish?
opening today
lost glossy cackle chainsaw
reverse circular
changed mausoleum
endoparasitic place
familiar tandem
wear headline iris
descending litter nail loon
fewer tornado
tiger forage queue
nice grand scratch familiar laugh
incomplete homeless
The words seem more modern and the haikus make a little more sense. This time I didn’t even curate the selection! Now that sufficient data is compiled, I can spend more effort on making the haiku structure generator better.