Eleanor Rigby, picks up the rice
In the church where a wedding has been
Lives in a dream
Eleanor Rigby, by the Beatles (Lennon-McCartney, 1966).
The song Eleanor Rigby has a distinctive syllabic pattern and rhyming scheme, and it's fun to try and come up with additional verses. But what if we could find them automatically, instead?
Since the people behind Clitoris Vulgaris (@clitoscope), a Twitter bot designed to "generate new species of clitoris by projecting botanical illustrations onto a 3D model", gave a talk at work about how they built it, I've been interested in the idea of making a bot of my own. Creating something to look for tweets that could be a new verse seemed doable as a first attempt.
I'd initially looked at using NLTK to do the language processing; one
corpus available for it, the Carnegie Mellon Pronouncing Dictionary, seemed
perfect for the project. However, I wasn't sure how to go about running an
NLTK-based app on Cloud Foundry, as it needs to download the corpus once
installed. Fortunately, someone else had already done the hard work for me
and wrapped that single corpus in a library of its own, which made the bot
trivial to deploy. The library lets me determine both the number of
syllables in each word and whether two given words rhyme.
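To illustrate how a pronouncing dictionary supports both checks, here is a minimal sketch. It does not use the library the bot depends on: `PRONUNCIATIONS` is a small hand-written excerpt in the dictionary's ARPAbet notation, and `syllable_count`, `rhyming_part` and `rhymes` are hypothetical helper names. The rhyme rule used here (identical phones from the last primary-stressed vowel onwards) is an assumption for the example, not necessarily the rule the bot applies.

```python
# Hand-written excerpt in the CMU Pronouncing Dictionary's ARPAbet notation.
# Vowel phones end in a stress digit (0, 1 or 2); consonant phones do not.
PRONUNCIATIONS = {
    "dream": ["D", "R", "IY1", "M"],
    "seem": ["S", "IY1", "M"],
    "rice": ["R", "AY1", "S"],
    "example": ["IH0", "G", "Z", "AE1", "M", "P", "AH0", "L"],
    "sample": ["S", "AE1", "M", "P", "AH0", "L"],
}


def syllable_count(word):
    """Count the vowel phones, i.e. those ending in a stress digit."""
    return sum(phone[-1].isdigit() for phone in PRONUNCIATIONS[word.lower()])


def rhyming_part(word):
    """Return the phones from the last primary-stressed vowel onwards."""
    phones = PRONUNCIATIONS[word.lower()]
    for index in range(len(phones) - 1, -1, -1):
        if phones[index].endswith("1"):
            return phones[index:]
    return phones  # no primary stress marked: compare the whole word


def rhymes(first, second):
    """Assumed rule: two words rhyme if their rhyming parts are identical."""
    return rhyming_part(first) == rhyming_part(second)


print(syllable_count("example"))     # 3
print(rhymes("dream", "seem"))       # True
print(rhymes("dream", "rice"))       # False
```

A rule this strict only accepts perfect rhymes, so near-rhymes with a different final consonant would be rejected.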
Using the Tweepy library made interacting with Twitter simple; I
subclassed the existing StreamListener with my own RetweetListener, which
takes an extractor (for extracting the text to process from a tweet) and a
filterer (for picking out the tweets that should be retweeted). This keeps it
very general for reuse later. Tweets are not exactly written in formal English
and contain things like usernames, hashtags and URLs that can't necessarily be
pronounced, so the current extractor takes the longest string of "clean
characters" (the letters A-Z plus some basic punctuation characters, not
including characters like # with Twitter-specific meanings) from each
tweet to pass to the filtering.
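The "longest string of clean characters" idea can be sketched with a regular expression. This is only an illustration: the exact character set the bot accepts isn't given above, so `CLEAN_RUN` is an assumed set (letters, spaces and a little punctuation), and `longest_clean_run` is a hypothetical name.

```python
import re

# Assumed "clean" characters: letters, spaces and some basic punctuation.
# Digits, '#', '@', '/', ':' and emoji all end a run.
CLEAN_RUN = re.compile(r"[A-Za-z ,.!?']+")


def longest_clean_run(text):
    """Return the longest run of clean characters, stripped of whitespace."""
    runs = (run.strip() for run in CLEAN_RUN.findall(text))
    return max(runs, key=len, default="")


print(longest_clean_run("3 cheers for the weekend, finally! #tgif 🎉"))
# cheers for the weekend, finally!
```

Note that this naive version still treats the letters of a hashtag or username as clean once the leading symbol has been skipped, so a real extractor would need extra handling for those.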
The actual classification happens in
PhraseMatcher.__call__, and the
overall process is pretty straightforward:
- Given a phrase string, e.g. 'here are some example words';
- Create a list of the individual words and their lengths in syllables, e.g. [('here', 1), ('are', 1), ('some', 1), ('example', 3), ('words', 1)];
- If any word couldn't be processed or the total syllable count doesn't match the pattern we're looking for, return False;
- Try to fit the words into the required syllable pattern, assuming that we want lines to break on words;
- If the phrase doesn't fit into the syllable pattern, return False;
- Return whether or not the phrase matches the required rhyming scheme.
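The fitting step in the middle of that process could look something like the sketch below. It is not the code from PhraseMatcher; `fit_to_pattern` is a hypothetical helper that takes the (word, syllables) pairs from the second step and a syllable pattern, and returns the lines if every line break falls between words. The final rhyme check is left out.

```python
def fit_to_pattern(words, pattern):
    """Split (word, syllables) pairs into lines matching ``pattern``.

    Returns a list of lines (each a list of words), or None if the total
    is wrong or a line boundary would fall in the middle of a word.
    """
    if sum(count for _, count in words) != sum(pattern):
        return None
    remaining = iter(words)
    lines = []
    for target in pattern:
        line, total = [], 0
        while total < target:
            word, count = next(remaining)
            line.append(word)
            total += count
        if total != target:  # the last word overshot the end of the line
            return None
        lines.append(line)
    return lines


words = [("here", 1), ("are", 1), ("some", 1), ("example", 3), ("words", 1)]
print(fit_to_pattern(words, (3, 4)))  # [['here', 'are', 'some'], ['example', 'words']]
print(fit_to_pattern(words, (4, 3)))  # None: 'example' straddles the break
```

Checking the total first means the loop can never run out of words before the pattern is complete.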
The scheme is defined with two sequences: one giving the number of syllables in each line, and one giving the rhyme group (if any) that each line belongs to. For example, the scheme to match a verse of Eleanor Rigby is:
    ELEANOR_RIGBY = PhraseMatcher(
        syllable_pattern=(5, 4, 9, 4),
        rhyming_scheme=(None, None, 0, 0),
    )
This means that:
- there should be four lines, with five, four, nine and four syllables respectively (for a total of 22); and
- the third and fourth lines must rhyme (but we don't care whether or not the first two lines rhyme with anything).
This representation makes the bot configurable for different phrases, including the other example I give in the README (note that here the rhyme scheme is marked with letters - it gets converted to a dictionary mapping group ID to line indices, so anything hashable is fine):
    # https://www.flickr.com/places/info/12591829
    napoli = [13.8509, 40.5360, 14.6697, 41.0201]
    # "when the moon hits your eye like a big pizza pie"
    that_is_amore = PhraseMatcher((3, 3, 3, 3), (None, 'a', None, 'a'))
    start_listening(napoli, that_is_amore)
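The conversion from a rhyming scheme to a dictionary of group ID to line indices might be done along these lines. This is a sketch of the idea rather than the bot's actual code, and `group_rhyming_lines` is a hypothetical name.

```python
from collections import defaultdict


def group_rhyming_lines(rhyming_scheme):
    """Map each rhyme group ID to the indices of the lines in that group.

    Lines marked None are not required to rhyme with anything and are skipped.
    """
    groups = defaultdict(list)
    for index, label in enumerate(rhyming_scheme):
        if label is not None:
            groups[label].append(index)
    return dict(groups)


print(group_rhyming_lines((None, None, 0, 0)))      # {0: [2, 3]}
print(group_rhyming_lines((None, 'a', None, 'a')))  # {'a': [1, 3]}
```

Because the labels are only ever used as dictionary keys, integers, letters or any other hashable value work equally well.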
The very first retweet is probably still the best so far:
It's kinda shitty when people say they care then act like they don't give two shits about you
Charley Emmett (@Ch4rl3y_B0y) April 6, 2017
It's kinda shitty, when people say
They care then act like they don't give two
Shits about you
Aside from an overly strict definition of rhyming (some rhymes from the actual song, like "grave" and "saved", aren't considered valid), the classification seems to be working pretty effectively. The extraction is not as good; it doesn't always find the real content of the status. For example, one of the weakest matches so far is:
Away to @NewportSarries tomorrow
End of the season is in sight
And it's going to be a glorious day 🌞just what you want as a front row
Malpas RFC (@MalpasRugbyClub) April 7, 2017
This is being classified as:
Tomorrow end of, the season is
In sight and it's going to be a
Most of the content of the tweet is being discarded, so it's not obvious from reading it where the verse is.
A few improvements I've thought of:
- To make it more obvious what the matched "verse" is, I could switch from retweeting to quoting, so the bot can include the match in its response. However, that might make it seem more invasive.
- Instead of the separate extraction and classification steps, the extraction could be based on the longest series of pronounceable words in the tweet. This would mean using the default no-op _all_text extractor in the RetweetListener and moving the extraction into the classification step.
- It might be nice to include pronounceable user names and hashtags in the match, although splitting those made up of multiple words may prove tricky.
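As a rough idea of how the second improvement might work, the sketch below finds the longest consecutive run of tokens that pass a pronounceability test. Both `longest_pronounceable_run` and the `KNOWN_WORDS` stand-in are hypothetical; in the bot, the test would presumably be a lookup in the pronouncing dictionary.

```python
def longest_pronounceable_run(tokens, is_pronounceable):
    """Return the longest consecutive run of tokens passing the test."""
    best, current = [], []
    for token in tokens:
        if is_pronounceable(token):
            current.append(token)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []  # an unpronounceable token ends the current run
    return best


# Stand-in for a dictionary lookup (an assumption for this example).
KNOWN_WORDS = {"away", "to", "end", "of", "the", "season", "is", "in", "sight"}

tokens = "away to @newportsarries 2moro end of the season is in sight".split()
print(longest_pronounceable_run(tokens, KNOWN_WORDS.__contains__))
# ['end', 'of', 'the', 'season', 'is', 'in', 'sight']
```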