Here’s a challenge for the programmers out there. Once upon a time, when I was a younger man with “hacking” time on my hands, I would have already been all over this. Instead, now I’ll put it out into the universe in the hopes that someone else is in a position to take up the challenge.
A recent post on Reddit claimed to be a recitation of the sonnets over some video from one of the Halo computer games. The problem: it was just classic computer text-to-speech (TTS), which, as you’ve probably experienced, is painful to listen to – no pacing, no inflection, no sense at all of what it’s reading. I suggested to the original poster that it was like listening to somebody read the dictionary.
But then I got an idea. Text-to-speech technology has actually gotten much better than it was in the old days. Machine learning has given the engines some degree of understanding of how words go together, and of what punctuation is actually for. In fact, Amazon offers a cloud service known as Polly that is specifically all about “lifelike speech”.
So now I’m wondering: what would it take to tweak a TTS engine to make a reasonable recitation of the sonnets? Something that feels the iambic pentameter, and sounds like it’s actually reading poetry as it was intended to be read. Of course, it’s not going to be Alan Rickman quality, I get that. I’m just wondering if it can be better.
There are a couple of ways to go about this. The obvious one is to pick a sonnet, and then manually transcribe the text into the special markup (SSML, in Polly’s case) that tells the TTS engine exactly what to do: emphasize this syllable over that one, pause longer here, make an “oo” sound here instead of an “oh”.
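To make that concrete: Polly accepts SSML, which has tags for exactly these knobs. A first pass at the opening of Sonnet 18 might look something like this (the specific levels and pause lengths are just my guesses at a starting point, to be tweaked by ear):

```xml
<speak>
  <prosody rate="slow">
    Shall I com<emphasis level="moderate">pare</emphasis> thee to a
    summer's <emphasis level="moderate">day</emphasis>?
    <break time="400ms"/>
    Thou art more <emphasis level="moderate">love</emphasis>ly and more
    <emphasis level="moderate">tem</emphasis>perate.
    <break time="600ms"/>
  </prosody>
</speak>
```

Polly also supports a `<phoneme>` tag for overriding pronunciation outright, which is how you’d force that “oo” instead of an “oh”.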
That by itself is maybe a couple of hours of work. A proof of concept, as they say in our biz. Tweak it, play it, go back and tweak it some more.
But then, can we learn from that? Can we bring machine learning into it? You’d probably need to do this for more than one sonnet, but I think you could fairly easily train something on those few, let’s say half a dozen, to extract the underlying patterns. Then turn it loose on the next couple and see what you get.
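Before any real machine learning, you could even hard-code the baseline pattern the model would hopefully discover: iambic pentameter stresses every second syllable. A toy sketch in Python (the syllable split here is done by hand; a real pipeline would need a syllabifier, which I’m waving away):

```python
def iambic_ssml(syllables):
    """Wrap every second syllable in an SSML emphasis tag.

    An iamb is unstressed-then-stressed, so the stress lands on
    syllables 2, 4, 6, ... (odd indices when counting from zero).
    """
    out = []
    for i, syl in enumerate(syllables):
        if i % 2 == 1:  # the stressed half of each foot
            out.append(f'<emphasis level="moderate">{syl}</emphasis>')
        else:
            out.append(syl)
    return "".join(out)

# Syllables split by hand for the first line of Sonnet 18:
line = ["Shall", " I", " com", "pare", " thee", " to",
        " a", " sum", "mer's", " day"]
print(iambic_ssml(line))
```

Of course real recitation breaks the meter constantly (that’s half the art), which is exactly why you’d want a trained model rather than this rule – but it gives the learner something to measure its output against.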
Like all machine learning, it would be an incremental exercise, constantly going back and throwing more training data at it until you start to be happy with the results. But how cool would it be if you had to train it on less (substantially less) than the entire 154, and before you were done it was reciting the remaining ones on its own?
Then, for the real fun, switch gears and throw some soliloquies at it and see how it does!
Who’s up for the challenge? The more I describe it, the more I wish I could tackle it myself. Maybe I’ll end up trying a manual transcription job anyway, just to kick it off and see where I get.