My two rap-related projects, Raplyzer, which analyzes the rhyme density of different rappers, and DeepBeat, which is a rap lyrics generating AI, were widely covered in the media last year. But with the fame come the haters. The purpose of this post is to prove that my haters are wrong! (For real: I honestly don’t consider anyone a hater, nor will there be any proofs in this post. Rather, I’ll present some quantitative evidence for the validity of the algorithms but also discuss their limitations.)
For those of you not yet familiar with Raplyzer and DeepBeat, I would recommend reading our paper or watching the short video below.
Next I’ll analyze some of the arguments presented against the algorithms by three different “haters”.
Hater #1: Twista
After the DeepBeat paper had gone viral, Bloomberg TV contacted me saying they wanted to make a segment on the algorithm featuring a rap battle between DeepBeat and Twista! I was obviously super excited, but unfortunately, DeepBeat wasn’t yet hooked up to a speech synthesizer that would have let it speak for itself, so I just had to send them a few lyric files.
Overall, Twista was not particularly impressed by DeepBeat’s skills (the full clip can be found here).
Fair enough, I understand his point of view: although there’s quantitative evidence that DeepBeat’s line selections correlate with human preferences (see the end of this article), there’s admittedly a lot of room for improvement. Furthermore, I would hardly be amused either if some rapper claimed to write 21% better machine learning papers than me 😉
Hater #2: Skippy Mac
This dude, MC Skippy Mac, posted the following video on YouTube, promising to present some “mathematical errors within the [Raplyzer] algorithm”.
First of all, I was obviously quite flattered when I discovered that somebody had made an 8-minute video discussing my work (in academia, you rarely get such detailed feedback – not even from your peer reviewers!). Let’s take a look at what he had to say. (In the interest of space, I have to omit some of his arguments, but I’d be happy to discuss them, e.g., in the comment section if somebody is interested.)
“It doesn’t take advanced rap skills to make long multis [but] to make long multis while still making a crisp point, employing a clever word play, conveying a deep message throughout the song – shit like that.”
Well, this is almost verbatim from the last section of my original blog post but it’s a very important point and thus worth repeating. I’ve come across some aspiring rappers who try to force almost every word of their lyrics into a multisyllabic rhyme, and it often doesn’t sound good. However, if you can do so while conveying a coherent message, it shows advanced rap skills.
In my opinion, Shai Linne, who was ranked 4th, is a great example of the latter; it’s quite remarkable how he’s able to discuss deep theological topics and simultaneously deliver mad multisyllabic rhymes.
I believe that this sort of criticism mostly originates from all the news headlines saying something like “the best rapper alive, as decided by computers” and other misleading exaggerations. So one more time: Raplyzer only measures the technical quality of the lyrics – not the content.
“The algorithm would need to calculate end rhymes and internal rhymes separately cause they’re not equal in value.”
As a reminder of the difference between the two, consider the following lines from Eminem’s Rap God:
Made a living and a killing off it
Ever since Bill Clinton was still in office
Here living – killing forms an internal rhyme whereas killing off it – still in office is a (multisyllabic) end rhyme.
I agree that it’s mostly the end rhymes that catch the listener’s attention, whereas internal rhymes, while an important component of a good flow, are more subtle. However, if you want to come up with a single score, you’re faced with having to decide exactly how much more important the end rhymes are, which is something I wanted to avoid.
And then there’s the practical issue of how to automatically separate the two, which Skippy Mac also refers to in his video. I initially thought of simply using the line breaks found in the lyric files but quickly noticed that they are very inconsistent; one user transcribing the lyrics might put four beats per line whereas another puts eight. I think it should be possible to detect line boundaries automatically by looking at the content of the lyrics and the locations of the detected rhymes, but this, although an interesting exercise, would have been beyond the scope of the project.
By the way, speaking of different rhyme types, what I consider a bigger limitation of Raplyzer is that it currently doesn’t properly recognize imperfect rhymes. You might have a long multisyllabic assonance rhyme that sounds legit, but if there’s even a single non-matching vowel sound in the middle of the rhyme, it breaks the pattern and roughly halves that rhyme’s score – even if the non-matching vowel sound is unstressed. There are methods for detecting imperfect rhymes which could be used to improve Raplyzer.
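To make this concrete, here’s a toy sketch (my own illustration, not Raplyzer’s actual code) of what strict vowel matching does to a long rhyme when a single vowel sound in the middle doesn’t match:

```python
# Toy illustration only – not Raplyzer's actual code. Under strict vowel
# matching, the rhyme length of two phrases is the length of the longest
# common suffix of their vowel-sound sequences.
def strict_rhyme_length(vowels1, vowels2):
    """Length of the longest common suffix of two vowel-sound sequences."""
    n = 0
    while n < min(len(vowels1), len(vowels2)) and vowels1[-1 - n] == vowels2[-1 - n]:
        n += 1
    return n

# A six-vowel assonance rhyme scores 6...
a = ["ɪ", "ə", "ɪ", "ɪ", "ɔ", "ɪ"]
b = ["ɪ", "ə", "ɪ", "ɪ", "ɔ", "ɪ"]
print(strict_rhyme_length(a, b))  # 6

# ...but one non-matching vowel in the middle truncates the match to
# whatever follows the mismatch, roughly halving the score.
b[2] = "ʊ"
print(strict_rhyme_length(a, b))  # 3
```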
“It sounds like you take the consonants out and match the vowel sounds to measure the multis – that’s gonna skew your data also. […] For instance, ‘fork’ and ‘scorch’ rhyme but ‘fork’ and ‘scotch’ do not”
On the contrary, Raplyzer correctly recognizes that scorch rhymes with fork but scotch does not. This is in fact a great example, as it shows the importance of getting a phonetic transcription before running the analysis. According to eSpeak, the phonetic representations of the three words are skɔːɹtʃ, fɔːɹk, and skɑːtʃ: the vowel sound of the first two words (ɔː) matches, while that of the third (ɑː) does not.
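For the curious, here’s a minimal sketch of this kind of check. It assumes an espeak (or espeak-ng) binary on the PATH that supports the --ipa flag, and the vowel inventory in the regex is a rough, hand-picked assumption rather than anything exhaustive:

```python
import re
import subprocess

# Rough, hand-picked set of IPA vowel symbols (an assumption, not a
# complete inventory), with an optional length mark.
IPA_VOWELS = re.compile(r"[iɪeɛæaɑɒɔoʊuʌəɜ]ː?")

def vowel_sounds(word):
    """Transcribe a word with eSpeak and keep only its vowel sounds."""
    ipa = subprocess.run(
        ["espeak", "-q", "--ipa", "-v", "en-us", word],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return IPA_VOWELS.findall(ipa)

def last_vowel_matches(word1, word2):
    """Strict check: do the final vowel sounds of the two words match?"""
    v1, v2 = vowel_sounds(word1), vowel_sounds(word2)
    return bool(v1) and v1[-1:] == v2[-1:]

print(last_vowel_matches("fork", "scorch"))  # True:  ɔː vs ɔː
print(last_vowel_matches("fork", "scotch"))  # False: ɔː vs ɑː
```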
“All of this means something that we knew before we read your article: obviously you can’t plug a text version of lyrics into an algorithm and get any sort of meaningful results”
Nowadays, machines can do many things that, in my opinion, are much more impressive than algorithmic analysis of rap lyrics (like automatic caption generation for images), so this is not obvious to me at all. I would instead argue that Raplyzer does capture a crucial component of good rap lyrics. First, this claim is supported by the observation that many rappers whom people commonly perceive as technically skilled (like Rakim and Tech N9ne) can be found at the top of the rankings. Second, I conducted a small experiment to quantitatively evaluate the rhyme density measure.
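As a refresher: rhyme density is, roughly, the average rhyme length per word – for each word, find the longest matching vowel sequence ending at some earlier word within a fixed window, then average those lengths over the whole lyrics. Here’s a toy sketch of that idea (my paraphrase of the original post, not the actual Raplyzer implementation; the window size is an arbitrary assumption):

```python
# Toy sketch of the rhyme density idea – not the actual Raplyzer code.
def rhyme_density(vowel_seqs, window=15):
    """vowel_seqs: one vowel-sound sequence per word, in lyric order."""
    if not vowel_seqs:
        return 0.0
    total = 0
    for i, cur in enumerate(vowel_seqs):
        best = 0
        # Compare against the preceding words within the window.
        for prev in vowel_seqs[max(0, i - window):i]:
            n = 0
            while n < min(len(cur), len(prev)) and cur[-1 - n] == prev[-1 - n]:
                n += 1
            best = max(best, n)
        total += best
    return total / len(vowel_seqs)
```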
Shortly after I had published the blog post, a rapper called Ahmen contacted me and asked me to analyze the lyrics of his first album. I promised to do so but on one condition: before I would reveal the results, I asked him to rank his own songs “starting from the most technical according to where you think you have used the most and the longest rhymes” which he kindly agreed to. Here is a comparison of Ahmen’s own rankings and the rankings assigned by the algorithm:
| Track Title | Artist Rank | Raplyzer Rank | Rhyme Density |
|---|---|---|---|
| Ahmen vs Everybody | 1. | 1. | 1.542 |
| And One | 2. | 4. | 1.214 |
| Headphones | 3.-4. | 9. | 0.930 |
| No Option | 3.-4. | 3. | 1.492 |
| Team | 5. | 2. | 1.501 |
| Hand Down | 6.-7. | 10. | 0.909 |
| Troublemaker | 6.-7. | 7. | 1.047 |
| I, Cypher | 8.-9. | 6. | 1.149 |
| When I Was Up | 8.-9. | 5. | 1.185 |
| My Legacy | 10. | 8. | 1.009 |
| Samuel L Jackson | 11. | 11. | 0.904 |
You can see that the algorithm and the artist agree on both the first and the last song. Furthermore, the correlation between the two ranking lists is statistically significant, so we can reject the null hypothesis that the rhyme density measure is independent of the rapper’s own notion of technically skilled lyrics.
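Out of curiosity, here’s how such a test could be run. The post doesn’t state which test was used, so Kendall’s τ (which handles the tied ranks) is my assumption here, with ties like 3.-4. encoded as their midpoint 3.5:

```python
from scipy.stats import kendalltau

# Ranks from the table above; tied ranks are encoded as midpoints.
artist   = [1, 2, 3.5, 3.5, 5, 6.5, 6.5, 8.5, 8.5, 10, 11]
raplyzer = [1, 4, 9,   3,   2, 10,  7,   6,   5,   8,  11]

tau, p_value = kendalltau(artist, raplyzer)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")
```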
Hater #3: Peer reviewers
The only “haters” that actually made my day worse were the reviewers who rejected the DopeLearning paper from the conference we initially submitted it to. I was quite disappointed even though I knew that the criticism they presented was mostly spot on.
Based on the reviews, we made several changes to the paper, including changing the experimental setup for the next-line prediction task and adding relevant citations. The main improvement was to perform human evaluations of the generated lyrics. Instead of merely showing some lyrics to human raters and asking whether they are dope or not (or whether they are computer-generated or not), we figured we needed something more clever to collect enough data to draw meaningful conclusions.
Fortunately, a similar problem had been tackled before in the context of optimizing search engines using clickthrough data. This pushed us to implement deepbeat.org and adopt the following strategy: when a user clicks the “Suggest rhyming line” button and selects one of the suggested lines, the user indicates that the selected line is more suitable than the lines above it. This allows us to compare human preferences with the scores that the algorithm has assigned to the lines.
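In code, the preference-extraction idea looks roughly like this (a hypothetical sketch, not deepbeat.org’s actual logging code):

```python
# When a user picks the suggestion at position selected_index, every
# line ranked above it becomes the "loser" of one preference pair.
def preference_pairs(suggestions, selected_index):
    """Return (preferred, less_suitable) pairs implied by one click."""
    chosen = suggestions[selected_index]
    return [(chosen, skipped) for skipped in suggestions[:selected_index]]

ranked = ["line A", "line B", "line C", "line D"]  # algorithm's ranking
print(preference_pairs(ranked, 2))  # user picked "line C"
# [('line C', 'line A'), ('line C', 'line B')]
```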
The following plot summarizes the results and shows that there is indeed a clear correlation between human preferences and algorithm preferences for suitable next lines. (An anonymized version of the dataset has been published.)

[Figure: The probability that a deepbeat.org user selects the line with the higher score, as a function of the (binned) score difference between two lines.]
After implementing the improvements, we submitted the revised paper to the leading data mining conference, KDD, and got very positive reviews. So the next time you can hear DeepBeat perform will be in San Francisco in about two weeks! 🙂
Final words to all my haters: I honestly appreciate your feedback, so please keep hating so that I can keep improving my work in the future. And for now, I just wanna leave you with this video: