Newcomb-Benford’s Law and April Fools on Reddit

April Fools’ Day is a day celebrated with pranks, and the internet is no exception. On April 1 of this year, data_irl, a Reddit forum dedicated to comedic data visualizations, took over its more serious counterpart, the dataisbeautiful subreddit. For the day, playful data_irl posts inundated the sober dataisbeautiful community. In response, the dataisbeautiful moderators posted the data from the takeover and used it as the dataset for their monthly dataviz battle, a competition where users submit data visualizations.

The competition provided the dataset as a pastebin of post links. The first thing I did was visit a few of the posts at random. The first thing I noticed was that the posts were funny. The second thing I noticed was that the scores were distributed over several orders of magnitude. Also, because of the mechanics of Reddit’s ranking algorithm, high-scoring posts gain prominence and therefore garner more points. This seemed like a situation where the Newcomb-Benford Law would apply. In short, the law says that in many naturally occurring sets of numbers the leading digit is more likely to be small: about 30% of values start with ‘1’, and fewer than 5% start with ‘9’. So, in this case, posts with a score starting with the digit ‘1’ should be more prevalent than posts with a score starting with ‘2’, and so on.
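To put numbers on "more likely to be small": the law predicts that a leading digit d turns up with probability log10(1 + 1/d). Here is a minimal Python sketch of that formula. The function name `benford_expected` is my own choice for illustration and does not come from any library.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    # Newcomb-Benford: P(d) = log10(1 + 1/d)
    return math.log10(1 + 1 / digit)


# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.1%}")
```

The nine probabilities add up to 1, falling from roughly 30.1% for a leading ‘1’ to roughly 4.6% for a leading ‘9’.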

To see if the scores obeyed Newcomb-Benford, I wrote a quick Python script using the PRAW library. The script scraped the score from each of the 567 posts and tallied the leading digits. I was surprised, not because the scores were consistent with the law but because of how closely they followed it. Check it out:
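The script itself isn’t shown here, but the tallying step is simple enough to sketch. The Python below is a rough reconstruction under my own assumptions, not the original code. The helper names (`leading_digit`, `tally_leading_digits`, `fetch_scores`) are made up for illustration, and `fetch_scores` assumes you pass in an already-authenticated `praw.Reddit` instance along with the post URLs from the pastebin.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def leading_digit(score: int) -> Optional[int]:
    """Return the first non-zero digit of a score, or None for a score of 0."""
    digits = str(abs(score)).lstrip("0")
    return int(digits[0]) if digits else None


def tally_leading_digits(scores: Iterable[int]) -> Dict[int, int]:
    """Count how many scores start with each digit from 1 to 9."""
    counts = Counter()
    for score in scores:
        digit = leading_digit(score)
        if digit is not None:  # a score of 0 has no leading digit
            counts[digit] += 1
    return {d: counts.get(d, 0) for d in range(1, 10)}


def fetch_scores(urls: Iterable[str], reddit) -> List[int]:
    """Hypothetical helper: look up each post's score through PRAW.

    `reddit` is assumed to be a configured praw.Reddit instance.
    """
    return [reddit.submission(url=url).score for url in urls]


# Made-up scores for demonstration only, not the real dataset.
sample_scores = [1, 14, 123, 27, 2048, 9, 0, 310]
print(tally_leading_digits(sample_scores))
```

Dividing each count by the number of non-zero scores gives the observed shares, which can then be set side by side with the shares the law predicts.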

I thought the result was surprising, so I submitted it to the battle and got an honourable mention for my post. Hope you enjoyed!

Sources:

pastebin, Newcomb-Benford Law.

Notes:

In summary, not including a Rickroll in a post even tangentially related to April Fools is hard.
