A Better Way to Rank a Huge Number of Things

Vote-based news sites are biased towards posts that are already well ranked. As a result, they often miss good posts with just a few votes. We should use statistics to focus votes on the posts that we know the least about.

I’m really into crowd-sourced news sites like Hacker News or Reddit. I read them while I eat breakfast, when I want a break from work, and sometimes when I don’t have an excuse. It is from this perspective of admiration that I’d like to point out how much better these sites could be.

Problem: Cumulative Advantage

These sites often aspire to rank posts solely on quality – the best posts at the top, the worst at the bottom. However, votes are not equally distributed. The higher a post is ranked, the more votes it tends to get. The posts that, by chance alone, do well on their first few votes are much more likely to escape obscurity than their equally awesome, but unlucky, counterparts. For rankings that aspire to be based on quality alone, this is a failing.
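To make this dynamic concrete, here is a minimal simulation sketch. It assumes a toy model that is not taken from any real site: every post has identical quality, and each visitor votes for one post chosen with probability proportional to its current votes plus one. The function name and parameters are hypothetical.

```python
import random


def simulate_votes(n_posts=20, n_visitors=2000, seed=1):
    """Simulate voting on posts of identical quality.

    Assumption (toy model): a visitor picks a post with probability
    proportional to (current votes + 1), standing in for "higher-ranked
    posts are seen, and voted on, more often".
    """
    rng = random.Random(seed)
    votes = [0] * n_posts
    for _ in range(n_visitors):
        weights = [v + 1 for v in votes]
        chosen = rng.choices(range(n_posts), weights=weights, k=1)[0]
        votes[chosen] += 1
    return votes


if __name__ == "__main__":
    # Print final vote counts, most-voted post first.
    print(sorted(simulate_votes(), reverse=True))
```

Under this model, any difference between the final vote counts comes from early luck alone, since the posts are identical by construction.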

For sites with a user reputation system, this dynamic extends to users as well. The users with better reputations have an advantage. Their posts get more attention and are less likely to be overlooked. This advantage helps users with good reputations improve their reputations further. By making it harder for new users, these sites may even be inhibiting their own growth.

Problem: Blind Spot Caused by Limited Resources

This focus of voters on highly ranked posts creates another, more straightforward, problem. Votes don’t go where they are most needed: to the posts we know the least about. As a result, many very interesting posts are simply overlooked.

Solution: Encourage Voting on Posts We Know the Least About

Ideally, votes would focus on the posts that might be good, so that we could discover whether they actually are. This would allow the site to notice awesome posts submitted from unknown sources.

The site would need to measure the uncertainty in its ranking, then direct visitors to vote on the most uncertain posts. Happily, measuring uncertainty is possible; it is done all the time with confidence intervals and Bayesian credible intervals. The wider the interval, the greater the uncertainty, and therefore the more a post needs votes. The site would just need to track an interval or probability distribution for each post’s score rather than a single value. It could then still rank posts by “best guess” point estimates while knowing which of those estimates are the most reliable.
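As a rough illustration of tracking an interval rather than a single value, the sketch below assumes simple up/down votes and a uniform Beta(1, 1) prior on a post's upvote rate. Neither assumption comes from this article, and the function names are hypothetical. The posterior is then Beta(upvotes + 1, downvotes + 1), and the interval is approximated by sampling so that only the standard library is needed.

```python
import random


def posterior_mean(upvotes, downvotes):
    """'Best guess' upvote rate under an assumed uniform Beta(1, 1) prior."""
    return (upvotes + 1) / (upvotes + downvotes + 2)


def credible_interval(upvotes, downvotes, level=0.95, draws=20000, seed=0):
    """Approximate an equal-tailed credible interval for the upvote rate.

    Samples from the Beta(upvotes + 1, downvotes + 1) posterior and reads
    off the empirical quantiles.
    """
    rng = random.Random(seed)
    a, b = upvotes + 1, downvotes + 1
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    low = samples[int(tail * draws)]
    high = samples[int((1.0 - tail) * draws) - 1]
    return low, high


if __name__ == "__main__":
    # Same 80% upvote ratio, very different amounts of evidence.
    for up, down in [(8, 2), (800, 200)]:
        low, high = credible_interval(up, down)
        print(up + down, "votes:", round(posterior_mean(up, down), 3),
              "interval width:", round(high - low, 3))
```

Both posts have a similar point estimate, but the post with 10 votes has a much wider interval than the post with 1,000 votes, which is the signal that it needs more votes.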

Problem: The High Cost of Barrier-Based Spam Prevention

Spam is an inevitable force in open mass communication. Our desire to escape v1agra ads is partly why we use ranking in the first place; the highly ranked posts are less likely to be spam. Still, there is so much spam that it can swamp a news site. So, news sites make it harder for people to submit spam by requiring user registration, using hard-to-read CAPTCHAs, and requiring a minimum reputation before a user can submit.

These barriers to spam work, but they are also hurdles to accepting high-quality posts. Like an e-mail address that only accepts messages from people already in your address book, spam barriers block a lot of good information.

Solution: Spam Filter

A good spam filter is better than these barriers. Spam filters don’t require people to remember passwords or decipher the text in blurry images.

By automating the easy judgments, a crowd-sourced news site could remove the barriers. This would make the site more user-friendly and help focus votes on posts that might be good.

It Should Exist

I think crowd-sourced news sites should be better. They should use measures of uncertainty to direct votes towards posts that have potential to be good but that haven’t been noticed yet. They should remove barriers.

So, I built a prototype of such a site. It features:

  • A more efficient Bayesian estimator to rank posts
    • 3,000 votes aren’t necessary where 10 will do
  • Chooses which posts visitors vote on
    • Based on uncertainty
    • Makes the site harder to game
  • Removes barriers (No accounts)
    • Neither you nor the site wants to keep track of your password
    • Makes it easier for people to vote and submit good articles
  • Automation of easy decisions
    • Uses votes from similar posts to estimate the ranking of new posts
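
One way "chooses which posts visitors vote on, based on uncertainty" could work is sketched below. This is not the prototype's actual method, which is not described here. It assumes up/down votes, a uniform Beta(1, 1) prior, and a hypothetical dict mapping post ids to (upvotes, downvotes), and it simply shows the post whose posterior upvote rate has the largest standard deviation.

```python
from math import sqrt


def posterior_std(upvotes, downvotes):
    """Standard deviation of the Beta(upvotes + 1, downvotes + 1) posterior."""
    a, b = upvotes + 1, downvotes + 1
    n = a + b
    return sqrt(a * b / (n * n * (n + 1)))


def next_post_to_show(posts):
    """Return the id of the post we know the least about.

    `posts` maps a post id to an (upvotes, downvotes) tuple.
    """
    return max(posts, key=lambda post_id: posterior_std(*posts[post_id]))


if __name__ == "__main__":
    # Hypothetical vote counts for illustration only.
    posts = {"old": (300, 100), "mid": (12, 9), "new": (1, 0)}
    print(next_post_to_show(posts))
```

A post with no votes at all has the largest standard deviation under this prior, so new submissions are shown first. A variant could rank by the posterior mean plus a multiple of the standard deviation, to favor posts that "might be good" over posts that are merely uncertain.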

Check it out: DecidingData.com
Then let me know what you think.

If you like my work, consider connecting with me on LinkedIn.

If you have any suggestions about this article, let me know.