Let’s take a walk through the history of machine learning at Reddit, from its early days in 2005 to where it is today, including the pitfalls and mistakes made along the way as well as current ML projects and future efforts in the space. Based on a talk given by Anand Mariappan, Senior Director of ML at Reddit, at ODSC West 2018, we’ll cover the data platform, ML efforts around anti-evil and abuse, feed ranking, recommendations, user and subreddit similarity, and the business results achieved and how Reddit measures them through its experimentation systems.
First, What is Reddit?
Reddit, at its core, is a social network of communities that helps users find people with similar passions and connect around those ideas. It’s designed to allow open interaction and shared information, without the other pressures of modern social media. It has incredible engagement and is a bigger platform than most users realize.
This is all achieved through four different features.
Pseudonymous: By letting users create profiles without their real information, Reddit has allowed them to have honest conversations they may not be able to have in real life. While there have been some issues with people bullying because they perceive no repercussions, pseudonymity has been an overwhelmingly positive feature and is key to the Reddit user experience.
Community and Rules: Part of the solution to anonymous bullying is the idea of communities and rules. To participate in a Reddit thread, you join a community and have to follow its posted rules. If a user breaks those rules, whether by bullying, spamming, self-promotion, or violating whatever other rules the community has set, the moderators are able to remove them.
Voting: Voting is also an integral part of Reddit, which both encourages genuinely high-quality content and helps the Reddit team track content and recommend it to other users who may be interested in it.
HotPage: Finally, Reddit uses a “HotPage” as their home page—a place for all of the top content of the day to aggregate and be shown across users with different interests. This serves to, again, reward good content, encourage meaningful discussion, and diversify the content users are seeing.
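The hot ranking behind that page is essentially a time-decayed vote score. Here is a minimal sketch in Python, following the formula from Reddit’s old open-source codebase rather than anything described in the talk (the constants may well have changed since):

```python
from datetime import datetime, timezone
from math import log10

def hot(ups, downs, posted_at):
    """Time-decayed score: newer posts need fewer net upvotes to rank the same."""
    s = ups - downs
    order = log10(max(abs(s), 1))
    sign = 1 if s > 0 else -1 if s < 0 else 0
    # 1134028003 is the epoch offset used in Reddit's open-source code
    seconds = posted_at.timestamp() - 1134028003
    return round(sign * order + seconds / 45000, 7)

now = datetime.now(timezone.utc)
print(hot(500, 50, now))  # a fresh, well-upvoted post scores high
```

The key design choice is that recency is added to the logarithm of the net score, so a flood of upvotes on an old post can’t keep it on top forever.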
Together, this creates a platform that is user-friendly, rewards real interaction, and makes what Reddit calls a “weaponized procrastination tool” (which, don’t we all want that sometimes?).
The History of Reddit
Reddit launched in 2005, right in between Facebook and Twitter, and it was originally built as a competitor to del.icio.us, a link-saving website. At that point, Reddit was basically just the HotPage: a huge accumulation of content with no real organization.
The first big change they wanted to include was to create a personalized recommended tab for users to browse through, which would be based on users with similar upvotes.
To try to accomplish this, they built a matrix of all up and down votes, intending to show people content that had been upvoted by users with voting patterns similar to their own. But it didn’t quite work as hoped: the median upvote count was close to one, so there was no real diversity in what people saw on the HotPage and, therefore, no real diversity in what could be recommended. Using upvotes alone didn’t work.
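As a rough illustration of the idea (not Reddit’s actual code), a vote matrix like this can be used to find users with similar voting patterns and score unseen posts by their votes:

```python
import numpy as np

# Hypothetical sketch: a tiny user x post vote matrix (+1 upvote, -1 downvote, 0 no vote).
# The shapes and values are made up for illustration.
votes = np.array([
    [ 1,  1,  0, -1],   # user 0
    [ 1,  0,  1,  0],   # user 1
    [ 0, -1,  1,  1],   # user 2
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

def recommend(user_idx, votes, k=1):
    """Score posts the user hasn't voted on by the votes of similar users."""
    sims = np.array([cosine_sim(votes[user_idx], votes[j])
                     for j in range(votes.shape[0])])
    sims[user_idx] = 0.0                       # ignore self-similarity
    scores = sims @ votes                      # weighted sum of other users' votes
    scores[votes[user_idx] != 0] = -np.inf     # drop posts already voted on
    return np.argsort(scores)[::-1][:k]

print(recommend(0, votes))  # top unseen post for user 0
```

When nearly every user has voted on only one thing, those similarity scores are almost meaningless, which is exactly the sparsity problem described above.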
After that, Reddit introduced SubReddits, or communities. Instead of one big pile of content, things could be sorted and categorized. Posts became pockets of information, grouped into individual topics that people could follow. With that, the data team could also understand how the communities interacted and cross-posted. You could see the overlap between, say, r/gamers and r/twoxchromosomes to learn what women gamers were posting about, and then better recommend content to them.
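A toy sketch of that kind of community-overlap measure, here using Jaccard similarity over made-up sets of active users (the talk doesn’t describe the actual metric or pipeline):

```python
# Hypothetical sketch: subreddit-to-subreddit overlap from shared active users.
# The user sets are invented; the subreddit names come from the example above.
active_users = {
    "r/gamers":          {"alice", "bob", "carol", "dan"},
    "r/twoxchromosomes": {"carol", "dan", "erin"},
    "r/soccer":          {"frank", "grace"},
}

def jaccard(sub_a, sub_b):
    """Fraction of the combined user base that the two communities share."""
    a, b = active_users[sub_a], active_users[sub_b]
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard("r/gamers", "r/twoxchromosomes"))  # 0.4 -> meaningful overlap
print(jaccard("r/gamers", "r/soccer"))           # 0.0 -> no shared users
```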
The SubReddits were a great way to organize the content, but the machine learning team knew they’d also need well-organized, well-defined data points if they wanted to better incorporate machine learning into the recommendation platform. They introduced a data pipeline and went through a couple of iterations until they got it right.
In 2014, the team added the Midas Enrichment layer, which collected interesting signals to help understand what users would want to view in their feeds. With it, they tried enriching the data with geolocation and the type of server users connected to, but realized neither was very good at predicting preferred content.
In 2016, Reddit incorporated the Miskey machine learning recommendation service, which looks at user interactions, pulls in data, and collects insights to make good recommendations for content users may like.
In 2017, Reddit adopted BigQuery, which was simple to manage and easy to connect their data to. They were finally able to bring a true data culture to Reddit: with BigQuery, they understood that all decisions had to be grounded in the content data they were receiving.
Now that they had the machine learning well enough in place, their next step was better onboarding. They wanted to make it easier for people to create accounts and find content they would be interested in. They hired a cartographer, basically a mix between a librarian and a data analyst, who looked at all the different subreddits and created categories to house them. Categories like health, gaming, cities, and sports could each hold more specific communities. With that change, new users could select what they’re interested in and Reddit could recommend content from there, as in the sketch below.
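A minimal sketch of what such a category map might look like; the category names and subreddit lists here are invented for illustration, not Reddit’s real taxonomy:

```python
# Hypothetical onboarding map: interest category -> example subreddits.
categories = {
    "gaming": ["r/gaming", "r/boardgames"],
    "health": ["r/fitness", "r/nutrition"],
    "sports": ["r/soccer", "r/nba"],
}

def starter_feed(selected_interests):
    """Seed a new user's feed from the categories they picked at sign-up."""
    recs = []
    for interest in selected_interests:
        recs.extend(categories.get(interest, []))
    return recs

print(starter_feed(["gaming", "sports"]))
```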
Beyond aiding new users, it could recommend similar-but-not-the-same content, which encouraged a sort of “Reddit rabbit hole” of interesting content. This aided retention and time spent on the site, and gave users a better experience.
They were also able to begin testing different logistic regression models for the homepage at the same time, with ease. Now, Reddit is running around five different models at any given time, all to help determine which model serves users best.
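As a minimal sketch of what one such homepage model could look like, here is a scikit-learn logistic regression over made-up engagement features; the talk doesn’t specify the actual features or tooling:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per (user, post) pair:
# [post_score, post_age_hours, user_subscribed_to_subreddit]
X = np.array([
    [120,  2, 1],
    [  5, 30, 0],
    [ 80,  5, 1],
    [  2, 48, 0],
])
y = np.array([1, 0, 1, 0])  # 1 = user engaged (click/vote/comment), 0 = did not

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank candidate posts for the homepage by predicted engagement probability.
candidates = np.array([[60, 4, 1], [10, 20, 0]])
probs = model.predict_proba(candidates)[:, 1]
print(probs, np.argsort(probs)[::-1])
```

Running several variants of a model like this in parallel, behind an experimentation system, is what lets the team compare them on real traffic.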
What they learned
The main takeaway from all this history was the need to understand your data. Originally, they used a “kitchen sink” model that collected every type of data they could get their hands on, but that just muddled the results and made it difficult to understand what actually helped the platform.
Despite this success, machine learning at Reddit will always be striving for better. They’ve recently switched to TensorFlow models to serve their machine learning, and they’re working on making the HotPage even more relevant by moving to streaming infrastructure, where they can take user interactions and signals and provide more relevant content. In keeping with their community and information-sharing focus, this new version of the HotPage is based on meaningful discussions, instead of just memes with a lot of upvotes.
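A toy illustration of scoring posts on discussion signals with TensorFlow; the features, labels, and architecture here are assumptions for illustration only, not the model described in the talk:

```python
import numpy as np
import tensorflow as tf

# Hypothetical features per post:
# [upvotes, comments, max reply depth, unique commenters]
X = np.array([
    [300, 40, 6, 25],
    [900,  3, 1,  3],   # heavily upvoted meme, little discussion
    [150, 60, 8, 40],
    [ 50,  2, 1,  2],
], dtype=np.float32)
y = np.array([1, 0, 1, 0], dtype=np.float32)  # 1 = sparked meaningful discussion

normalizer = tf.keras.layers.Normalization()
normalizer.adapt(X)  # learn feature scaling from the data

model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=50, verbose=0)

# Higher scores suggest posts more likely to spark discussion, regardless of raw upvotes.
print(model.predict(np.array([[400, 50, 7, 30]], dtype=np.float32), verbose=0))
```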
By learning from its mistakes, the machine learning team at Reddit has been able to make consistently smarter decisions about its data and create a better user experience.