Summary
- Reddit bans AI firms from scraping its data for free, and demands licensing agreements for access.
- Google was the first major tech company to sign a data license agreement with Reddit, worth around $60 million per year.
- Reddit’s new data policy aims to protect user privacy and restrict access to data for commercial purposes.
Publicly available data on the internet is the primary source AI companies use to train their large language models and chatbots like ChatGPT and Google Gemini. When you type a query into an AI chatbot, the answer draws on data already available on the internet. That data is as accessible to AI companies as it is to regular users. As it turns out, that’s no longer the case for Reddit: the platform is banning AI firms from scraping its data for free.
Reddit’s move follows the company’s announcement last year about licensing its data to AI companies. In February, Google became the first major tech company to sign a data license agreement with Reddit, paying the social media company around $60 million per year.
Reddit announced its new “Public Content Policy” on Thursday as a guideline on how the platform shares its user data with other companies (via TechCrunch). Reddit also started a subreddit dedicated to researchers working with its data.
Reddit demands AI companies sign license agreements to access its data
Most of Reddit’s revenue stems from selling ads and from API usage by developers. Now a publicly traded company, Reddit needs more revenue streams to entice investors. Since the platform serves as a data aggregation center, it can make money by selling that data to customers, most notably the companies behind AI chatbots, such as Google and OpenAI. Reddit’s IPO prospectus indicated that the platform had signed data licensing deals with an aggregate contract value of $203 million, and the number is expected to grow.
It’s important to note that Reddit’s new policy on data scraping is primarily aimed at companies using the data for commercial purposes, such as training AI chatbots and large language models. However, the platform is committed to maintaining a space for researchers and non-commercial entities. Reddit’s data will still be freely available to these users, and the company has even established a dedicated subreddit, r/RedditForResearchers, to cater to their needs.
Reddit explained the approach in its announcement: “While we will continue to block known bad actors, we need to do more to restrict access to Reddit public content at scale to trusted actors who have agreed to abide by our policies. We also need to continue to ensure that users, mods, researchers, and other good-faith, non-commercial actors have access.”
Reddit’s new data policy is not just about restricting access to its data. It’s also about protecting user privacy. The platform emphasizes that users have the right to opt out of sharing their data with AI companies. Furthermore, Reddit partners are strictly prohibited from using content for spam or harassment, or for activities such as “background checks, facial recognition, government surveillance, or [to] help law enforcement do any of the above.” The policy is designed to ensure that user data is handled responsibly and with respect for users’ privacy.