Reddit, Slack, Google, Facebook, Instagram: these companies use our data, directly or indirectly, to train the next generation of AI language models. Yet I don’t remember anyone asking for our permission, and in doing so, these companies have proven the adage that their customers’ data is their real product.




For much of the internet generation, companies have offered products for free or at little cost to entice customers into their ecosystems. Products such as Gmail, YouTube, Facebook, Reddit, and others appear to be free but collect user data that can be used to serve ads or even sold off in aggregated bundles.

While these business models were once acceptable, the rapid advancement of AI has brought forth a much larger and more pressing issue that carries significant implications for the future of our privacy.


Understanding AI and LLMs




The current generation of AI is based on LLMs (large language models), which recognize, interpret, and generate human language. Built using machine learning, these models are trained on enormous datasets and can generate human-like text, recognize images, answer questions, and process audio and video in real time.

LLMs comprise three key parts: parameters, weights, and tokens. Parameters are the variables the model learns during training. Weights, the most important class of parameters, determine the strength of connections between the model’s internal units. Tokens are the basic units of input and output, i.e., the natural-language text, audio, and video we feed into an LLM and receive in response.

Think of a chef: a customer asks for a particular dish (the input tokens), and the chef puts a series of ingredients into a pan to create it. The finished dish is the output tokens, the specific mix of ingredients is the parameters, and the specific recipe represents the weights. Every chef can create that dish (assuming it’s a basic one), but to differing degrees of success, based on their knowledge, training, and experience.
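
To make those three terms concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration: the six-word cooking vocabulary, the hand-written weight values, and the next_token helper are all hypothetical, and a real LLM learns billions of weights from data rather than having them typed in by hand. Still, the flow matches the analogy: input tokens go in, the learned weights score the possibilities, and an output token comes back.

```python
# A toy next-token predictor. The vocabulary and weight values are
# invented for illustration; a real LLM learns its weights from data.

vocab = ["<pad>", "add", "salt", "pepper", "stir", "serve"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# The weights are the learned parameters: weights[i][j] scores how
# likely token j is to follow token i.
weights = [
    [0.0, 0.9, 0.1, 0.1, 0.2, 0.1],  # after <pad>
    [0.0, 0.1, 0.8, 0.6, 0.2, 0.1],  # after "add"
    [0.0, 0.2, 0.1, 0.5, 0.7, 0.2],  # after "salt"
    [0.0, 0.1, 0.2, 0.1, 0.8, 0.3],  # after "pepper"
    [0.0, 0.1, 0.1, 0.1, 0.2, 0.9],  # after "stir"
    [0.0, 0.2, 0.1, 0.1, 0.1, 0.1],  # after "serve"
]

def next_token(prompt: str) -> str:
    """Turn text into input tokens, apply the weights, return an output token."""
    ids = [token_to_id[word] for word in prompt.split()]  # text -> token IDs
    scores = weights[ids[-1]]                             # look up learned weights
    best = max(range(len(vocab)), key=lambda j: scores[j])
    return vocab[best]                                    # token ID -> text

print(next_token("add salt"))       # -> stir
print(next_token("add salt stir"))  # -> serve
```

A production model does the same thing at a vastly larger scale, with tens of thousands of tokens in its vocabulary and billions of weights, which is exactly why it needs so much training data to set those weights well.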



Consider this from the perspective of someone asking Gemini or GPT-4o for a recipe. An LLM can only learn this from its training data. The more recipes it has ingested (the equivalent of a chef having made the dish more times), the better it can predict how to make a tasty dish. The result is that the best-trained LLMs will give the best recommendations, especially when you list several ingredients and ask for a recipe.


We have a looming AI problem



The biggest problem with the above is the sheer volume of data required to train LLMs. Here are some examples: OpenAI used 1 million hours of YouTube video data to train GPT-4 (which is not its latest model; that’s GPT-4o). Google DeepMind used approximately 10 trillion words scraped from the web to train its Gemini model. Meta has used the images, videos, and texts you upload to its platforms to train its generative AI models.

However, it doesn’t end there: Google paid Reddit $60 million to scrape all of Reddit for its AI, and Reddit quickly became one of the primary sources for the AI Overviews feature. Yet, to Google’s detriment, the AI lost resoundingly in the battle of AI vs. human internet users. Just ask anyone who Googled glue on pizza or how many rocks to eat.



That money went to Reddit, and the deal likely came about because so many popular search terms are followed by the word “Reddit” as users look for the human answer. Yet none of the millions of users on Reddit will see any of that money, which is especially strange given that it is those users who worked for free to build the platform Reddit now monetizes and capitalizes upon.

Reddit is just one example of a company exploiting its users’ data. Meta owns some of the world’s largest platforms: Facebook, Instagram, and WhatsApp. Elon Musk is training xAI’s Grok on X (formerly Twitter), one of the biggest real-time information sources. None of these companies pay users for this data, and many also push users to sign up for subscriptions, which means users are paying to hand their data over to these companies; yet none of these subscriptions lets you opt out of your data being used.


You could argue that all these platforms are free and your data is fair game. I somewhat agree when you’re not paying for the platform, but what about when you’re paying, and you’re still the product?

This is where we should draw the line. The inspiration behind this post? Slack — a business-focused service that requires a paid subscription for many of its core features — is training its AI using company data, much of which is likely quite sensitive.


When is enough enough?




This leads to a further question: when should we say “Enough is Enough”? We’ve already seen Google Gemini create an AI teammate; although created under the guise of reducing friction and improving communication between different teams, it’s easy to envision it evolving to replace full-time jobs. Google’s AI Overviews are also destroying the role of journalists and fact-checkers, although, as a lawsuit by many publishers suggests, this erosion started long ago with Google’s other business practices.

Companies using our data for their benefit without compensating us isn’t new. Lou Montulli created the browser cookie in 1994, and within a year, ads targeting specific consumer demographics had become the norm. For over two decades, digital customer privacy wasn’t a priority, and without GDPR (an EU regulation that took effect in 2018), we’d likely still have no notion of privacy online. Instead, we now have companies monetizing user data by ingesting everything you’ve ever posted on the web to train their AI.



AI will inevitably transform our digital lives, and not necessarily in a good way. Although companies like OpenAI have struck deals with large publishers (with large budgets) like Vox Media, most people won’t benefit. Instead, everyday users will still be the product.

The solution seems straightforward: find a way to compensate users. But given that Google, Meta, and others have threatened to stop serving content in specific states and countries rather than pay publishers, there’s little to no chance of these companies paying users for their data. So, if we aren’t going to be reimbursed for the knowledge these multinational corporations use to profit so bigly, then, as the headline of this article states, companies need to stop using our personal data to train AI. Because if we continue down the current path, the only ones left to make the free content and data we consume will be the very corporations that stole ours.