AI is Hungry: Data Depletion and the Future of Artificial Intelligence
What I Learned Writing a 600-Chapter Novel
I'm currently writing a 600-chapter martial arts novel, "Eternal Life Fire Emperor Technique" – a project that would be impossible to manage alone. Anyone who's written a long-form novel knows: consistency is the hardest part. You plant a seed in chapter 3 that must pay off in chapter 300. A supporting character's power level mentioned in chapter 50 needs to make sense in chapter 400. Human memory has its limits.
So I use AI. Claude, Grok. They've become my collaborators. I build the skeleton. "In this episode, the protagonist awakens a secret technique. It needs to connect to the master's dying words from chapter 50. And it should flow naturally from that sword technique in chapter 150." Then AI writes the draft. And I refine it. I smooth the sentences, adjust the emotional beats, and inject my voice.
The result? Since this is my first work, I'm focused more on completion than literary excellence, but the content is surprisingly good. Still, my hand is essential. AI maintains perfect logical consistency, but misses subtle emotional nuances. A character's tone might feel slightly off, or a tense scene might fall flat. So I always read through again, revise, and add my touch.
My blog works the same way. I don't speak English well, but I run an English blog – one designed to be discovered by AI – and it gets 60-100 daily visitors. I cover investment analysis, AI trends, and economic issues, all created through AI collaboration. I develop ideas in conversations with Grok, then hand the editing to Claude. My efficiency has doubled, and quality has improved.
But lately, one question has been nagging at me: "How long can these AIs stay this smart?" And that question led me deeper. What fuels AI? Data. So what happens when that data runs out?
The Food is Disappearing from AI's Table
In 2024, experts started talking about 'data depletion.' I didn't believe it at first. The internet churns out infinite content daily – how could AI run out of food? But looking closer, the story was different.
According to a 2024 report by Epoch AI, the current stock of high-quality human-generated text data on the internet is about 300 trillion tokens [1]. A token is the minimum unit AI uses to understand text – roughly 0.75 words. 300 trillion tokens sounds enormous, right?
The problem is AI's appetite. Training a model like GPT-4 consumes about 13 trillion tokens. And the next generation? It eats more. When scale increases 10x, data needs increase 10x. At this rate, researchers estimate high-quality internet data could be depleted between 2026 and 2032 – that's 1 to 7 years from now.
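To get a feel for these numbers, here's a back-of-the-envelope calculation in Python. It's a rough sketch, not a forecast: the 300-trillion-token stock and the ~13 trillion tokens per GPT-4-class run come from the figures above, and the assumption that each generation needs about 10x the data of the last is the same simplification used in this article.

```python
# Back-of-the-envelope: how many more model generations can the stock feed?
# The 300T-token stock and ~13T tokens per GPT-4-class run are the article's
# figures; 10x data growth per generation is an assumed simplification.
STOCK_TOKENS = 300e12   # high-quality human text available
GEN_TOKENS = 13e12      # tokens consumed by one GPT-4-class training run
GROWTH = 10             # assumed data growth per generation

used, need, generation = 0.0, GEN_TOKENS, 1
while used + need <= STOCK_TOKENS:
    used += need
    print(f"Generation {generation}: {used / 1e12:.0f}T of 300T tokens consumed")
    need *= GROWTH
    generation += 1

print(f"Generation {generation} would need {need / 1e12:.0f}T tokens on its own "
      "- more than everything that's left.")
```

Under those assumptions, the stock feeds only a couple more generations before the next step alone would need more text than exists.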
What's worse, web data is getting harder to access. Since 2023, many websites have tightened their terms of service: "Don't train AI on our data." Sites like Reddit, Twitter (X), and The New York Times started blocking crawlers from OpenAI and Anthropic. By one estimate, access to public web data has dropped to about 40% of what it was [2].
Goldman Sachs' Chief Data Officer warned in early 2025: "Web data is already exhausted. AI companies are now filling their models with 'slop'" [2]. Here, slop means synthetic data – data that AI itself created.
How Does AI Actually Learn?
There's something crucial to understand here. AI doesn't 'learn' like we do. It doesn't remember things either. AI compresses patterns.
Consider this example. You type "when the sun rises, it gets" and AI responds "brighter." This isn't because AI 'understands' the relationship between sun and brightness. During training, AI read hundreds of billions of sentences and learned the statistical pattern that "brighter" has a high probability of following "when the sun rises."
This method is remarkably effective. But it has fundamental limits. AI cannot create patterns that aren't in its training data. If the training data only had "when the sun rises, it gets darker" repeated a million times, AI would answer that way. Regardless of truth.
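The point about pattern compression can be made concrete with a toy model. The sketch below builds a tiny table of "which word tends to follow which" from a made-up corpus, then predicts by picking the most frequent continuation. Real language models use neural networks with billions of parameters rather than a lookup table, but the core idea – predicting the next token from statistics of the training data – is the same.

```python
from collections import Counter, defaultdict

# Toy 'training corpus' (invented for illustration). A real model sees
# hundreds of billions of sentences; the principle is the same.
corpus = [
    "when the sun rises it gets brighter",
    "when the sun rises it gets brighter",
    "when the sun rises it gets warmer",
]

# Count which word follows each word (a simple bigram table).
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word seen in training."""
    if word not in follows:
        return None  # the model cannot produce a pattern it never saw
    return follows[word].most_common(1)[0][0]

print(predict_next("gets"))   # -> 'brighter' (seen in 2 of 3 examples)
print(predict_next("moon"))   # -> None: never appeared in the training data
```

If the corpus had only ever said "darker," the toy model would answer "darker" – exactly the limitation described above, regardless of truth.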
That's why data 'quality' matters. What is high-quality human data? Text that is logically consistent, fact-based, and contains diverse contexts. Books, papers, news articles, expert writing. These things.
What I think matters most here is 'diverse contexts.' Just as we need balanced nutrition from varied foods, AI needs variety in its data.
The problem is AI companies are now struggling to access this high-quality data. So they're seeking alternatives. First, use more low-quality data. Second, reuse the same data repeatedly (overtraining). Third, create synthetic data.
And all three cause problems.
When Food Goes Bad: Model Collapse
Synthetic data is data created by AI. For example, if you ask GPT "write a story about a medieval knight," the text GPT produces becomes synthetic data. This text is then used to train the next generation of AI.
Sounds like a good idea at first. You can generate infinite data. But there's a fatal trap here. Researchers call it 'Model Collapse.'
Here's an analogy I like. Take a photo, photocopy it, then photocopy that copy, and keep copying... What happens? It gets progressively blurrier, colors fade, details disappear. AI works the same way.
When AI learns from human data, it learns human diversity. Some people write formally, others casually. Some writing is humorous, some serious. This diversity makes AI 'human-like.'
But when AI learns from AI-generated data? It only reinforces the average patterns it already knew. Rare expressions, creative metaphors, unique styles gradually vanish. Output becomes increasingly repetitive, generic, and cliché-ridden.
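Here is a minimal numerical sketch of that photocopy effect. It fits a very simple "model" (just the mean and spread of the data), has each generation train only on output sampled from the previous one, and crudely stands in for "the model over-produces typical text and under-produces rare text" by clipping the tails. The specific numbers are illustrative, not from the cited research, but the steadily shrinking spread is the signature of model collapse: the rare, unusual cases disappear first.

```python
import random
import statistics

random.seed(0)

# Generation 0: 'human' data with real diversity (a wide spread of styles).
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

for generation in range(6):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    print(f"Generation {generation}: diversity (std dev) = {sigma:.3f}")
    # The next generation trains only on the previous model's own output.
    # Clipping at +/- 1.5 sigma is a crude stand-in for the model favoring
    # its most typical outputs and dropping rare ones.
    data = [min(max(random.gauss(mu, sigma), mu - 1.5 * sigma),
                mu + 1.5 * sigma)
            for _ in range(2000)]
```

Run it and the measured diversity drops every generation – the statistical equivalent of the blurry photocopy.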
According to Stanford research, each 10% increase in synthetic data ratio can degrade model performance by 20-30% [2]. And in 2025, we're already seeing these signs in some AI outputs. Repetitive phrases, overly safe answers, lack of creativity.
I notice it when writing my novel. When I repeatedly ask AI for similar scenes, at some point the sentences start sounding alike – they become short and monotonous. I have to enrich and enhance those parts myself.
Will Making It Bigger Solve the Problem?
So what if we just make models bigger? Increase parameters, boost computing power? This has been the core strategy of AI development for the past decade. The so-called 'scaling law.'
2018 GPT-1: 117 million parameters
2019 GPT-2: 1.5 billion parameters
2020 GPT-3: 175 billion parameters
2023 GPT-4: roughly 1.7 trillion parameters (unpublished estimate)
Each step was roughly a 10x jump in size, and capability jumped with it. But when parameters increase 10x, data needs increase roughly 10x too. And we're already short on data.
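To put the data side of that in numbers: a common rule of thumb from DeepMind's Chinchilla work (an assumption I'm adding here, not a figure from this article) is that compute-optimal training wants roughly 20 tokens per parameter. The sketch below applies that rule to the parameter counts listed above and compares the result with the 300-trillion-token stock.

```python
# Rough data requirements under the Chinchilla-style rule of thumb of
# ~20 training tokens per parameter (an assumption; real training mixes vary).
TOKENS_PER_PARAM = 20
STOCK_TOKENS = 300e12  # ~300T tokens of high-quality text (Epoch AI estimate)

models = {
    "GPT-2 (1.5B params)":     1.5e9,
    "GPT-3 (175B params)":     175e9,
    "GPT-4 (~1.7T params)":    1.7e12,
    "Hypothetical 10x GPT-4":  17e12,
}

for name, params in models.items():
    tokens_needed = params * TOKENS_PER_PARAM
    share = tokens_needed / STOCK_TOKENS * 100
    print(f"{name}: ~{tokens_needed / 1e12:.1f}T tokens "
          f"({share:.1f}% of the stock)")
```

Under that rule of thumb, a hypothetical 10x-GPT-4 model would want more tokens than the entire high-quality stock contains.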
Anthropic CEO Dario Amodei estimated in a 2024 interview that data depletion has about a 10% chance of halting AI progress [2]. That might sound low, but it's a risk that could shake the entire AI industry.
What's more serious is that simply increasing size can't reach true intelligence. Meta's AI research chief Yann LeCun said in early 2025: "Continual learning alone can't even create cat-level intelligence. True intelligence comes from world models and hierarchical planning" [9].
What does this mean? Current AI only learns text patterns. But real intelligence must understand how the world works. Physical laws, causation, the flow of time. And it must be able to plan long-term and adapt.
Is the Path to AGI Blocked?
AGI – Artificial General Intelligence. General-purpose AI that can perform diverse tasks like humans. Many people talk about AGI as the next goal. Some argue that data depletion might actually accelerate the transition to AGI.
The logic goes like this: If AI can't consume more data, it must learn to teach itself. Learn from experience like humans, generate knowledge independently, run self-improvement loops.
OpenAI's o1 and o3 series are attempts in this direction. These models use something called 'test-time compute.' Instead of immediately producing an answer, they internally explore multiple paths, verify them, then output the best answer. Like how humans try various approaches when solving difficult problems.
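OpenAI hasn't published how o1 and o3 reason internally, so the sketch below is only a generic illustration of the test-time-compute idea: instead of accepting the model's first answer, sample several candidates, score each with a verifier, and return the best. The `generate_candidate` and `verify` functions here are placeholders standing in for a model call and a checker.

```python
import random
from typing import Callable

def best_of_n(generate_candidate: Callable[[], str],
              verify: Callable[[str], float],
              n: int = 8) -> str:
    """Generic test-time-compute loop: spend extra inference compute by
    sampling n candidate answers and keeping the one the verifier scores highest."""
    candidates = [generate_candidate() for _ in range(n)]
    return max(candidates, key=verify)

# Toy demo with placeholder functions (a real system would call an LLM
# here and use a learned or rule-based verifier).
def generate_candidate() -> str:
    return random.choice(["answer A", "answer B", "answer C"])

def verify(answer: str) -> float:
    scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
    return scores[answer]

print(best_of_n(generate_candidate, verify, n=8))  # almost always 'answer B'
```

Each extra candidate is another full model call, which is one reason inference costs climb so quickly with this approach.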
But Ilya Sutskever, OpenAI's co-founder, admitted frankly in early 2025: "o1 and o3 successfully bypassed the data depletion problem. But the path to ASI (artificial superintelligence) is blocked" [10].
Why blocked? Energy. Reasoning models like o3 consume enormous computing power. A single complex inference can use on the order of 100 times more energy than a standard GPT-4 query. This isn't sustainable.
And benchmark scores are showing limits too. There's a test called ARC-AGI created by François Chollet. It measures true reasoning ability, not simple memorization. The latest 2025 models score around 45-55%, while humans score 85-90% [12]. The gap isn't closing.
What About Using User Data?
So what about our data? My posts on LinkedIn, my conversations with ChatGPT, my Google search history. Can't they use this for training?
Actually, many companies already do. LinkedIn, OpenAI, and Google all have clauses in their terms of service saying "we can use your data for AI training." They offer opt-out options, but most people don't know about them and skip right past.
There are two problems. First, quality. Average user chats and social media posts aren't as systematic as professional writers' books or scientists' papers. They contain spelling errors, grammar mistakes, factual errors.
Second, privacy and hallucination. According to Stanford research, training on data mixed with sensitive personal information increases AI hallucinations by 15-25% [2]. AI learns "John's email is john@email.com," then later guesses "Jane's email is probably jane@email.com."
This worries me too. I share my novel drafts, blog drafts, and investment notes with AI. Is all of that stored somewhere and used to train other people's models? Even so, I keep sharing sensitive material, because I think the chance that my information could be distorted or traced back to me is extremely low. What meaning could a few pieces of my data have mixed among millions of people's data? Giving up this convenience would be the bigger loss.
Where is the Future Heading?
So what will AI's future look like? I don't think we'll see extreme scenarios. AI won't suddenly stop, nor will it explosively reach superintelligence.
Instead, gradual change will come. First, AI development will slow. After 2025, the model performance improvement curve will flatten. The era of scaling laws ends, and the era of efficiency begins.
Second, specialized AIs will proliferate. Instead of general-purpose AI, we'll see many small models specialized for specific fields. Medical AI, legal AI, coding AI. These achieve high performance in their domains with less data.
Third, human-AI collaboration becomes standard. Like how I write my novel – AI creates the draft, humans refine it. This is the most efficient and realistic approach.
From an investment perspective, let me add one thing: the energy sector deserves more attention than AI company stocks. According to the IEA (International Energy Agency), data center electricity demand is projected to double between 2025 and 2030 [6]. Whether AI slows or accelerates, computing will keep growing and electricity will keep being needed. But this is a personal observation, not investment advice.
Closing: AI is Still Hungry
Writing this long-form novel taught me how to work with AI. AI is an amazing tool, but not perfect. And it probably won't be perfect for a while.
Data depletion is a crisis, but also potentially an opportunity. It will force AI companies to think about what real intelligence is, instead of just scaling up. And we humans need to learn how to use AI as a tool, not just depend on it.
My novel would have been impossible without AI. But it never would have been completed with AI alone. AI can build the skeleton, but adding flesh and breathing in soul remains human work.
AI is hungry. But the quality of food we give it matters. And perhaps hunger will make AI smarter. Don't people get more creative when they're hungry?
The future is neither utopia nor dystopia. It's just change. What matters is the choices we make within that change.
Investment Disclaimer: Investment-related content mentioned in this article represents personal analysis and opinion, not financial advice. Investment decisions should be made based on your own judgment and responsibility.
References
[1] Epoch AI - "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data" (2024) A 2024 report by Epoch AI showing that high-quality text data on the internet is approximately 300 trillion tokens, and if current AI training trends continue, data depletion could occur between 2026-2032. This research analyzed various data sources including web text, books, and academic papers to quantify the physical limits of AI scaling. Source: https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
[2] Goldman Sachs - "AI: Too Much Spend, Too Little Benefit?" (2025) A 2025 AI industry analysis report by Goldman Sachs addressing data depletion and the risks of synthetic data. It notes that web data accessibility has decreased to 40%, citing a data officer's "slop" comment, and highlights the industry's structural risks. Also includes Stanford research data showing that a 10% increase in synthetic data can cause 20-30% performance degradation. Source: https://www.goldmansachs.com/pdfs/insights/goldman-sachs-research/ai-in-a-bubble/report.pdf
[3] UBS - "Global AI Capital Expenditure to Exceed $500 Billion" (2025) A 2025 UBS forecast report on AI infrastructure investment, predicting global AI-related capital expenditure will reach $423 billion in 2025. It analyzes corporate investment plans for data centers, semiconductors, and cloud infrastructure, while raising concerns about overheated investment. Source: https://finance.yahoo.com/news/ai-capex-exceed-half-trillion-093015889.html
[4] McKinsey Global Survey - "The State of AI in 2025" McKinsey's survey of 1,500 corporate executives worldwide on AI adoption, with 65% reporting 20-30% productivity improvements from AI adoption. However, 55% of companies cite data quality and ethical issues as major obstacles. Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
[5] Stanford HAI - "AI Index Report 2025" Stanford University's Human-Centered AI Institute annual AI trends report, revealing that generative AI investment reached $33.9 billion in 2024, an 18.7% increase from the previous year. It comprehensively covers AI industry data including research paper publications, corporate adoption rates, and ethical discussions. Source: https://hai.stanford.edu/ai-index/2025-ai-index-report
[6] IEA - "Electricity 2025: Data Center Demand" The International Energy Agency's 2025 electricity demand forecast, predicting that data center power consumption will double between 2025-2030. The main driver is increased computing power needed for AI training and inference, which will place significant burden on global power grids. Source: https://www.iea.org/reports/electricity-2025/demand
[7] Deloitte - "2026 Power and Utilities Industry Outlook" Deloitte's power and utility industry outlook report, predicting that grid modernization investment due to AI and data center demand growth will reach $3.6-6 billion in 2026. Key investment areas include renewable energy integration, smart grid technology, and energy storage systems. Source: https://www.deloitte.com/us/en/insights/industry/power-and-utilities/power-and-utilities-industry-outlook.html
[8] BloombergNEF - "Global Renewable Energy Investment 2025" Bloomberg New Energy Finance's 2025 renewable energy investment report, revealing that global renewable energy investment reached $386 billion, a 10% increase from the previous year. Major investment areas include solar, wind, and battery storage systems, with AI data center electricity demand being one factor driving investment expansion. Source: https://about.bnef.com/insights/clean-energy/global-renewable-energy-investment-reaches-new-record-as-investors-reassess-risks/
[9] Yann LeCun - "The Limits of Continual Learning" (Wall Street Journal, 2025) Meta AI research chief Yann LeCun's views from a Wall Street Journal interview, pointing out that current continual learning approaches alone cannot even implement cat-level intelligence. He emphasizes that true AI advancement requires World Model and Hierarchical Planning capabilities. Source: https://www.wsj.com/tech/ai/yann-lecun-ai-meta-0058b13c
[10] Ilya Sutskever - "A Look Ahead: 2025 and Beyond" (The FAI, 2025) An article by OpenAI co-founder Ilya Sutskever for The Future of AI Institute, acknowledging that OpenAI's o1/o3 series successfully bypassed the data depletion problem, but the path to ASI (artificial superintelligence) remains unclear. It provides a balanced analysis of the possibilities and limitations of test-time compute. Source: https://www.thefai.org/posts/2025-a-look-ahead
[11] DeepMind - "Taking a Responsible Path to AGI" (2025) Google DeepMind's AGI development roadmap and ethical guidelines document, covering the step-by-step approach to reaching AGI and the safety and ethical issues to consider at each stage. It emphasizes the importance of responsible AI development, arguing that safety and transparency should be pursued alongside performance. Source: https://deepmind.google/blog/taking-a-responsible-path-to-agi/
[12] François Chollet - "ARC-AGI Benchmark Results 2025" The 2025 results of the ARC-AGI benchmark developed by François Chollet, creator of the Keras library, showing that the latest AI models score around 45-55% while humans score 85-90%, demonstrating that a significant gap still exists. This benchmark is designed to measure true abstract reasoning ability, not simple memorization. Source: https://arcprize.org/arc-agi