The training data gold rush hit a wall. Frontier labs spent years hoovering up everything on the public internet (Wikipedia, Reddit, GitHub, Common Crawl), and now the well is functionally dry. GPT-4 was reportedly trained on roughly 13 trillion tokens of text. The entire crawlable web is estimated at around 5 to 8 trillion tokens. The math doesn't work anymore. You can't build a better model by scraping harder.
Meanwhile, the next generation of AI problems (voice agents that understand regional dialects, robotics models that watch humans manipulate objects, video models that need to understand unscripted human behavior) requires structured, real-world, rights-cleared data that doesn't exist on the internet at all. Nobody uploaded a video of themselves loading a dishwasher with full provenance records attached. Nobody consented to having their Urdu dialect used to train a speech model.
