In 2006, Fei-Fei Li, then at the University of Illinois and now at Stanford, realized that mining the Internet's data could help transform AI research. Linguistic research had identified roughly 80,000 nominal "synsets": groups of synonyms describing the same type of thing. The billions of images on the Internet, Li surmised, could offer hundreds of examples of each synset. If enough of them were gathered, the result would be a training resource for AI far superior to anything available before. "Many people pay attention to the models," she said. "Let's pay attention to the data." The result was ImageNet.
The Internet offered not only images but also the means to label them. After search engines supplied photos of what they considered dogs, cats, chairs and so on, those images were inspected and annotated by people hired through Mechanical Turk, an Amazon-owned crowdsourcing platform that pays workers to perform routine tasks. The result was a database of millions of curated, verified images.
Using part of ImageNet for training, a program called AlexNet demonstrated in 2012 the remarkable potential of "deep learning": neural networks with many more layers than had been used before. It marked the beginning of the rise of AI, and of a labeling industry built to supply it with training data.
The subsequent development of large language models (LLMs) also depended on Internet data, but in a different way. The classic LLM training exercise is not to predict which word best describes the content of an image, but to predict a word deleted from a passage of text based on the words around it.
That kind of training needs no labeled, verified data; the system can hide words itself, make guesses and check its own answers in a process known as "self-supervised training". It does, however, need a lot of data: the more text you give the system to train on, the better it performs. Since the Internet offers hundreds of trillions of words of text, it became for LLMs what the eons of carbon randomly deposited in geological sediments have been for modern industry: raw material that can be refined into a miracle fuel.
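The self-supervised objective described above can be illustrated with a toy sketch (purely illustrative; real LLMs work on token sequences and predict probability distributions, not single words):

```python
import random

def mask_example(text, rng=random.Random(0)):
    """Turn a sentence into a self-supervised training example:
    hide one word and ask the learner to recover it from the
    surrounding context. No human labeling is needed."""
    words = text.split()
    i = rng.randrange(len(words))
    target = words[i]          # the "answer" comes from the data itself
    words[i] = "[MASK]"
    return " ".join(words), target

context, answer = mask_example("the cat sat on the mat")
```

The key point is that every sentence on the Internet yields training examples for free: the text supplies both the question and the answer.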
The “data wall”
In AI research, use of Common Crawl, an archive of much of the open Internet containing 50 billion web pages, became widespread. More recent models have been supplemented with data from a growing number of other sources, such as Books3, a widely used collection of thousands of books. But the machines' appetite for text has now grown faster than the Internet can supply it. Epoch AI, a research firm, estimates that by 2028 the entire stock of high-quality textual data on the Internet will have been used. In the industry this is known as the "data wall". How to deal with it is one of the great unknowns of AI, and perhaps the one most likely to slow its progress.
One approach is to focus on the quality of the data rather than its quantity. AI labs don't simply train their models on the entire Internet. They filter and sequence the data to maximize learning. Naveen Rao of Databricks, an AI company, says this is the "main differentiator" between AI models on the market. "Truthful information" about the world is clearly important; so is plenty of "reasoning". That makes academic textbooks, for example, especially valuable. But striking the right balance between data sources remains something of an arcane art. Moreover, the order in which the system encounters different types of data matters too. Lump all the data on one topic, such as mathematics, at the end of training, and the model may specialize in mathematics but forget other concepts.
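The sequencing point can be sketched as weighted interleaving of sources. The source names and weights below are hypothetical; the real mixes used by labs are closely guarded:

```python
import random

# Hypothetical source weights; actual proportions used by AI labs are secret.
SOURCES = {"web": 0.5, "books": 0.4, "math": 0.1}

def sample_mix(n, rng=random.Random(42)):
    """Interleave sources throughout training, rather than saving one
    topic for the end, which risks the model forgetting the rest."""
    names = list(SOURCES)
    weights = [SOURCES[s] for s in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_mix(10)
```

Every batch then contains a blend of topics, so no subject dominates any single stretch of training.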
These considerations become even more complex when the data differ not only in topic but also in form. Owing partly to the lack of new textual data, leading models such as OpenAI's GPT-4o and Google's Gemini now also train on image, video and audio files during self-supervised learning. Training on video is the hardest, given the sheer density of data points in the files. Current models typically look at a subset of frames to simplify things.
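The frame-subsetting trick mentioned above amounts to even subsampling. A minimal sketch (the budget of eight frames is an arbitrary illustration, not any lab's actual setting):

```python
def subsample_frames(frames, max_frames=8):
    """Keep a small, evenly spaced subset of frames so that video
    clips do not overwhelm the training budget."""
    if len(frames) <= max_frames:
        return list(frames)
    step = len(frames) / max_frames
    return [frames[int(i * step)] for i in range(max_frames)]

# A 10-second clip at 24 fps has 240 frames; keep only 8 of them.
kept = subsample_frames(list(range(240)), max_frames=8)
```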
Whatever data the models use, ownership is increasingly recognized as a problem. Much of the material used in LLM training is protected by copyright and is used without the consent of, or payment to, the rights holders. Some of it even sits behind paywalls. The model-makers claim that such behavior falls within the "fair use" exemption of US copyright law. Reading copyrighted material, they say, should be allowed when training AI models, just as it is for humans. But as technology analyst Benedict Evans has put it, "a difference of scale" can lead to "a difference of principle."
Different rights holders are adopting different tactics. Getty Images has sued Stability AI, an image-generation company, for unauthorized use of its image bank. The New York Times is suing OpenAI and Microsoft for copyright infringement over millions of articles. Other newspapers have struck deals to license their content: News Corp, owner of The Wall Street Journal, signed one worth $250 million over five years. (The Economist has not commented on its relationship with AI companies.) Other sources of text and video are doing the same. Stack Overflow, a question-and-answer site for programmers, Reddit, a social-media site, and X (formerly Twitter) now charge for access to their content for training.
The situation differs across jurisdictions. Japan and Israel take a permissive stance to promote their AI sectors. The European Union lacks a generic "fair use" concept, so it could prove stricter. Where markets emerge, different types of data will command different prices: models will need access to timely, real-world information to stay current.
A model's capabilities can also be improved by feeding the version produced by self-supervised learning (known as the pre-trained version) additional data in a post-training phase. "Supervised fine-tuning", for example, involves feeding a model question-and-answer pairs collected or written by people. This teaches models what good answers look like. "Reinforcement learning from human feedback" (RLHF), by contrast, tells them whether an answer satisfied the questioner (a subtly different matter).
In RLHF, users give a model feedback on the quality of its output, which is then used to adjust its parameters, or "weights". User interactions with chatbots, such as a thumbs-up or thumbs-down, are especially useful here. This creates what practitioners call a "data flywheel": more users generate more data, which feeds back into fine-tuning the model. AI startups watch closely what kinds of questions users ask their models, then collect data to fine-tune them on those topics.
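In its crudest form, the flywheel turns user votes into a preference signal. A toy sketch (the answer identifiers and vote data are invented; real RLHF trains a reward model rather than tallying votes directly):

```python
from collections import defaultdict

# Toy flywheel: thumbs-up (+1) and thumbs-down (-1) votes from users
# become a reward signal steering the model toward well-rated answers.
feedback = [
    ("answer-a", +1),
    ("answer-a", +1),
    ("answer-b", -1),
    ("answer-a", -1),
]

scores = defaultdict(int)
for answer_id, vote in feedback:
    scores[answer_id] += vote

preferred = max(scores, key=scores.get)  # the answer users liked most
```

The more users interact, the more votes accumulate, and the sharper this preference signal becomes.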
Scaling up
As pre-training data on the Internet runs out, post-training is becoming more important. Labeling companies such as Scale AI and Surge AI earn hundreds of millions of dollars a year collecting post-training data; Scale recently raised $1 billion at a $14 billion valuation. Things have moved on since the days of Mechanical Turk: the best labelers earn up to $100 an hour. But while post-training helps produce better models and is sufficient for many commercial applications, it is ultimately only incremental.
Rather than slowly pushing back the data wall, another option is to leap over it entirely. One approach is to use synthetic data, which is machine-created and therefore unlimited. A good example is AlphaGo Zero, a model built by DeepMind, a Google subsidiary. The company's first successful Go-playing model was trained on data from millions of moves in amateur games. AlphaGo Zero used no pre-existing data at all. Instead, it learned Go by playing 4.9 million games against itself over three days, noting the winning strategies. That "reinforcement learning" taught it how to respond to an opponent's moves by simulating a large number of possible replies and choosing the one with the best chance of winning.
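The self-play idea can be sketched on a far simpler game than Go. The toy below uses a Nim variant and a random policy, and "notes winning strategies" by counting which moves appeared in winning games; this is a deliberately crude stand-in for AlphaGo Zero's actual training, which uses neural networks and tree search:

```python
import random

def play_selfplay_game(policy, rng):
    """One self-play game of a toy Nim variant: players alternately
    remove 1-3 stones from a pile of 10; whoever takes the last
    stone wins. Returns the winner and each player's moves."""
    stones, turn, moves = 10, 0, {0: [], 1: []}
    while stones > 0:
        move = policy(stones, rng)
        moves[turn].append((stones, move))
        stones -= move
        turn = 1 - turn
    return 1 - turn, moves  # the player who just moved took the last stone

def random_policy(stones, rng):
    return rng.randint(1, min(3, stones))

# Self-play loop: tally how often each (position, move) pair occurred
# in a winning game -- the crudest way to "note winning strategies".
wins = {}
rng = random.Random(0)
for _ in range(2000):
    winner, moves = play_selfplay_game(random_policy, rng)
    for pos, mv in moves[winner]:
        wins[(pos, mv)] = wins.get((pos, mv), 0) + 1
```

A real system would feed these statistics back into an improved policy and repeat, so the data it learns from is entirely self-generated.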
A similar approach could be used to have an LLM write, say, a mathematical proof step by step. The LLM might construct an answer by first generating many candidate first steps. A separate "helper" AI, trained on data from human experts to judge quality, would identify which candidate is best and worth building on. That AI-produced feedback is a form of synthetic data, and it can be used to train the first model further. Eventually you might get a higher-quality answer than if the LLM had responded all at once, and a much-improved LLM besides. This ability to improve output quality by taking more time to think resembles the slow, deliberative "system 2" thinking in humans, described in a recent talk by Andrej Karpathy, a co-founder of OpenAI. Today's LLMs employ "system 1" thinking, generating a response without deliberation, much as a person answers reflexively.
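The generate-then-judge pattern is sometimes called best-of-n selection, and its skeleton is short. In this sketch the "proposals" are just random numbers and the helper's scoring function is trivial; both are placeholders for a real generator model and a real learned judge:

```python
import random

def propose_steps(prompt, n, rng):
    """Stand-in for an LLM proposing n candidate next steps; here the
    'steps' are just random scores, purely for illustration."""
    return [rng.random() for _ in range(n)]

def helper_score(step):
    """Stand-in for a 'helper' model trained on expert data to judge
    step quality."""
    return step  # pretend a higher value means a more promising step

def best_of_n(prompt, n=8, seed=1):
    """Generate n candidates and keep the one the helper rates best."""
    rng = random.Random(seed)
    candidates = propose_steps(prompt, n, rng)
    return max(candidates, key=helper_score)

best_step = best_of_n("prove the claim step by step", n=8)
```

The selected step, paired with the helper's verdict, is exactly the kind of synthetic training signal the article describes.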
The difficulty lies in extending this approach to settings such as health care or education. In games, winning is clearly defined, and it is easy to collect data on whether a move is advantageous. Elsewhere it is trickier. Data on what counts as a "good" decision are usually gathered from experts. But that is expensive and time-consuming, and only a partial solution. And how do you know whether a given expert is right?
It is clear that access to more data – drawn from specialized sources, generated synthetically or supplied by human experts – is key to maintaining AI's rapid progress. Like oilfields, the most accessible reserves of data have been depleted. The challenge now is to find new ones, or sustainable alternatives.
© 2024 The Economist Newspaper Limited. All rights reserved. Translation: Juan Gabriel López Guix