In 2006, Fei-Fei Li, then at the University of Illinois and now at Stanford, realized that mining the Internet's data could help transform AI research. Linguistic research had identified roughly 80,000 nominal "synsets": groups of synonyms describing the same type of thing. The billions of images on the Internet, Li surmised, could offer hundreds of examples of each synset. If enough of them were gathered, the result would be a training resource for AI far superior to anything available before. "Many people pay attention to the models," she said. "Let's pay attention to the data." The result was ImageNet.
The Internet offered not only images but also the means to label them. After search engines supplied photos of what they considered dogs, cats, chairs and so on, those images were inspected and annotated by people hired through Mechanical Turk, an Amazon-owned crowdsourcing platform that pays workers to perform routine tasks. The result was a database of millions of curated, verified images.
Using part of ImageNet for training, a program called AlexNet demonstrated in 2012 the remarkable potential of "deep learning": neural networks with many more layers than had been used before. It marked the beginning of the rise of AI, and of a labeling industry built to supply it with training data.
The subsequent development of large language models (LLMs) also depended on Internet data, but in a different way. The classic LLM training exercise is not to predict which word best describes the content of an image, but to predict a word deleted from a passage of text based on the words around it.
That kind of training needs no labeled, verified data; the system can hide words itself, make guesses and check its own answers in a process known as "self-supervised training". It does, however, need a lot of data: the more text you give the system to train on, the better it performs. Since the Internet offers hundreds of trillions of words of text, it became for LLMs what the eons of carbon randomly deposited in geological sediments have been for modern industry: raw material that can be refined into a miracle fuel.
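The self-supervised objective described above can be illustrated with a toy sketch (purely illustrative; real LLMs work on token sequences and predict probability distributions, not single words):

```python
import random

def mask_example(text, rng=random.Random(0)):
    """Turn a sentence into a self-supervised training example:
    hide one word and ask the learner to recover it from the
    surrounding context. No human labeling is needed."""
    words = text.split()
    i = rng.randrange(len(words))
    target = words[i]          # the "answer" comes from the data itself
    words[i] = "[MASK]"
    return " ".join(words), target

context, answer = mask_example("the cat sat on the mat")
```

The key point is that every sentence on the Internet yields training examples for free: the text supplies both the question and the answer.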
The “data wall”
In AI research, use of Common Crawl, an archive of much of the open Internet containing 50 billion web pages, became widespread. More recent models have been supplemented with data from a growing number of other sources, such as Books3, a widely used collection of thousands of books. But the machines' appetite for text has now grown faster than the Internet can supply it. Epoch AI, a research firm, estimates that by 2028 the entire stock of high-quality textual data on the Internet will have been used. In the industry this is known as the "data wall". How to deal with it is one of the great unknowns of AI, and perhaps the one most likely to slow its progress.
One approach is to focus on the quality of the data rather than its quantity. AI labs don't simply train their models on the entire Internet. They filter and sequence the data to maximize learning. Naveen Rao of Databricks, an AI company, says this is the "main differentiator" between AI models on the market. "Truthful information" about the world is clearly important; so is plenty of "reasoning". That makes academic textbooks, for example, especially valuable. But striking the right balance between data sources remains something of an arcane art. Moreover, the order in which the system encounters different types of data matters too. Lump all the data on one topic, such as mathematics, at the end of training, and the model may specialize in mathematics but forget other concepts.
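The sequencing point can be sketched as weighted interleaving of sources. The source names and weights below are hypothetical; the real mixes used by labs are closely guarded:

```python
import random

# Hypothetical source weights; actual proportions used by AI labs are secret.
SOURCES = {"web": 0.5, "books": 0.4, "math": 0.1}

def sample_mix(n, rng=random.Random(42)):
    """Interleave sources throughout training, rather than saving one
    topic for the end, which risks the model forgetting the rest."""
    names = list(SOURCES)
    weights = [SOURCES[s] for s in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_mix(10)
```

Every batch then contains a blend of topics, so no subject dominates any single stretch of training.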
These considerations become even more complex when the data differ not only in topic but also in form. Owing partly to the lack of new textual data, leading models such as OpenAI's GPT-4o and Google's Gemini now also train on image, video and audio files during self-supervised learning. Training on video is the hardest, given the sheer density of data points in the files. Current models typically look at a subset of frames to simplify things.
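The frame-subsetting trick mentioned above amounts to even subsampling. A minimal sketch (the budget of eight frames is an arbitrary illustration, not any lab's actual setting):

```python
def subsample_frames(frames, max_frames=8):
    """Keep a small, evenly spaced subset of frames so that video
    clips do not overwhelm the training budget."""
    if len(frames) <= max_frames:
        return list(frames)
    step = len(frames) / max_frames
    return [frames[int(i * step)] for i in range(max_frames)]

# A 10-second clip at 24 fps has 240 frames; keep only 8 of them.
kept = subsample_frames(list(range(240)), max_frames=8)
```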
Whatever data the models use, ownership is increasingly recognized as a problem. Much of the material used in LLM training is protected by copyright and is used without the consent of, or payment to, the rights holders. Some of it even sits behind paywalls. The model-makers claim that such behavior falls within the "fair use" exemption of US copyright law. Reading copyrighted material, they say, should be allowed when training AI models, just as it is for humans. But as technology analyst Benedict Evans has put it, "a difference of scale" can lead to "a difference of principle."
Different rights holders are adopting different tactics. Getty Images has sued Stability AI, an image-generation company, for unauthorized use of its image bank. The New York Times is suing OpenAI and Microsoft for copyright infringement over millions of articles. Other newspapers have struck deals to license their content: News Corp, owner of The Wall Street Journal, signed one worth $250 million over five years. (The Economist has not commented on its relationship with AI companies.) Other sources of text and video are doing the same. Stack Overflow, a question-and-answer site for programmers, Reddit, a social-media site, and X (formerly Twitter) now charge for access to their content for training.
The situation differs across jurisdictions. Japan and Israel take a permissive stance to promote their AI sectors. The European Union lacks a generic "fair use" concept, so it could prove stricter. Where markets emerge, different types of data will command different prices: models will need access to timely, real-world information to stay current.
A model's capabilities can also be improved by feeding the version produced by self-supervised learning (known as the pre-trained version) additional data in a post-training phase. "Supervised fine-tuning", for example, involves feeding a model question-and-answer pairs collected or written by people. This teaches models what good answers look like. "Reinforcement learning from human feedback" (RLHF), by contrast, tells them whether an answer satisfied the questioner (a subtly different matter).
In RLHF, users give a model feedback on the quality of its output, which is then used to adjust its parameters, or "weights". User interactions with chatbots, such as a thumbs-up or thumbs-down, are especially useful here. This creates what practitioners call a "data flywheel": more users generate more data, which feeds back into fine-tuning the model. AI startups watch closely what kinds of questions users ask their models, then collect data to fine-tune them on those topics.
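In its crudest form, the flywheel turns user votes into a preference signal. A toy sketch (the answer identifiers and vote data are invented; real RLHF trains a reward model rather than tallying votes directly):

```python
from collections import defaultdict

# Toy flywheel: thumbs-up (+1) and thumbs-down (-1) votes from users
# become a reward signal steering the model toward well-rated answers.
feedback = [
    ("answer-a", +1),
    ("answer-a", +1),
    ("answer-b", -1),
    ("answer-a", -1),
]

scores = defaultdict(int)
for answer_id, vote in feedback:
    scores[answer_id] += vote

preferred = max(scores, key=scores.get)  # the answer users liked most
```

The more users interact, the more votes accumulate, and the sharper this preference signal becomes.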
Scaling up
As pre-training data on the Internet runs out, post-training is becoming more important. Labeling companies such as Scale AI and Surge AI earn hundreds of millions of dollars a year collecting post-training data; Scale recently raised $1 billion at a $14 billion valuation. Things have moved on since the days of Mechanical Turk: the best labelers earn up to $100 an hour. But while post-training helps produce better models and is sufficient for many commercial applications, it is ultimately only incremental.
Rather than slowly pushing back the data wall, another option is to leap over it entirely. One approach is to use synthetic data, which is machine-created and therefore unlimited. A good example is AlphaGo Zero, a model built by DeepMind, a Google subsidiary. The company's first successful Go-playing model was trained on data from millions of moves in amateur games. AlphaGo Zero used no pre-existing data at all. Instead, it learned Go by playing 4.9 million games against itself over three days, noting the winning strategies. That "reinforcement learning" taught it how to respond to an opponent's moves by simulating a large number of possible replies and choosing the one with the best chance of winning.
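The self-play idea can be sketched on a far simpler game than Go. The toy below uses a Nim variant and a random policy, and "notes winning strategies" by counting which moves appeared in winning games; this is a deliberately crude stand-in for AlphaGo Zero's actual training, which uses neural networks and tree search:

```python
import random

def play_selfplay_game(policy, rng):
    """One self-play game of a toy Nim variant: players alternately
    remove 1-3 stones from a pile of 10; whoever takes the last
    stone wins. Returns the winner and each player's moves."""
    stones, turn, moves = 10, 0, {0: [], 1: []}
    while stones > 0:
        move = policy(stones, rng)
        moves[turn].append((stones, move))
        stones -= move
        turn = 1 - turn
    return 1 - turn, moves  # the player who just moved took the last stone

def random_policy(stones, rng):
    return rng.randint(1, min(3, stones))

# Self-play loop: tally how often each (position, move) pair occurred
# in a winning game -- the crudest way to "note winning strategies".
wins = {}
rng = random.Random(0)
for _ in range(2000):
    winner, moves = play_selfplay_game(random_policy, rng)
    for pos, mv in moves[winner]:
        wins[(pos, mv)] = wins.get((pos, mv), 0) + 1
```

A real system would feed these statistics back into an improved policy and repeat, so the data it learns from is entirely self-generated.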
A similar approach could be used to have an LLM write, say, a mathematical proof step by step. The LLM might construct an answer by first generating many candidate first steps. A separate "helper" AI, trained on data from human experts to judge quality, would identify which candidate is best and worth building on. That AI-produced feedback is a form of synthetic data, and it can be used to train the first model further. Eventually you might get a higher-quality answer than if the LLM had responded all at once, and a much-improved LLM besides. This ability to improve output quality by taking more time to think resembles the slow, deliberative "system 2" thinking in humans, described in a recent talk by Andrej Karpathy, a co-founder of OpenAI. Today's LLMs employ "system 1" thinking, generating a response without deliberation, much as a person answers reflexively.
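The generate-then-judge pattern is sometimes called best-of-n selection, and its skeleton is short. In this sketch the "proposals" are just random numbers and the helper's scoring function is trivial; both are placeholders for a real generator model and a real learned judge:

```python
import random

def propose_steps(prompt, n, rng):
    """Stand-in for an LLM proposing n candidate next steps; here the
    'steps' are just random scores, purely for illustration."""
    return [rng.random() for _ in range(n)]

def helper_score(step):
    """Stand-in for a 'helper' model trained on expert data to judge
    step quality."""
    return step  # pretend a higher value means a more promising step

def best_of_n(prompt, n=8, seed=1):
    """Generate n candidates and keep the one the helper rates best."""
    rng = random.Random(seed)
    candidates = propose_steps(prompt, n, rng)
    return max(candidates, key=helper_score)

best_step = best_of_n("prove the claim step by step", n=8)
```

The selected step, paired with the helper's verdict, is exactly the kind of synthetic training signal the article describes.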
The difficulty lies in extending this approach to settings such as health care or education. In games, winning is clearly defined, and it is easy to collect data on whether a move is advantageous. Elsewhere it is trickier. Data on what counts as a "good" decision are usually gathered from experts. But that is expensive and time-consuming, and only a partial solution. And how do you know whether a given expert is right?
It is clear that access to more data – drawn from specialized sources, generated synthetically or supplied by human experts – is key to maintaining AI's rapid progress. Like oilfields, the most accessible reserves of data have been depleted. The challenge now is to find new ones, or sustainable alternatives.
© 2024 The Economist Newspaper Limited. All rights reserved. Translation: Juan Gabriel López Guix