From birth, babies begin to receive visual and auditory stimuli, which are essential for learning something fundamental in their lives: language. Between six and nine months, they begin to talk, associating sounds with real-world objects and concepts. By the age of two, they usually have a vocabulary of approximately 300 words. But how does this learning process unfold? A team of researchers from New York University studied recordings of a child's daily life during his early years to find the answer. The experiment not only confirmed the connection between visual and linguistic representation (that is, between what is seen and the word that corresponds to it), but also contributed to the development of an artificial intelligence (AI) model that has managed to recognize different objects in a way similar to how children do.
“Large AI systems are trained and powered by an astronomical amount of data. We are talking about billions of words to be able to develop a language system,” explains Wai Keen Vong, who holds a doctorate in psychology and computer science and coordinated the study, published this Thursday in the journal Science. “However, humans need only a few thousand words to achieve an efficient communication system,” he adds. This contrast sparked the interest in investigating whether an AI would be capable of learning to talk in the same way children do: by observing their environment, listening to the people around them and connecting the dots between what they see and hear.
Early language acquisition is a widely debated topic, for which several hypotheses have been proposed. Traditionally, these types of studies have been conducted in controlled laboratory settings, producing findings that often do not extrapolate well to more dynamic and varied real-world contexts. “The novelty of this analysis lies in the fact that we were able to work with first-hand data, derived from a real learning situation,” emphasizes Vong.
To this end, Vong's team analyzed 61 hours of the life of Sam, an Australian boy who for a year and a half (from six to 25 months of age) wore a helmet with a camera that recorded his daily interactions with his parents and grandparents. In reality, the camera captured only about 1% of his waking hours over the course of the experiment. Even so, it yielded hundreds of images that reproduce exactly what the child was seeing, accompanied by the linguistic expressions of his family members, who described the objects around him. “For example, during mealtime, the camera on his head recorded the image of a spoon at the same time that his mother asked him something related to that utensil. And so on, with dozens of everyday objects,” explains Vong.
The connection between these two streams is almost never obvious. In fact, the researcher acknowledges that part of the challenge for babies is to figure out exactly which word is associated with the object they are interacting with. “Most of the time, parents are not labeling every object. For every ball Sam was looking at, his parents didn't tell him 'this is a ball' or 'look at the ball'. He heard the words in a natural context, and the difficulty is precisely to work out, within a more or less long sentence, which word corresponds to the round object he was playing with,” Vong points out.
Training an AI like a baby
After observing the child's behavior, the researchers were able to confirm that he learned the meaning of words by connecting the visual stimulus (that is, the image presented to him) with the response of his family members, who repeated the corresponding word. With these results, they moved on to the second phase of the experiment: verifying whether an AI could learn to recognize objects in the same way that Sam did.
The artificial intelligence model, called CVCL (Child's View for Contrastive Learning), was trained with 64 visual categories (utensils, toys and animals, among others) and the transcription of what Sam was hearing while looking at these objects. Once this database was created, the researchers began testing whether the AI was capable of identifying the images. According to Vong, the model, with limited sensory information and relatively generic learning mechanisms, provides a computational basis for investigating how children acquire their first words and how those words can connect to the visual world.
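The paper describes CVCL as a contrastive model that pairs video frames with the transcribed speech heard at the same moment. As a rough illustration only, the following minimal sketch shows how such contrastive image-text training is typically set up; it assumes a generic PyTorch implementation, and all module names, dimensions and hyperparameters are illustrative assumptions rather than the authors' actual code.

```python
# Illustrative sketch (not the authors' code): contrastive image-text training
# in the spirit of CVCL, pairing frame features with co-occurring utterances.
# Module names, dimensions and hyperparameters are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveImageText(nn.Module):
    def __init__(self, image_dim=512, text_vocab=1000, embed_dim=128):
        super().__init__()
        # Stand-ins for a vision encoder and an utterance encoder.
        self.image_proj = nn.Linear(image_dim, embed_dim)          # projects frame features
        self.text_embed = nn.EmbeddingBag(text_vocab, embed_dim)   # averages word embeddings
        self.logit_scale = nn.Parameter(torch.tensor(2.0))         # learnable temperature

    def forward(self, image_feats, token_ids):
        # L2-normalized embeddings for frames and utterances.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_embed(token_ids), dim=-1)
        # Similarity matrix: each frame against every utterance in the batch.
        logits = self.logit_scale.exp() * img @ txt.t()
        # Matching frame/utterance pairs sit on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE loss: pull true pairs together, push mismatches apart.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 frame-feature vectors paired with 8 short utterances.
model = ContrastiveImageText()
frames = torch.randn(8, 512)                 # precomputed frame features
utterances = torch.randint(0, 1000, (8, 6))  # token ids of co-occurring speech
print(model(frames, utterances).item())
```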
“We found that CVCL can learn to make connections between images and text from limited fragments of a single child's experience,” the authors highlight in the study. In some cases, the objects appeared against a white background, while in others they appeared in an environment with more stimuli. In fact, the model's classification accuracy was 61.6%, and it remained high even when images from outside Sam's recordings, on which the AI had not been trained, were fed into the system. “The results confirm our hypothesis that with only two inputs, what the child sees and what the child hears, it is possible to achieve and accelerate this type of learning,” highlights Vong.
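To give a sense of how such an identification test can work in general, here is a minimal, hypothetical sketch: a frame's embedding is compared against the embeddings of candidate category words, and the closest one is chosen. The function and the random vectors below are purely illustrative assumptions, not the study's evaluation code.

```python
# Illustrative sketch (assumed, not the study's evaluation code): classifying a
# frame by comparing its embedding with embeddings of category words
# (e.g. "ball", "spoon") and picking the most similar one.

import torch
import torch.nn.functional as F

def classify(image_embedding, category_embeddings, category_names):
    # Cosine similarity between the normalized frame embedding and each
    # normalized category-word embedding; the highest score wins.
    sims = F.normalize(image_embedding, dim=-1) @ F.normalize(category_embeddings, dim=-1).t()
    return category_names[sims.argmax().item()]

# Toy usage with random vectors standing in for learned embeddings.
names = ["ball", "spoon", "cat"]
img = torch.randn(1, 128)
cats = torch.randn(3, 128)
print(classify(img, cats, names))
```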
Studying how speech is born
Antonio Rodríguez Fornells, a researcher at the Institute of Neurosciences of the University of Barcelona, points out the novel aspect of the study, which opens the way to understanding, through computational simulations, what minimal learning mechanisms children use to face the challenge of learning a language: “Previous studies on babies in developmental psychology provide key information with very novel experiments, but the lack of neuroscience or neuroimaging studies on them (due to the difficulty of applying these techniques to babies) does not allow as much progress in clarifying the brain mechanisms that support these language acquisition processes,” explains the neuroscientist.
Furthermore, he acknowledges that the simulations proposed in the article support certain previously proposed theories of language. “Among them, that simple associative learning mechanisms (which allow images and words to be linked), in a natural learning environment (such as the one children experience from birth and in the first months of their lives), are enough to learn these relationships and generalize their meaning,” adds Rodríguez Fornells.
Even so, the study has some limitations. The CVCL model was trained with recordings from a single head-mounted camera of a single child, and learned through speech transcriptions rather than direct speech, which omits important nuances such as intonation and emphasis. “It must also be remembered that the learning of the model was passive, based on recordings, without active interaction with the environment, which is different from how children learn in real environments,” the authors of the research acknowledge.