We ask a lot of ourselves when we are babies. Somehow we must go from sensory blobs to mobile, rational and attentive communicators in just a few years. There has been much debate about how babies achieve this. Some scientists have argued that most of our language acquisition can be explained by associative learning, since we associate sounds with sensations, much like dogs associate the ringing of a bell with food. Others say that there are features built into the human mind that have shaped the forms of all languages. Still others say that young children build their understanding of new words on top of their understanding of other words.

This debate advanced recently as Tammy Kwan and Brenden Lake fed blackberries from a bowl to their almost 2-year-old daughter, Luna. A camera was attached to her hat.

“Babooga,” she said, pointing to the berries. Kwan gave her the rest and a light on the camera blinked.

For an hour each week for the past year, Lake, a psychologist at New York University whose research focuses on human and artificial intelligence, has been attaching a camera to Luna and recording things from her point of view as she plays. His goal is to use the videos to train a language model using the same sensory information that a young child is exposed to. He hopes to create better tools to understand both AI and ourselves.

“We see this research as finally establishing that link between those two areas of study,” Lake said.

There are many challenges in using AI models to understand the mind. The two are markedly different. Modern language and multimodal models, such as OpenAI’s GPT-4 and Google’s Gemini, are based on neural networks with little built-in structure and have improved as a result of greater computational power and larger training data sets.

These models can analyze pixels in images, but they can’t taste berries or feel hunger, both important kinds of learning experiences for children.

Researchers can do their best to turn a child’s entire sensory stream into code, but crucial aspects of the child’s experience will inevitably be missed.

“What we’re seeing is just the residue of an active learner,” said Michael Frank, a psychologist at Stanford University in California who has been trying to capture the human experience on camera. His lab is working with more than 25 children across the US, including Luna, to record their experiences.

Humans are not simple data receptacles, as neural networks are, but intentional animals. Everything we see, every object we touch, every word we hear is combined with the beliefs and desires we have at the moment.

“There’s a deep relationship between what you’re trying to learn and the data that’s coming in,” said Linda Smith, a psychologist at Indiana University. “These models simply predict. They take what is given to them and take the next best step.”

In February, Lake and his collaborators created the first AI model trained on a child’s experiences. The study was published in Science, and from 60 hours of video the model was able to match different moments with words. Type “car” and the model will display a first-person video of the child sitting in a car seat.

For Lake and others, two intertwined questions (How human can we make AI? What makes us human?) present the most interesting research. Trying to answer the first question, by modeling social interactions, intentions and biases and by collecting video from a head-mounted camera, brings researchers closer to answering the second.

“If the field can get to the point where models are trained solely on data that a single child saw, and perform well on a large set of tasks, that would be a huge scientific achievement,” Lake said.