2023-12-09

For several centuries, people have struggled with a very difficult task: learning to control machines with their voices. To do this, they had to find a way for soulless mechanisms to parse a person’s words. The first technologically significant experiments of the mid-20th century now look frankly ridiculous, but with the development of artificial intelligence, voice recognition has made a colossal leap. Lenta.ru, together with the online cinema Okko, explains how this technology came about, how it works and how it will change the future of humanity.

No longer fantasy

Just a couple of decades ago, when the first smartphones were still being designed, it was impossible to imagine that people would talk to their belongings. Users who were told that they would one day ask their phone to plot a route home or clarify the details of a historical event would have tapped a finger to their temple.

But today, artificial intelligence that understands people and can maintain a dialogue with them is everywhere. It answers calls from spammers, helps buy airline tickets and obtain government services, transcribes medical reports into text, and can tell whether the person on the other end of the line is really who they claim to be.

Moreover, you can already clone your voice (which is often creepy), watch videos in unfamiliar languages with a fairly accurate translation, and communicate with people on the streets of foreign cities by simply speaking your question into an application. A generation has already grown up for whom the scenes with R2-D2 and C-3PO in Star Wars are not fantasy at all, but everyday life. Humanity has been moving towards this moment for quite a long time. The latest active phase of research has been under way for more than 70 years.

Scary experiments

Before trying to teach machines to understand voice, prominent minds decided that the devices themselves should first speak. The first such experiments were carried out by the German scientist Christian Kratzenstein, who worked in Russia, in the second half of the 18th century. He became interested in sound wave physics through his friendship with the great mathematician Leonhard Euler. For his invention, which could “pronounce” several vowel sounds after physical impact, Kratzenstein received a prize from the St. Petersburg Academy of Sciences.

The following decades were filled with the most bizarre experiments. Scientists tried to make various mechanical devices speak by forcing air into them in a certain way and at various intervals. Some of these devices were downright sinister, such as a disembodied artificial female head that could imitate human speech and “breathe.” Its creator, Joseph Faber, failed to achieve success, became obsessed with his brainchild and went mad.

Everything changed in the 20th century. Machines that made sounds were no longer a curiosity, and purely mechanical designs gave way to electronic ones. Some, such as the Voder synthesizer developed by Bell Laboratories, could not only produce all kinds of sounds but even sing.

After this, humanity moved on to the next stage. In 1952 the same corporation introduced the first device that recognized digits spoken by a person. It was called Audrey. The mechanism compared incoming sound signals with pre-recorded samples, relying on a kind of map of the intonations, stresses and tones that had been used to create the reference. Audrey had none of the practical abilities the world needed, but its contribution to the development of speech recognition was decisive, including technologically: it was thanks to Audrey that people realized a device should be able to compare the sound stream with as many built-in templates as possible.
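The template comparison described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not a reconstruction of how Audrey actually worked: the feature vectors, the `TEMPLATES` table and the `recognize` function are all hypothetical, and the numbers are invented.

```python
import math

# Hypothetical reference templates: one short feature vector per spoken digit.
# A real system would derive such vectors from audio; these values are made up.
TEMPLATES = {
    "one": [0.9, 0.1, 0.3],
    "two": [0.2, 0.8, 0.5],
    "three": [0.4, 0.4, 0.9],
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, templates=TEMPLATES):
    """Return the label of the stored template closest to the input."""
    return min(templates, key=lambda label: distance(features, templates[label]))


# An input that lies close to the stored "two" template.
print(recognize([0.25, 0.75, 0.45]))
```

The more templates such a table holds, the more inputs it can tell apart, which is the lesson the article attributes to Audrey.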

Smart shoe box

Only 16 words, but what huge progress! In the 1960s, the IBM Shoebox already understood 16 words and could perform basic arithmetic at a person’s command. Within 10 years, the Harpy machine was invented in the USA; it knew more than a thousand words. Although at a primitive level, it could divide the incoming sound signal into individual phonemes and then compare them with template values.

Fifteen years later, humanity learned to create text documents by voice. The same giant, IBM, created the Tangora typewriter, which could recognize 20 thousand words and several sentences. It could decide on its own whether the combination of sounds it heard was a full word or only part of one.

At this point, the technology had taken its final shape: voice recognition based on comparison with established samples went a step further, to the subsequent conversion of sound into text.

Then came the era of the widespread internet, personal computers and ever larger memory, and this turned everything upside down. In the second half of the 1990s, the American company Dragon Systems introduced the Dragon NaturallySpeaking program, which could recognize a continuous stream of speech and convert it into text. The only condition was that the speaker keep to no more than 100 words per minute.

Smart voice assistants

Apple and Google entered the smartphone era with voice technology more confidently than anyone else. By the end of the 2010s, devices could not only understand what was asked of them and answer questions, but also call on other services to do so, for example, to show the location of the nearest restaurant by querying GPS services. Thanks to cloud technologies, voice recordings no longer had to be stored on the device itself, and machine learning coupled with artificial intelligence made it possible to improve speech services by training them on huge amounts of data.

“Over the past 60 years, people have adapted to computers. Over the next 60 years, computers will adapt to us. It is our voices that will show them the way. This will be a revolution that will change everything,” said Brian Roemmele, founder and editor-in-chief of Multiplex Magazine.

In large companies, calls are answered by bots, which understand, although not always perfectly, what the person on the other end of the line wants. Even top state officials are not shy about talking to robots, to say nothing of ordinary Russians, who are helped to obtain government services by artificial intelligence in the form of androids.

How voice recognition works

The technology always rests on the same principle: the voice heard by the device must be converted into text. It is this text that the car navigator, the digital assistant or the smart home system then works with. After 70 years of development, commands now take milliseconds to decipher, because behind any such technology there is a smart neural network.

Put very simply, the engineers building an application give the still young and inexperienced network both a voice recording and a specially annotated text of what that voice says. This lets the artificial intelligence learn how probable it is that a specific letter was pronounced. After this, letters are assembled into words, and words into sentences, also on the basis of probabilities. The well-trained neural networks behind any major application can even recognize punctuation marks from intonation.
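As a rough illustration of assembling letters into words by probability, here is a minimal Python sketch. It assumes the network has already produced, for each position in the word, a probability for every letter; the `STEP_PROBABILITIES` table, the candidate word list and both function names are hypothetical, and real recognizers are considerably more elaborate.

```python
# Hypothetical network output: for each position, a probability per letter.
# The values are invented for illustration only.
STEP_PROBABILITIES = [
    {"c": 0.7, "k": 0.2, "b": 0.1},
    {"a": 0.6, "o": 0.3, "u": 0.1},
    {"t": 0.5, "d": 0.3, "p": 0.2},
]


def word_probability(word, steps):
    """Multiply the per-position probabilities of the word's letters."""
    if len(word) != len(steps):
        return 0.0
    probability = 1.0
    for letter, distribution in zip(word, steps):
        # A letter the network never proposed contributes probability 0.
        probability *= distribution.get(letter, 0.0)
    return probability


def best_word(candidates, steps):
    """Pick the candidate word that the letter probabilities support most."""
    return max(candidates, key=lambda word: word_probability(word, steps))


print(best_word(["cat", "cot", "bat", "kid"], STEP_PROBABILITIES))
```

The same scoring idea can be repeated one level up, choosing the most probable sequence of words to form a sentence.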

“One day there will come a time when algorithms will understand not only what is said, but also how exactly it is said. The intonation that gives meaning to the spoken word will become part of the comprehension process. This can be used to determine the speaker’s mood, whether he is in trouble, or how strongly or weakly he believes in what he is saying,” predicted Chris Kirby, vice president of Voices, in 2017, not yet knowing that his predictions would come true in just a few years.

The more successful the application, the longer and more carefully its training data sets were collected. They may contain outlandish and rare words and expressions, pronounced with the most unimaginable intonations and accents. Moreover, learning does not stop even after the product is released. The application adapts to what it hears, analyzes all its conversations with the owner of the device and becomes smarter literally before our eyes.

“Artificial intelligence can analyze the speaker’s intentions, since neural network algorithms and applications trained on large amounts of data can quickly determine what exactly the client means based on cause-and-effect relationships,” said Alexander Khazaridi, expert from CG “Polylogue”.

But, in essence, this is still the same comparison with templates. The soundness of the method was recognized back in the 1950s. Only the base of templates itself has grown to volumes that the human brain cannot imagine.

On the verge of total change

Voice recognition technology remains a focus of attention for experts and businesses. According to Market Research, the global market for speech and voice recognition technologies amounted to $9.4 billion in 2022. In the coming years, it is expected to grow by an average of 24.4 percent annually. Such figures are characteristic of the most promising technologies.
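To get a feel for what such a growth rate means, the figures can be compounded year by year. The Python sketch below assumes, purely for illustration, that the 24.4 percent average applies as a constant rate every year; the function name `project_market` is hypothetical, and the results are arithmetic on the article’s two numbers, not a forecast from the cited research.

```python
def project_market(start_value, annual_growth, years):
    """Compound a starting value by a fixed annual growth rate."""
    return start_value * (1 + annual_growth) ** years


# Figures from the article: $9.4 billion in 2022, 24.4 percent per year.
# Assumption: the average rate holds unchanged in every year shown.
for offset in range(4):
    value = project_market(9.4, 0.244, offset)
    print(f"{2022 + offset}: ${value:.1f} billion")
```

At this pace the market roughly doubles in a little over three years.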

This is because corporations develop IT in order to outcompete one another. To do that, they need to give people the most advanced, cheapest and easiest-to-use technologies possible. Experts are therefore expecting a new surge in everything related to voice analytics.

“In the future, virtual assistants will dominate our daily lives as voice will help us communicate with our home appliances, right down to our kitchen appliances. We will also see a rapid increase in the number of voice-activated devices that will control our workplaces,” says Rishi Khanna, head of the software development company ISHIR.

A little longer, and voice technologies will fully penetrate smart homes, experts predict. Most likely, within our lifetime it will become possible to dictate a shopping list to your refrigerator, which will send it to the nearest store. And the smart home itself will be able to adjust lighting, temperature and music, guessing the resident’s mood from their voice.

Digital assistants are currently used mainly to search for information, but in the future they will adapt socially. Voice assistants will be able to understand a person’s individual characteristics, gauge their emotional state and offer comfort and support or help manage stress.

Cross-border communication will reach a point where people from different countries no longer have to talk through translation applications. Sooner or later, artificial intelligence will be able to act as a simultaneous interpreter in everyday conversation, relaying the other person’s words in an understandable language directly into the earphone in real time.

In addition, voice control will help in the development of virtual and augmented reality. Users will be able to control virtual worlds with speech commands and to interact with other people in them more naturally.

But humanity stands to gain most from the way voice recognition merges with technologies and inventions it does not even know about yet. After all, just 30 years ago no one could have imagined the emergence of intelligent voice assistants.

